Re: [xsl] RSS feeds and disable-output-escaping="yes"

Subject: Re: [xsl] RSS feeds and disable-output-escaping="yes"
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Fri, 06 May 2005 11:36:12 -0400
Don,

The scenario you describe is precisely *why* pseudo-markup hidden inside markup isn't considered a Good Thing. If it were honest XML, not escaped, then transforming it would be routine. Since it isn't, you have to find a way (XSLT or not) to convert the pseudo-markup into real markup (for example by unescaping it, or coding parser logic) before you can handle it as real markup. This process is difficult and error-prone; it also requires that you have a strategy for handling pseudo-markup that, when treated as markup, turns out not to be well-formed.

So say you have

<p>Here's my HTML &lt;b&gt;bold pseudo markup!&lt;/b&gt; Except it's gnarly, since I'm writing about XML entities and in particular the hazards of the '&amp;' character.</p> (Good luck reading this if your mailer "supports" markup!)

Running this through disable-output-escaping (one of the easier ways to unescape characters), the content of that paragraph comes out as

"Here's my HTML <b>bold pseudo markup!</b> Except it's gnarly, since I'm writing about XML entities and in particular the hazards of the '&' character."

... which is fine as far as the HTML bold goes, but not as far as the '&' character is concerned (it still needs to be escaped).
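To make the failure concrete, here is a minimal sketch in Python rather than XSLT. It assumes that a single pass of `html.unescape` is a fair stand-in for what disable-output-escaping does to the text, and it uses a shortened version of the paragraph above; `is_well_formed` is a helper invented for the illustration.

```python
import html
import xml.etree.ElementTree as ET

# The paragraph content as it sits in the source (shortened for the example).
escaped = ("Here's my HTML &lt;b&gt;bold pseudo markup!&lt;/b&gt; "
           "and the hazards of the '&amp;' character.")

# One round of unescaping: every entity reference is resolved exactly once.
unescaped = html.unescape(escaped)


def is_well_formed(fragment):
    """Return True if the fragment parses as XML once wrapped in a <p>."""
    try:
        ET.fromstring("<p>" + fragment + "</p>")
        return True
    except ET.ParseError:
        return False


print(unescaped)
print("escaped form well-formed:  ", is_well_formed(escaped))
print("unescaped form well-formed:", is_well_formed(unescaped))
```

The bold element survives the round trip, but the bare '&' makes the unescaped fragment fail a well-formedness check.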

If you have a well-bounded and well-defined (constrained) set of input documents, these kinds of problems are often tractable -- for example, a parsing pass over the output would locate this problem so you could fix it. (Unfortunately such fixes, generally speaking, have to be done by hand, since the machine can't tell which escaped stuff should be unescaped and which not.) If your input is not well-bounded or well-defined (which, alas, characterizes RSS almost by definition) -- all bets are off.
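The "parsing pass" mentioned above might look something like the following Python sketch. It assumes the descriptions have already been pulled out of the feed as strings; `find_broken` and the sample `feed` list are made up for the illustration. It only locates the problems; deciding how to fix each one is still a manual job.

```python
import html
import xml.etree.ElementTree as ET


def find_broken(descriptions):
    """Unescape each description once and report those that fail to parse.

    Returns a list of (index, line, column, message) tuples. Positions are
    relative to the wrapped fragment, so columns on the first line are
    shifted by the length of the "<div>" wrapper.
    """
    problems = []
    for index, text in enumerate(descriptions):
        candidate = "<div>" + html.unescape(text) + "</div>"
        try:
            ET.fromstring(candidate)
        except ET.ParseError as err:
            line, column = err.position
            problems.append((index, line, column, str(err)))
    return problems


# Hypothetical feed content: one clean item and two troublemakers.
feed = [
    "Plain &lt;b&gt;bold&lt;/b&gt; text",
    "Tom &amp; Jerry",             # unescapes to a bare '&'
    "An unclosed &lt;br&gt; tag",  # fine as HTML, not well-formed as XML
]

report = find_broken(feed)
for index, line, column, message in report:
    print("item %d: %s" % (index, message))
```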

Accordingly, I'd try this only if I could fix something upstream that would assure me that pseudo-markup would always be "clean" (which includes double-escaping stuff that I actually want to be escaped) -- and if I had that kind of control, I might see if I couldn't get real XML (some of which could be escaped again for consumers that perversely expect pseudo-markup ;-), in which case the problem simply goes away.
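As a rough picture of what "clean" pseudo-markup with double-escaping could mean, here is a Python sketch. The helper `make_clean_pseudo_markup` is hypothetical and stands in for whatever upstream system produces the feed; the point is only that text is escaped once to become well-formed markup, and then the whole fragment is escaped again to become pseudo-markup.

```python
import html
import xml.etree.ElementTree as ET


def make_clean_pseudo_markup(text_before, bold_text, text_after):
    """Hypothetical upstream step producing 'clean' pseudo-markup."""
    # First escape: turn raw text into well-formed markup content.
    fragment = (html.escape(text_before, quote=False)
                + "<b>" + html.escape(bold_text, quote=False) + "</b>"
                + html.escape(text_after, quote=False))
    # Second escape: hide the whole fragment as pseudo-markup.
    return html.escape(fragment, quote=False)


pseudo = make_clean_pseudo_markup(
    "Here's my HTML ", "bold pseudo markup!",
    " and the hazards of the '&' character.")

# Downstream, one round of unescaping now yields well-formed markup,
# because the literal ampersand arrives as '&amp;amp;' and becomes '&amp;'.
once = html.unescape(pseudo)
root = ET.fromstring("<p>" + once + "</p>")
print(pseudo)
print("".join(root.itertext()))
```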

If I didn't have that control, pre-processing would probably be the next choice, along with a strategy for handling anything that came through broken. Or bringing in a vendor extension (Saxon has one) might provide a way around it (depending on how it deals with well-formedness errors).
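For the pre-processing route, a sketch along these lines might serve as a starting point. It is Python using only the standard library's `html.parser`; the class `ImgLimiter`, the function `preprocess`, the 200-pixel limit, and the sample description are all assumptions made for the illustration. It unescapes a description once, re-emits the tags with text and attribute values properly escaped, and caps the width of any img element. It does not balance unclosed tags, so a well-formedness check on the result is still advisable before handing it to a stylesheet.

```python
import html
from html.parser import HTMLParser


class ImgLimiter(HTMLParser):
    """Re-serialize an HTML fragment, capping the width of <img> tags."""

    VOID = {"img", "br", "hr"}  # elements emitted in self-closing form

    def __init__(self, max_width):
        super().__init__(convert_charrefs=True)
        self.max_width = max_width
        self.out = []

    def _attrs(self, tag, attrs):
        parts = []
        for name, value in attrs:
            if tag == "img" and name == "width" and value and value.isdigit():
                value = str(min(int(value), self.max_width))
            # Valueless attributes are written with an empty value.
            parts.append(' %s="%s"' % (name, html.escape(value or "", quote=True)))
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        end = " /" if tag in self.VOID else ""
        self.out.append("<%s%s%s>" % (tag, self._attrs(tag, attrs), end))

    def handle_startendtag(self, tag, attrs):
        self.out.append("<%s%s />" % (tag, self._attrs(tag, attrs)))

    def handle_endtag(self, tag):
        if tag not in self.VOID:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Text arrives with references resolved; escape it again for output.
        self.out.append(html.escape(data, quote=False))


def preprocess(escaped_description, max_width=200):
    """Unescape once, then rewrite the fragment with capped image widths."""
    parser = ImgLimiter(max_width)
    parser.feed(html.unescape(escaped_description))
    parser.close()
    return "".join(parser.out)


sample = ('See &lt;img src="opac.png" width="640"&gt; '
          'for Search &amp;amp; Browse')
print(preprocess(sample))
```

Removing images altogether instead of shrinking them would be a matter of skipping the img tag in the two start-tag handlers.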

The whole pseudo-markup thing really troubles me. I understand why it's done, and I can imagine there may be use-cases in which it might be the best available alternative (though I'm not sure I've ever seen one). Things always get tricky when a language has to describe its own constructs. (While Bill had had "had", Mike had had "had had". Had the language had "had had had", we'd really be in trouble.) But mixing tag sets in an instance? I thought that's what namespaces were for.

Cheers,
Wendell

Greetings,
I have set up a Drupal sub-site and would like the RSS feed from the
site to be displayed as a 'what's new' panel on our main page.

Easy little XSL script I thought. The RSS feed has all the html in the
'description' tag escaped. I have used disable-output-escaping="yes" to
display the html, but I really need to be able to manipulate some of the
tags - the img tags in particular - I'd like to either remove or reduce
the width of the images (it is mostly user documentation for the
WebOPAC).

Is there any way I can do this, or do I need to pre-process the RSS feed
before I feed it into the XSL transformer thingy?


======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
