Re: [xsl] Convert escaped XML content to a treefrag, and then to a nodeset

Subject: Re: [xsl] Convert escaped XML content to a treefrag, and then to a nodeset
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Fri, 25 Jul 2003 13:00:38 -0400
David,

At 08:11 PM 7/24/2003, you wrote:
> Continuing on an earlier problem, I have an XML file that has an element
> which will have "escaped" XML content.  David Carlisle helped me
> discover the "disable-output-escaping" attribute of "xsl:value-of",
> which gives me a valid tree fragment.

Actually this is not the case. It writes a *string* to output -- the same as it does otherwise. Only the usual escaping of nettlesome characters like "<" and "&" (which would break well-formedness if output raw) into their entity-reference forms &_lt; and &_amp; (underscores added), is disabled.
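
The same distinction can be seen outside XSLT. This is only a rough sketch in Python: the sample string and variable names are made up, and `xml.sax.saxutils.escape` merely stands in for the escaping a serializer normally applies. Either way what gets written is a string; the only difference is whether the escaping happens.

```python
from xml.sax.saxutils import escape

# The element's string value as the stylesheet sees it after the parse:
# plain text that happens to look like markup (hypothetical sample).
text = "<p>Fish &amp; chips</p>"

# Normal serialization: "<", ">" and "&" are escaped on the way out.
escaped = escape(text)

# With output escaping disabled: the very same string, written raw.
raw = text

print(escaped)  # &lt;p&gt;Fish &amp;amp; chips&lt;/p&gt;
print(raw)      # <p>Fish &amp; chips</p>
```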


Consequently, what you get in your output is a "valid tree fragment" only in the sense that it may *turn out to be*, in the serialized output form, XML, which could *then* be parsed into an XML tree.

If your string is something like:

"Here's my XML string with a < character. (Not!)"

when it's serialized in this form, it creates output that is *not* a well-formed XML snippet and cannot be parsed by an XML parser. (My string says 'not' since it's lying when it says it's XML.)
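
A quick way to check whether such a string would survive a parse is to hand it to any XML parser. Here is a minimal sketch in Python, assuming the string is meant as element content; the helper name `is_well_formed_fragment`, the `<wrapper>` element, and the second sample string are all invented for the illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed_fragment(s: str) -> bool:
    """Return True if s parses as the content of a wrapper element."""
    try:
        ET.fromstring("<wrapper>" + s + "</wrapper>")
        return True
    except ET.ParseError:
        return False

bad = "Here's my XML string with a < character. (Not!)"
good = "Here's my XML string with a <b>bold</b> word."

print(is_well_formed_fragment(bad))   # False
print(is_well_formed_fragment(good))  # True
```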

This is only one of many reasons that the d-o-e feature is Not Recommended. It puts you at risk of getting unparseable output if you had any garbage in your input.

> Now, I need to convert that tree fragment to a nodeset so I can operate
> on it.

Okay....


> I noticed the "xalan:nodeset" (and "exslt:node-set") function.  I assume
> this takes a tree fragment and returns a nodeset.

Yes, but it takes a "result tree fragment" (see the FAQ or the XSLT Rec on what this is), not a string, so it's not useful to you.


Using Xalan, you may be hosed -- unless you care to pipeline your XML through another parse:

parse/transform #1 -- disables output escaping on "XML" strings, so they are written raw in the serialized result
parse/transform #2 -- does whatever processing you want to do

If your XML doesn't break between these two steps (as it will if you fed malformed strings to process #1), process #2 will see a real tree there.
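
The shape of that two-pass pipeline can be sketched without any XSLT at all. The following Python is an illustration only: the `<item>`/`<body>` document is hypothetical, and the string concatenation in pass 1 stands in for a transform that writes the text with output escaping disabled.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: a <body> element holding escaped markup.
source = (
    "<item><body>"
    "&lt;p&gt;Hello, &lt;em&gt;world&lt;/em&gt;&lt;/p&gt;"
    "</body></item>"
)

# Pass 1: parse, then serialize with the body text written raw.
doc1 = ET.fromstring(source)
body_text = doc1.find("body").text  # a string, not nodes
serialized = "<item><body>" + body_text + "</body></item>"

# Pass 2: a fresh parse of the serialized result. This raises
# ET.ParseError if the embedded string was not well-formed.
doc2 = ET.fromstring(serialized)
em = doc2.find("body/p/em")
print(em.text)
```

In the first tree there is no `p` element to select, only text; it exists as a node only after the second parse.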

Saxon has an extension function that lets you feed such strings to a parser from within the transform. That's what you need if you want to do this in one pass.

(But what you really need is real XML input: that won't be subject to breakage. Since you're getting pseudo-XML input you can't tell whether it's actually going to work until you try it. You may need a tidier in your pipeline.)

It's bad design, as the RSS folks are discovering, to mix escaped markup into XML and expect it to be picked up later, since it muddies the divide between XML-as-character-string and XML-as-model. Although some designers think it's a feature that the embedded "markup" doesn't have to be well-formed, this is a trap: they are just putting themselves back in the world where they have to wage an endless campaign to clean up the bad markup they asked for. Like saying they want to live in a Japanese house since it's cleaner, but then letting people wear shoes anyhow. (And complaining "I thought this house was supposed to stay clean!")

It's just easier to ensure the stuff is well-formed at the point of creation than to rely on markup-escaping to try to sneak it in as markup later. And there are tools that can help with this. Unfortunately, if you're hired as the guy to come in and clean the house after the party, you may not be in a position to make no-shoes rules.

Good luck,
Wendell



======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


