David,
At 08:11 PM 7/24/2003, you wrote:
> Continuing on an earlier problem, I have an XML file that has an element
> which will have "escaped" XML content. David Carlisle helped me
> discover the "disable-output-escaping" attribute of "xsl:value-of",
> which gives me a valid tree fragment.
Actually this is not the case. It writes a *string* to output -- the same
as it does otherwise. Only the usual escaping of nettlesome characters like
"<" and "&" (which would break well-formedness if output raw) into their
entity-reference forms &_lt; and &_amp; (underscores added), is disabled.
Consequently, what you get in your output is a "valid tree fragment" only
in the sense that it may *turn out to be*, in the serialized output form,
XML, which could *then* be parsed into an XML tree.
If your string is something like:
"Here's my XML string with a < character. (Not!)"
when it's serialized in this form, it creates output that is *not* a
well-formed XML snippet and cannot be parsed by an XML parser. (My string
says 'not' since it's lying when it says it's XML.)
This is only one of many reasons that the d-o-e feature is Not Recommended.
It puts you at risk of getting unparseable output if you had any garbage in
your input.
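To make the string-versus-tree point concrete, here is a minimal sketch in Python (standard library only; it stands in for the XSLT processor and is not XSLT itself). The sample document and the helper name parses_as_xml are made up for illustration. It takes two escaped strings and checks whether each one, written out raw, would parse as XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: two elements whose content is escaped "markup".
SOURCE = """<doc>
  <good>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;</good>
  <bad>Here's my XML string with a &lt; character. (Not!)</bad>
</doc>"""


def parses_as_xml(text):
    """Return True if text, wrapped in a dummy root, is well-formed XML."""
    try:
        ET.fromstring("<wrapper>" + text + "</wrapper>")
        return True
    except ET.ParseError:
        return False


root = ET.fromstring(SOURCE)
# In the parsed tree both values are plain strings, not nodes.
good = root.find("good").text
bad = root.find("bad").text

print(parses_as_xml(good))  # the raw string happens to be well-formed
print(parses_as_xml(bad))   # the stray "<" breaks well-formedness
```

Both values are only strings in the tree; whether they "turn out to be" XML is known only once the raw text is run through a parser.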
> Now, I need to convert that tree fragment to a nodeset so I can operate
> on it.
Okay....
> I noticed the "xalan:nodeset" (and "exslt:node-set") function. I assume
> this takes a tree fragment and returns a nodeset.
Yes, but it takes a "result tree fragment" (see the FAQ or the XSLT Rec on
what this is), not a string, so it's not useful to you.
Using Xalan, you may be hosed -- unless you care to pipeline your XML
through another parse:
parse/transform #1 -- writes the "XML" strings unescaped (d-o-e) into the serialized result
parse/transform #2 -- does whatever processing you want to do
If your XML doesn't break between these two steps (because you fed
malformed strings to process #1), process #2 will see a real tree fragment
there.
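The shape of that two-pass pipeline can be sketched as follows. This is a Python stand-in for the two transforms, not Xalan or XSLT code. The function names pass_one and pass_two and the sample document are assumptions for illustration, and the first pass handles only a flat document (a root with text-only children).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def pass_one(source_xml, tag):
    """Serialize a flat document, writing the text of <tag> elements raw.

    Mimics a transform that disables output escaping for one element;
    every other element's text is escaped as usual.
    """
    root = ET.fromstring(source_xml)
    parts = ["<%s>" % root.tag]
    for child in root:
        text = child.text or ""
        if child.tag != tag:
            text = escape(text)
        parts.append("<%s>%s</%s>" % (child.tag, text, child.tag))
    parts.append("</%s>" % root.tag)
    return "".join(parts)


def pass_two(serialized):
    """Re-parse the serialized result; raises ET.ParseError if it broke."""
    return ET.fromstring(serialized)


# Hypothetical sample input with escaped markup inside <body>.
SOURCE = (
    "<doc><title>A &amp; B</title>"
    "<body>&lt;p&gt;Hi &lt;b&gt;there&lt;/b&gt;&lt;/p&gt;</body></doc>"
)

intermediate = pass_one(SOURCE, "body")
tree = pass_two(intermediate)
print(intermediate)
print(tree.find("body/p/b").text)
```

If the escaped string was not well-formed, pass_two raises a ParseError, which is exactly the breakage between the two steps that the pipeline is exposed to.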
Saxon has an extension function that lets you feed such strings to a parser
from within the transform. That's what you need if you want to do this in
one pass.
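Saxon's extension itself is not shown here. As a rough analogue of the one-pass idea (parse the string inside the same process and graft the resulting nodes into the tree), here is a sketch in Python using only the standard library. The helper name expand_escaped and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def expand_escaped(elem):
    """Replace elem's escaped-markup text with real child nodes, in place.

    The text is parsed first, so if it is not well-formed an
    ET.ParseError is raised and elem is left untouched.
    """
    fragment = ET.fromstring("<wrapper>" + (elem.text or "") + "</wrapper>")
    elem.text = fragment.text
    for child in list(fragment):
        elem.append(child)


# Hypothetical sample input with escaped markup inside <body>.
root = ET.fromstring(
    "<doc><body>&lt;p&gt;Hi &lt;b&gt;there&lt;/b&gt;&lt;/p&gt;</body></doc>"
)
body = root.find("body")
expand_escaped(body)

# <body> now holds real element nodes that can be navigated.
print(root.find("body/p/b").text)
```

The same caveat applies as with the pipeline: the parse step fails on any string that only claims to be XML.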
(But what you really need is real XML input: that won't be subject to
breakage. Since you're getting pseudo-XML input you can't tell whether it's
actually going to work until you try it. You may need a tidier in your
pipeline.)
It's bad design, as the RSS folks are discovering, to mix escaped markup
into XML and expect it to be picked up later, since it muddies the divide
between XML-as-character-string and XML-as-model. Although some designers
think it's a feature that the embedded "markup" doesn't have to be
well-formed, this is a trap: they are just putting themselves back in the
world where they have to wage an endless campaign to clean up the bad
markup they asked for. Like saying they want to live in a Japanese house
since it's cleaner, but then letting people wear shoes anyhow. (And
complaining "I thought this house was supposed to stay clean!")
It's just easier to ensure the stuff is well-formed at the point of
creation, not to rely on markup-escaping to try to sneak it in as markup
later. And there are tools that can help with this. Unfortunately, if
you're hired as the guy to come in and clean the house after the party, you
may not be in a position to make no-shoes rules.
Good luck,
Wendell
======================================================================
Wendell Piez mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list