Erik,
At 05:13 PM 7/6/2006, you wrote:
i ran across a problem where someone wants to basically serialize
and parse xml in xslt. the idea is to "package" some xml sub-trees
in attribute values as strings (with escaped markup) and then in
another transformation to re-use this text as xml. and all this in xslt 1.0...
i know that this is a weird idea, but i found it challenging to
think about how this could be done. i thought that one would
basically have to write one's own serializer and parser in xslt,
which probably is rather expensive. am i missing some better way to
solve this problem?
This is an interesting thought experiment, but "weird" seems to me a
kind way of putting it.
Attribute values are really not a good place to store XML. That's
what XML is for.
An analogy, if you like, would be making a black-and-white line
drawing (an attribute value) of a color picture (XML) and then
wanting to derive from it what the original colors were. It's
possible to distinguish, through conventional means (various styles
of cross-hatching etc.) different "colors" in black and white, and
even to process such line art back into a color version. It is even
possible to imagine odd sorts of scenarios where one might wish to do
this. But in a system that can represent color (XML), dumbing down to
black-and-white (attribute values, which require parsing to become
more than simply strings) seems both gratuitously difficult and
comically unnecessary, like (another analogy) eating your dinner with
a drinking straw when you have knife and fork at the place setting.
In the old days (SGML), people used to tweak their systems to use
different markup delimiters to do this kind of
snake-eating-its-tail thing, as in:
[DOCUMENT]
[TITLE]Bigco Home Page
[VERSION]1997 01
[AUTHOR]Bill Little
[CONTENT DOCTYPE="PUBLIC '-//W3C//DTD HTML 3.2 Final//EN'"]
<html>
<head><title>Bigco Home Page</title>
<body><p>We at Bigco are committed ... </p> ... </body>
</html>
[/DOCUMENT]
As you can see, the one markup format is a wrapper for the other, and
an SGML system would be fully capable of handling it. (And don't you
like those nice implicit end tags?)
XML did away with declaring your own markup delimiters, to make
parsers small and (in comparison to SGML) easy to write. In many ways
this made life much easier. It did make this kind of magic more difficult.
The XML solution to the shutting down of this avenue (or thorn
forest) was the same solution it offered to the arbitrary mixing of
tag sets: namespaces.
While namespaces present their own puzzles to the mortal human brain,
in XML they really are preferable as a way of representing one form
of XML inside another. They don't hide your subordinate markup from
first-pass parsing the way these other methods do (including hiding
your pseudo-XML inside attributes), but that might actually be an
advantage. They do make it possible to keep the processing of the two
(or more) types of markup distinct from one another, which it seems
to me is the main thing.
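
To make the namespace approach concrete, here is a minimal sketch in
Python (standard-library ElementTree), recasting the Bigco wrapper above
as XML. The wrapper namespace URI and the w: prefix are invented for
illustration, and the </head> end tag is added because XML has no
implicit end tags. It is only meant to show that the two vocabularies
stay distinguishable after an ordinary parse.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the wrapper vocabulary (an assumption).
WRAP = "http://example.com/ns/wrapper"
# The standard XHTML namespace, used here for the embedded markup.
XHTML = "http://www.w3.org/1999/xhtml"

doc = f"""\
<w:document xmlns:w="{WRAP}" xmlns="{XHTML}">
  <w:title>Bigco Home Page</w:title>
  <w:version>1997 01</w:version>
  <w:author>Bill Little</w:author>
  <w:content>
    <html>
      <head><title>Bigco Home Page</title></head>
      <body><p>We at Bigco are committed ...</p></body>
    </html>
  </w:content>
</w:document>
"""

root = ET.fromstring(doc)


def in_namespace(elem, uri):
    # ElementTree spells qualified names as "{uri}local".
    return elem.tag.startswith("{" + uri + "}")


def local_name(elem):
    # Strip the "{uri}" part and keep the local name.
    return elem.tag.split("}", 1)[1]


# One parse, two vocabularies, each selected by its namespace.
wrapper_names = [local_name(e) for e in root.iter() if in_namespace(e, WRAP)]
html_names = [local_name(e) for e in root.iter() if in_namespace(e, XHTML)]
```

Both lists come out of the same parsed tree, yet the wrapper's title
and the HTML title never get confused, because each is picked out by
its namespace rather than by its bare name.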
As to how it could be done better if you really went this route:
well, there are systems (including Saxon; see the saxon:parse()
extension function) that let you pass a string to an XML parser and
get a temporary tree back, thus handling the problem in the next layer
down, which is a much better place to do it than natively in XSLT.
Short of this, one could always design a pipeline
to write out string values as files and pick them up with a parser
the old-fashioned way.
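
For what it's worth, the round trip looks roughly like this when the
parsing is left to the layer underneath. This is a sketch in Python,
not Saxon: ElementTree stands in for whatever parser the processor
would call, and the element and attribute names (package, data, note)
are made up for the example.

```python
import xml.etree.ElementTree as ET

# Step 1: serialize a subtree and store the text in an attribute value.
payload = ET.fromstring("<note><to>Erik</to><body>a &lt; b</body></note>")
carrier = ET.Element("package")  # hypothetical carrier element
carrier.set("data", ET.tostring(payload, encoding="unicode"))

# The serializer escapes the markup, so the attribute holds plain text.
wire = ET.tostring(carrier, encoding="unicode")

# Step 2: a later step reads the attribute back as a string and hands
# that string to a real XML parser instead of parsing it by hand.
received = ET.fromstring(wire)
restored = ET.fromstring(received.get("data"))
```

The escaping and unescaping are done entirely by the serializer and
parser; nothing in the application code has to understand markup
syntax, which is the point of pushing the job down a layer.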
There is also something of a literature on regular-expression
processing of markup, and along these lines, XSLT's own capabilities
for arbitrary parsing of any markup syntax whatever will be (are)
considerably stronger in version 2.0. And yes, there are those of us
interested in this. But personally I'm much more interested in
parsing syntaxes other than XML, since for XML I already have parsers.
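
As a rough illustration of regular-expression processing of a non-XML
syntax, here is a sketch in Python (rather than XSLT 2.0) that reads
the bracketed header lines from the Bigco example and turns them into
elements. The one-field-per-line layout and the output element names
are assumptions made for the sketch.

```python
import re
import xml.etree.ElementTree as ET

text = """\
[TITLE]Bigco Home Page
[VERSION]1997 01
[AUTHOR]Bill Little
"""

# One field per line: a bracketed upper-case name, then the value.
FIELD = re.compile(r"^\[([A-Z]+)\](.*)$", re.MULTILINE)

# Map each field name (lower-cased) to its trimmed value.
fields = {name.lower(): value.strip() for name, value in FIELD.findall(text)}

# Rebuild the fields as XML elements under a hypothetical wrapper.
doc = ET.Element("document")
for name, value in fields.items():
    ET.SubElement(doc, name).text = value

xml_out = ET.tostring(doc, encoding="unicode")
```

A flat syntax like this is about as far as a single regular expression
goes comfortably; nested markup needs a real parser.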
Cheers,
Wendell