Re: [xsl] serializing and parsing xml in xslt

Subject: Re: [xsl] serializing and parsing xml in xslt
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Thu, 06 Jul 2006 18:03:58 -0400

Erik,

At 05:13 PM 7/6/2006, you wrote:
> i ran across a problem where someone wants to basically serialize and parse xml in xslt. the idea is to "package" some xml sub-trees in attribute values as strings (with escaped markup) and then in another transformation to re-use this text as xml. and all this in xslt 1.0...
>
> i know that this is a weird idea, but i found it challenging to think about how this could be done. i thought that one would basically have to write one's own serializer and parser in xslt, which probably is rather expensive. am i missing some better way to solve this problem?

This is kind of an interesting thought experiment, but "weird" strikes me as a kind way of putting it.


Attribute values are really not a good place to store XML. That's what XML is for.

An analogy, if you like, would be making a black-and-white line drawing (an attribute value) of a color picture (XML) and then wanting to derive from it what the original colors were. It's possible to distinguish, through conventional means (various styles of cross-hatching etc.) different "colors" in black and white, and even to process such line art back into a color version. It is even possible to imagine odd sorts of scenarios where one might wish to do this. But in a system that can represent color (XML), dumbing down to black-and-white (attribute values, which require parsing to become more than simply strings) seems both gratuitously difficult and comically unnecessary, like (another analogy) eating your dinner with a drinking straw when you have knife and fork at the place setting.

In the old days (SGML), people used to tweak their systems to use different markup delimiters to do this kind of snake-eating-its-tail thing, as in:

[DOCUMENT]
[TITLE]Bigco Home Page
[VERSION]1997 01
[AUTHOR]Bill Little
[CONTENT DOCTYPE="PUBLIC '-//W3C//DTD HTML 3.2 Final//EN'"]
  <html>
    <head><title>Bigco Home Page</title>
    <body><p>We at Bigco are committed ... </p> ... </body>
  </html>
[/DOCUMENT]

As you can see, the one markup format is a wrapper for the other, and an SGML system would be fully capable of handling it. (And don't you like those nice implicit end tags?)

XML did away with declaring your own markup delimiters, to make parsers small and (in comparison to SGML) easy to write. In many ways this made life much easier. It did make this kind of magic more difficult.

The XML solution to the shutting down of this avenue (or thorn forest) was the same solution it offered to the arbitrary mixing of tag sets: namespaces.

While namespaces present their own puzzles to the mortal human brain, in XML they really are preferable as a way of representing one form of XML inside another. They don't hide your subordinate markup from first-pass parsing the way these other methods do (including hiding your pseudo-XML inside attributes), but that might actually be an advantage. They do make it possible to keep the processing of the two (or more) types of markup distinct from one another, which it seems to me is the main thing.
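
To make the namespace point concrete, here is a minimal sketch of the Bigco example recast with namespaces. It is written in Python with the standard library's ElementTree rather than in XSLT, only so that it runs without a particular processor. The wrapper namespace URI `urn:example:wrapper` and the `w:` element names are invented for illustration; only the XHTML namespace URI is a real one.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper vocabulary; this URI is invented for the example.
WRAPPER_NS = "urn:example:wrapper"
# The real XHTML namespace.
XHTML_NS = "http://www.w3.org/1999/xhtml"

DOC = """\
<w:document xmlns:w="urn:example:wrapper"
            xmlns="http://www.w3.org/1999/xhtml">
  <w:title>Bigco Home Page</w:title>
  <w:author>Bill Little</w:author>
  <w:content>
    <html>
      <head><title>Bigco Home Page</title></head>
      <body><p>We at Bigco are committed ... </p></body>
    </html>
  </w:content>
</w:document>
"""


def split_by_namespace(xml_text):
    """Return (wrapper element names, XHTML element names) in document order."""
    root = ET.fromstring(xml_text)
    wrapper, xhtml = [], []
    for el in root.iter():
        # ElementTree spells qualified names as "{namespace-uri}local-name".
        ns, _, local = el.tag[1:].partition("}")
        if ns == WRAPPER_NS:
            wrapper.append(local)
        elif ns == XHTML_NS:
            xhtml.append(local)
    return wrapper, xhtml


wrapper_names, xhtml_names = split_by_namespace(DOC)
```

One parse sees both vocabularies, and the two `title` elements stay distinguishable because they live in different namespaces. In a stylesheet the same separation would come from matching on prefixed names bound to each namespace.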

As to how it could be done better if you really went this route: well, there are systems (including Saxon; see the saxon:parse() extension function) that let you pass a string to an XML parser and get back a temporary tree, thus handling the problem in the next layer down, which is a much better place to do it than natively in XSLT. Short of this, one could always design a pipeline that writes the string values out as files and picks them up with a parser the old-fashioned way.
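
For what "the next layer down" might look like outside any particular XSLT processor, here is a rough sketch in Python using only the standard library. It is not Saxon's API. The element names `item` and `note`, the attribute name `payload`, and the helpers `pack` and `unpack` are all hypothetical. It packs a serialized subtree into an attribute value and later hands that string back to a real XML parser instead of parsing it by hand.

```python
import xml.etree.ElementTree as ET


def pack(parent, attr_name, subtree):
    """Store a serialized copy of `subtree` in an attribute of `parent`."""
    parent.set(attr_name, ET.tostring(subtree, encoding="unicode"))


def unpack(parent, attr_name):
    """Hand the attribute string back to a real XML parser."""
    return ET.fromstring(parent.get(attr_name))


# Hypothetical names: <item>, <note>, and the "payload" attribute.
note = ET.fromstring('<note lang="en">Use <b>namespaces</b> &amp; relax</note>')
item = ET.Element("item")
pack(item, "payload", note)

# Serializing <item> escapes the markup held in the attribute value.
wire = ET.tostring(item, encoding="unicode")

# A later step parses the outer document first, then the attribute value.
restored = unpack(ET.fromstring(wire), "payload")
```

The serializer and parser do all the escaping and unescaping, which is the point: neither step needs a hand-written parser. It still leaves the inner markup invisible to anything that reads only the outer document.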

There is also something of a literature on regular-expression processing of markup, and along these lines, XSLT's own capabilities for arbitrary parsing of any markup syntax whatever will be (are) considerably stronger in version 2.0. And yes, there are those of us interested in this. But personally I'm much more interested in parsing other syntaxes besides XML, for which I have parsers.

Cheers,
Wendell
