Understanding character handling

Subject: Understanding character handling
From: Paul Prescod <paul@xxxxxxxxxxx>
Date: Thu, 07 Jan 1999 14:27:09 -0600
The root of this problem is the lack of a data model in the XML
specification. Those of us with an SGML background understand the data model
implicitly, and many others have picked it up, but some obviously have not
yet. Some of us pushed very hard for a data model in the XML
specification. Instead it appears in the DOM *and* XSL *and* XPointer
*and* ...

I will use Python/OQL syntax to try to explain this in terms of the XSL
data model. (There is no syntax for the data model itself.)

Consider the parsing process. It builds a grove from text:

&lt;  -> DataChar( "<" )
<![CDATA[<]]> -> DataChar( "<" )
<![CDATA[&]]> -> DataChar( "&" )
<![CDATA[&foo;]]> -> [DataChar( "&" ),DataChar( "f" ), DataChar( "o" ...
<FOO>a...</FOO> -> Element( gi = "FOO", content=[DataChar( "a" ), ... ] )
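
The parsing step can be checked against a real parser. What follows is only a sketch, assuming Python's standard xml.etree.ElementTree as a stand-in for a grove builder; the helper name data_chars is made up for illustration and is not part of any XSL or DOM interface. It suggests that the escaped form and the CDATA form arrive as the same characters:

```python
import xml.etree.ElementTree as ET

def data_chars(xml_text):
    """Parse a small document and return the root's text as a list of characters.

    Hypothetical helper: each list item plays the role of one DataChar node.
    """
    root = ET.fromstring(xml_text)
    return list(root.text or "")

# The markup differs, the parsed characters do not.
escaped = data_chars("<FOO>&lt;</FOO>")
cdata_lt = data_chars("<FOO><![CDATA[<]]></FOO>")
cdata_amp = data_chars("<FOO><![CDATA[&]]></FOO>")
# Inside CDATA, "&foo;" is not an entity reference, just five characters.
cdata_ref = data_chars("<FOO><![CDATA[&foo;]]></FOO>")
```

ElementTree keeps no record of whether a character came from a reference or a CDATA section, which is the point being made here.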

Now consider the serialization of XML. It builds text from a grove. But it
has many options because there are many equivalent serializations for a
given character:

DataChar( "<" ) -> &lt;
                -> <![CDATA[<]]>
                -> &#60;
                -> &#x3c;

Element( gi = "FOO", content=[DataChar( "a" ), ... ] )
      -> <FOO>a...</FOO>
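
The same point can be tried from the serialization side. This is a rough sketch, again assuming xml.etree.ElementTree rather than any particular XSL processor: all four spellings parse to one character, and the serializer is free to pick whichever spelling it likes on output (ElementTree happens to pick the entity reference).

```python
import xml.etree.ElementTree as ET

# Four equivalent serializations of DataChar( "<" ).
serializations = ["&lt;", "<![CDATA[<]]>", "&#60;", "&#x3c;"]
parsed = [ET.fromstring("<FOO>" + s + "</FOO>").text for s in serializations]

# Going the other way: build the tree directly and let the serializer choose.
elem = ET.Element("FOO")
elem.text = "<"
chosen = ET.tostring(elem, encoding="unicode")
```

Another serializer could legitimately emit a CDATA section or a numeric reference instead; the tree would be the same.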

Now consider the (logically identical) XSL templates:

<FOO><![CDATA[&foo;]]></FOO>
<FOO>&amp;foo;</FOO>

When the stylesheet is parsed either one becomes:

Element( 
   gi = "FOO", 
   content = [DataChar( "&" ),DataChar( "f" ), DataChar( "o" ...] 
)
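
As a quick check of that claim, here is a minimal sketch that parses both templates with Python's xml.etree.ElementTree (an assumption on my part; a real XSL processor builds its own tree, but the parsing rules are the same XML rules):

```python
import xml.etree.ElementTree as ET

cdata_template = ET.fromstring("<FOO><![CDATA[&foo;]]></FOO>")
escaped_template = ET.fromstring("<FOO>&amp;foo;</FOO>")

# Same generic identifier, same character content.
same_gi = cdata_template.tag == escaped_template.tag
same_content = cdata_template.text == escaped_template.text

# The content as a list of characters, one per DataChar.
content = list(cdata_template.text)
```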

The encoding of the ampersand is irrelevant. Now this is a literal result
element, with literal text within it. So it is copied to the output tree
like this:

Element( 
   gi = "FOO", 
   content = [DataChar( "&" ),DataChar( "f" ), DataChar( "o" ...] 
)

In other words, it is identical. Now if you go back to the serialization
model above, you'll see that the correct serialization for this *as an XML
file* is:

<FOO>&amp;foo;</FOO>
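
The whole round trip can be sketched in a few lines. This assumes xml.etree.ElementTree as the parser and serializer, which is not what an XSL processor uses internally, but the behaviour on the ampersand should be the same for any conforming XML serializer: the tree holds a bare "&", so the output has to escape it again.

```python
import xml.etree.ElementTree as ET

# Parse the CDATA form of the template ...
tree = ET.fromstring("<FOO><![CDATA[&foo;]]></FOO>")

# ... and serialize the resulting tree as XML.
output = ET.tostring(tree, encoding="unicode")

# Parsing the output again gives back the same characters.
reparsed = ET.fromstring(output)
```

The CDATA section is gone from the output because it was never in the tree, only in the input text.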

Get it? The reason this is tricky is:

 a) There are about four steps between the input and the output.
 b) XSL's syntax tricks you into thinking you are working with strings
    when you are really working with trees.
 c) The data model is expressed in the wrong place.
 d) There is no syntax for talking about the data model (other than
    Python/OQL).

 Paul Prescod  - ISOGEN Consulting Engineer speaking for only himself
 http://itrc.uwaterloo.ca/~papresco

"You have the wrong number."
"Eh? Isn't that the Odeon?"
"No, this is the Great Theater of Life. Admission is free, but the 
taxation is mortal. You come when you can, and leave when you must. The 
show is continuous. Good-night." -- Robertson Davies, "The Cunning Man"


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

