Re: more encoded questions

Subject: Re: more encoded questions
From: Mike Brown <mike@xxxxxxxx>
Date: Mon, 6 Nov 2000 22:51:36 -0700 (MST)

Josef Vosyka wrote:
> Characters are being rendered according to
>         a) input encoding
>         b) input form (escaped/non-escaped)


  the xml document is typically a bit sequence like

  these represent ISO/IEC 10646-1:1993 (UCS) (~Unicode) characters like

  <?xml version="1.0" encoding="utf-8"?>
    <element attribute="cdata">character&#32;data</element>

  this mapping of bits to UCS characters is the encoding (essentially).
  the encoding declaration in the XML declaration is only for helping to
  determine the encoding. once the document is decoded, it is irrelevant.
  it is at that point all UCS characters.

  after decoding the document, the xml parser resolves character and
  certain entity references, turning them into UCS characters too.
  in the example above, &#32; becomes the space character.

  the UCS characters at this level imply the logical structures:
  elements, attributes, character data. these structures are reported
  by the parser to the application (the XSLT processor).

  so you see, you can say &#32; or &#x20; or refer to an entity that
  you defined as the space character, or put the encoded bits for the
  character into the binary document ... it doesn't matter; it all means
  the same thing, once it goes through the parser. the XSLT processor
  only knows about the single space character that was meant, not the 5
  characters '&#32;'. those were just 'physical' markup.
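
To see this equivalence in practice, here is a minimal sketch using Python's standard xml.etree.ElementTree parser as a stand-in for whatever parser sits in front of an XSLT processor. It writes the space character (UCS code point 32, hex 20) four ways: as the decimal reference &#32;, as the hex reference &#x20;, through an entity, and literally. The entity name "sp" and the helper name parsed_text are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Four physical spellings of the same logical content. The element and
# attribute names follow the example document; the entity name "sp" is
# made up for this sketch.
DOCS = {
    "decimal reference": '<element attribute="cdata">character&#32;data</element>',
    "hex reference": '<element attribute="cdata">character&#x20;data</element>',
    "entity reference": (
        '<!DOCTYPE element [<!ENTITY sp "&#32;">]>'
        '<element attribute="cdata">character&sp;data</element>'
    ),
    "literal character": '<element attribute="cdata">character data</element>',
}


def parsed_text(markup):
    """Return the character data the parser reports for the root element."""
    return ET.fromstring(markup).text


results = {label: parsed_text(markup) for label, markup in DOCS.items()}

for label, text in results.items():
    print(f"{label:18} -> {text!r}")
```

All four spellings should report the same character data, with a single space between the two words. The application never sees which spelling was used.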

  now consider that the stylesheet is itself an xml document that
  is parsed just like the source document. the xslt processor acts
  on the logical structures. the stylesheet is not a literal
  specification for output. it is only a representation of how to
  build the result tree. character references in the stylesheet are
  just an abstraction for the individual characters that will actually
  be manipulated by the processor.

  the stylesheet's instructions result in the creation of a node
  tree -- the result tree. depending on what you put in the xsl:output
  element's 'method' and 'encoding' attributes, this tree will be
  serialized in different ways. the serialization for xml and html
  output methods will be as bits in the given encoding. the method
  might affect whether, say, UCS character 160 (non-breaking space)
  is output as the encoded bits for the single character number 160,
  or as the encoded bits for the character sequence '&nbsp;', or as
  the encoded bits for the character sequence '&#160;' or '&#xA0;'.
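
The serialization step can be sketched the same way. This is not an XSLT processor, only Python's xml.etree.ElementTree serializer, so treat it as an analogy: one small tree holding UCS character 160 is written out in three encodings. The element name "p" is arbitrary, and the '&nbsp;' line only shows what a serializer for html output could choose; it is built here from Python's html.entities table, not produced by the serializer.

```python
import xml.etree.ElementTree as ET
from html.entities import codepoint2name

# A tiny stand-in for a result tree: one element whose text contains
# UCS character 160 (non-breaking space). The element name is arbitrary.
node = ET.Element("p")
node.text = "a\u00a0b"

# Same tree, three output encodings.
as_utf8 = ET.tostring(node, encoding="utf-8")
as_latin1 = ET.tostring(node, encoding="iso-8859-1")
as_ascii = ET.tostring(node, encoding="us-ascii")

# A serializer for html output could also pick the named entity.
named = "&" + codepoint2name[0xA0] + ";"

for label, data in [
    ("utf-8", as_utf8),
    ("iso-8859-1", as_latin1),
    ("us-ascii", as_ascii),
]:
    print(f"{label:10} {data!r}")
print(f"{'named':10} {named}")
```

In utf-8 the character comes out as two bytes, in iso-8859-1 as one byte, and in us-ascii (which cannot represent it directly) as the reference '&#160;'. The tree is identical in all three cases; only the serialization differs.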

I wrote a lot about this because I
was disappointed that XML books make very little effort to address these
issues. Concepts like encoding and logical structures should come first.
Syntax and code samples come last, and are almost inconsequential, once
you understand the principles at work. Instead, everyone teaches these
things backward, and you end up with situations like this, where your
impression of the meaning of a character reference is shaped by the way
HTML user agents behave(d).

I think you are under the impression that character references are related
to the encoding of the document. They are not. They are by definition, in
both HTML and XML, references to characters in one specific repertoire.
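
A small sketch of that independence, again assuming Python's xml.etree.ElementTree as the parser. The same one-element document is encoded as utf-8 and as iso-8859-1, once with the reference &#233; and once with a literal e-acute. The element name "e" and the helper parse_as are made up for the example. The last lines use windows-1252 to show the kind of mix-up I mean: byte 0x80 is the euro sign in that encoding, but &#128; refers to UCS code point 128.

```python
import xml.etree.ElementTree as ET

TEMPLATE = '<?xml version="1.0" encoding="{enc}"?><e>{body}</e>'


def parse_as(enc, body):
    """Encode the document in `enc`, parse the bytes, return the text."""
    raw = TEMPLATE.format(enc=enc, body=body).encode(enc)
    return ET.fromstring(raw).text


seen = set()
for enc in ("utf-8", "iso-8859-1"):
    for body in ("&#233;", "\u00e9"):  # reference vs. literal e-acute
        text = parse_as(enc, body)
        seen.add(text)
        print(f"{enc:10} body bytes {body.encode(enc)!r:14} -> U+{ord(text):04X}")

# The reference &#128; names UCS code point 128, whatever the encoding;
# the windows-1252 byte 0x80 decodes to a different character entirely.
ref_128 = ET.fromstring("<e>&#128;</e>").text
byte_80 = b"\x80".decode("cp1252")
print(f"&#128; -> U+{ord(ref_128):04X}; cp1252 byte 0x80 -> U+{ord(byte_80):04X}")
```

The literal character's bytes change with the encoding, the reference's bytes do not, and the parser reports the same character every time.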

   - Mike
Mike J. Brown, software engineer in Denver, Colorado, USA
