RE: [xsl] character entities

Subject: RE: [xsl] character entities
From: Kevin Rodgers <kevin.rodgers@xxxxxxx>
Date: Thu, 28 Apr 2005 15:16:44 -0600
Edward Bryant writes:
> >The &amp; represents an ampersand; it's seen within the XSLT
> >stylesheet as a single character (for example, string-length() is 1);
> >but if you serialize the result as XML or HTML then it will be
> >output as &amp; because that's how an ampersand is represented in
> >XML and HTML.
> 
> I get that the "&amp;" is the unicode for an ampersand, but what do
> you mean by "serialize" ?

Think of it this way:

The XML input is parsed to yield a document tree of element, attribute,
text, processing instruction, comment, and namespace nodes.

The XSLT stylesheet is parsed (as XML), to yield a document tree of
element nodes (declarations and instructions), etc.

The XSLT semantics are applied to the XML tree, to yield the result
tree.

The result tree is output as a sequence of characters, according to the
xsl:output declaration.  This is called serialization (of the tree).
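
To watch the parse and serialize steps outside of XSLT, here is a minimal
sketch using Python's standard xml.etree.ElementTree module.  It is only
an analogy: no stylesheet is applied and the sample document is made up,
but the parser and the serializer treat &amp; the same way.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample input; "&amp;" is the escaped form of "&".
source = "<p>AT&amp;T</p>"

# Parsing yields a tree whose text node holds the real character.
root = ET.fromstring(source)
text = root.text
print(text)        # AT&T
print(len(text))   # 4 (the ampersand counts as one character)

# Serializing the tree escapes the ampersand again.
serialized = ET.tostring(root, encoding="unicode")
print(serialized)  # <p>AT&amp;T</p>
```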

> So, if a character reference is in an XML source file it will show up
> as a reference in an XHTML output file (I got the impression from
> other posts that the XSLT would change the reference into the actual
> character)?

Right, at least for markup characters such as & and <.  It doesn't
really matter whether the markup character was originally represented in
the XML input as a character reference, an entity reference, or as a
data character within a CDATA section; it is changed into the actual
character during parsing, and the xml, html, and xhtml output methods
then serialize it as an entity (or character) reference.  Any element
content or attributes in the XSLT stylesheet that have been copied to
the result tree are serialized in the same way.
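
The same point can be checked with a small sketch, again using Python's
xml.etree.ElementTree as a stand-in for the XML parser and serializer
(the four sample documents are invented for illustration):

```python
import xml.etree.ElementTree as ET

# Four hypothetical spellings of the same one-character content.
variants = [
    "<p>&amp;</p>",          # predefined entity reference
    "<p>&#38;</p>",          # decimal character reference
    "<p>&#x26;</p>",         # hexadecimal character reference
    "<p><![CDATA[&]]></p>",  # data character in a CDATA section
]

roots = [ET.fromstring(v) for v in variants]

# After parsing, every variant holds the single character "&".
parsed = [r.text for r in roots]
print(parsed)

# After serializing, every variant is written the same way,
# so the set collapses to one string.
serialized = {ET.tostring(r, encoding="unicode") for r in roots}
print(serialized)
```

Once the input has been parsed, nothing in the tree records which
spelling was used, so the serializer has only the character to work with.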

> >You can't create character references in an XSLT template.
> 
> So what is the accepted way to add character references to the output?
> Would I have to run some kind of find-and-replace script after the
> XSLT transformation? What do other people do?

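Avoid generating entity or character references in the output.  If your
output encoding (e.g. UTF-8) can represent the entire XML character set
(Unicode), then any character can simply be written in that encoding
(whether as a single- or multi-byte sequence) and doesn't need to be
escaped as a reference.  Conversely, if you choose a narrower encoding
such as US-ASCII, the xml and html output methods should write any
character outside that encoding as a character reference.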
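
A rough illustration of how the output encoding decides this, using
Python's xml.etree.ElementTree serializer rather than an XSLT processor
(the <quote> element and its content are made up for the example):

```python
import xml.etree.ElementTree as ET

# Hypothetical input: curly quotes written as character references.
root = ET.fromstring("<quote>&#8220;hi&#8221;</quote>")

# The parsed text holds the actual characters U+201C and U+201D.
text = root.text
print(len(text))  # 4

# UTF-8 can represent both characters, so no references are needed.
utf8_bytes = ET.tostring(root, encoding="utf-8")
print(utf8_bytes)

# US-ASCII cannot, so the serializer falls back to character references.
ascii_bytes = ET.tostring(root, encoding="us-ascii")
print(ascii_bytes)
```

The tree is identical in both calls; only the encoding passed to the
serializer changes what ends up in the byte stream.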

> I came across "xmlchar" at XML.com. I didn't want to use it, but it changes 
> an element into a character reference. Looking at their stylesheets, I don't 
> understand how that works but my attempt to change <quote></quote> into the 
> #8220 and #8221 entities won't?

Why not just output those characters directly?

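But if you're bent on outputting the sequence of characters "&#8220;",
then that string can be represented in your stylesheet as
"&amp;#8220;".  Note that the xml and html output methods escape the
ampersand again on serialization, so this yields a literal "&#8220;"
only with the text output method or with disable-output-escaping="yes".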

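Here is a small sketch of what happens to a string written as
"&amp;#8220;", once more using Python's xml.etree.ElementTree as an
assumed stand-in for the stylesheet parser and the result serializer
(the <q> element is hypothetical):

```python
import xml.etree.ElementTree as ET

# Hypothetical stylesheet fragment containing the escaped string.
root = ET.fromstring("<q>&amp;#8220;</q>")

# After parsing, the text is the seven characters & # 8 2 2 0 ;
text = root.text
print(text)       # &#8220;
print(len(text))  # 7

# XML serialization escapes the ampersand again.
as_xml = ET.tostring(root, encoding="unicode")
print(as_xml)     # <q>&amp;#8220;</q>

# Text serialization writes the characters without any escaping.
as_text = ET.tostring(root, encoding="unicode", method="text")
print(as_text)    # &#8220;
```

So the escaped form only survives as a literal "&#8220;" when the
serializer does no escaping, which is why outputting the character
itself is usually the simpler route.
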
-- 
Kevin Rodgers
