Re: [xsl] Maintaining character entities

Subject: Re: [xsl] Maintaining character entities
From: David Carlisle <davidc@xxxxxxxxx>
Date: Tue, 20 May 2003 10:04:33 +0100
  I've got XML documents, marked up to a DTD, and calling character entity
  sets. When I run through the XSLT processor (xalan) to output another XML
  file I find the entities have been converted to something different, and
  fairly inconsistently. 

Entities are expanded by the XML parser (probably xerces in your case)
before the XML application (xalan) sees the data.
So they are all gone by the time your stylesheet starts, and nothing you
can do can preserve them. This is intentional behaviour: entities are
supposed to be an _authoring_ macro system, and the behaviour of the
document is supposed to be the same whether the author uses the entity
shorthand or the full form. By having the parser replace all of the
entities at the start, consistent behaviour is ensured.
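
As a rough illustration of the point above (using Python's standard
library parser rather than xerces, and a made-up one-line document with
the entity declared in the internal DTD subset), the application only
ever sees the expanded character:

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the internal DTD subset declares a single entity.
SOURCE = '<!DOCTYPE doc [<!ENTITY uuml "&#252;">]><doc>M&uuml;nchen</doc>'

# The parser expands &uuml; before any application code sees the data.
root = ET.fromstring(SOURCE)
parsed_text = root.text

# No trace of the entity reference is left in the parsed text.
print(repr(parsed_text))
```

The same should hold for any conforming parser: the tree handed to the
stylesheet has no record of which characters were written as entities.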

> What I would like to achieve is having &ldquo; &uuml; in my input xml, and
> these entities still being untouched in my output. Can anyone advise how I
> achieve this please?

You cannot do that, but you can control whether characters are output as
themselves, as entity references, or as numeric character references.

If you output as html then most xslt systems will use "&uuml;" and
friends on output whether or not the entity was used on input.

In XML output, if your processor supports an output encoding (e.g. ascii)
that does not have the characters, then these characters will be output
as numeric character references (&#...;).
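
In a stylesheet the encoding is requested with the encoding attribute of
xsl:output. The effect can be sketched with Python's ElementTree
serializer, which is only a stand-in here for whatever serializer your
XSLT processor uses (the element name and text are made up):

```python
import xml.etree.ElementTree as ET

# Hypothetical document containing two non-ASCII characters.
root = ET.Element("doc")
root.text = "M\u00fcnchen \u201cquoted\u201d"

# ASCII cannot represent the characters, so the serializer falls back
# to numeric character references.
ascii_bytes = ET.tostring(root, encoding="us-ascii")

# UTF-8 can represent them, so they are written as themselves.
utf8_bytes = ET.tostring(root, encoding="utf-8")

print(ascii_bytes)
print(utf8_bytes)
```

Both byte strings parse back to exactly the same document, which is why
the choice between them is purely a serialization matter.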

Some processors have extension options that give more control, not sure
about xalan though.

> What I'm getting are (&amp;ldquo;, &amp;uuml;),

You should never get that as output from a single character, only if you
input that form (either as &amp;ldquo; or equivalently
<![CDATA[&ldquo;]]>, which means the same thing).
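
A small check of that equivalence, again with Python's standard library
parser standing in for xerces and a made-up element name: both spellings
give the same seven-character string, and serializing it escapes the
ampersand again.

```python
import xml.etree.ElementTree as ET

# Two ways of writing the literal text "&ldquo;" (not the character).
escaped = ET.fromstring("<d>&amp;ldquo;</d>")
cdata = ET.fromstring("<d><![CDATA[&ldquo;]]></d>")

# Writing the text back out escapes the ampersand once more.
reserialized = ET.tostring(escaped, encoding="unicode")

print(escaped.text, cdata.text, reserialized)
```

So if that form shows up in the result, it is worth checking whether the
source (or some earlier processing step) already contained the escaped
text rather than a real entity reference.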

>  (â€œ and Ã¼),
That is utf8, which (unlike the entities or latin-1) is understood by all
XML processors, so this is actually the best, most portable output.
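
The odd-looking characters are presumably just correct utf8 bytes being
displayed by a viewer that assumes a single-byte encoding. A minimal
sketch of that effect, assuming the viewer uses latin-1 or windows-1252:

```python
# u-umlaut encoded as utf8 is the two bytes C3 BC.
umlaut_bytes = "\u00fc".encode("utf-8")

# Read back as latin-1, each byte becomes its own character.
misread_umlaut = umlaut_bytes.decode("latin-1")

# The left double quote is three bytes in utf8 (E2 80 9C); read as
# windows-1252 it turns into three unrelated characters.
misread_quote = "\u201c".encode("utf-8").decode("cp1252")

print(umlaut_bytes, misread_umlaut, misread_quote)
```

Opening the file in a tool that honours the declared encoding should
show the intended characters.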

>  (&#8220;
That is also portable, and as I say above is the expected output if you
specify an encoding that does not include the character.

Given that all XML processors are mandated to understand 2 of the 3
outputs that you say you got, why do you need the entities?

