Re: [xsl] Output: XML to XML scrambling unicode characters

Subject: Re: [xsl] Output: XML to XML scrambling unicode characters
From: David Carlisle <davidc@xxxxxxxxx>
Date: Mon, 4 Mar 2002 22:15:40 GMT
> If I use these symbols, I must add "&" before and ";" after. It was
> my assumption that "&#233;" was not any different than these. This
> is the reason why I called "&#233;" a utf-8 rendering of "e acute".

No, this is not so. You can enter a space as " " and a tab as a literal
tab character. Writing &#32; or &#9; instead is just an XML character
reference to those characters; the character data after the XML parse
is the same. Actually, in the case of white space the rules are a bit
different, as white space normalisation can affect edge cases, but for
a non white space character like e acute, if it is used in character
data you never need to use a character reference, provided the
character is in the file's encoding.
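
To make the point concrete, here is a minimal sketch using Python's
standard xml.etree.ElementTree parser (my choice for illustration, not
something the original poster used). It parses the same text written
with a literal e acute, a decimal reference and a hex reference, and
compares the resulting character data.

```python
import xml.etree.ElementTree as ET

# The same element content written three ways: a literal character,
# a decimal character reference, and a hexadecimal character reference.
literal = ET.fromstring("<p>caf\u00e9</p>")
decimal_ref = ET.fromstring("<p>caf&#233;</p>")
hex_ref = ET.fromstring("<p>caf&#xE9;</p>")

# After the parse, the application sees identical character data.
same_text = literal.text == decimal_ref.text == hex_ref.text
print(same_text, repr(literal.text))
```

Once parsed, nothing records which of the three spellings was in the
source, which is why a stylesheet cannot tell them apart either.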

> is the reason why I called "&#233;" a utf-8 rendering of "e acute".
Doing so leads to confusion though.
Text encodings relate to the text stream and do not relate to XML syntax
at all. So, for example, latin1 (iso-8859-1) is an encoding in which every
character takes up one byte, and some positions are unencoded, so there
are only about two hundred characters available. Enough for western
Europe, mostly. If you have a plain text latin1 file you are restricted
to just using those characters, and if you want to write, say, Polish,
you'd have to switch to a different encoding (latin2).
However, in XML you can, whatever encoding the file is in, always refer
to any of the characters in Unicode (i.e. numbers up to hex 10FFFF) using
the &# notation. This notation always uses the same Unicode numbers and
so is independent of the encoding used (utf8, latin1, etc.), except of
course that it depends on the encoding of the characters that make up
the syntax itself: "&", "#", "x", ";", the digits 0-9 and the letters
a-f and A-F.
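
As an illustration of that independence, the following sketch (again
using Python's xml.etree.ElementTree, which is my assumption and not
part of the original discussion) encodes one document as latin1 and as
utf8. The literal e acute is stored as different bytes in each file,
the reference &#233; is the same ASCII bytes in both, and &#x142;
reaches a Polish character that latin1 cannot store directly.

```python
import xml.etree.ElementTree as ET

# One document template: a literal e acute, a reference to e acute,
# and a reference to l with stroke (x142), which is outside latin1.
template = '<?xml version="1.0" encoding="{enc}"?><p>\u00e9 &#233; &#x142;</p>'

latin1_bytes = template.format(enc="iso-8859-1").encode("iso-8859-1")
utf8_bytes = template.format(enc="utf-8").encode("utf-8")

# The literal character is one byte in latin1 and two bytes in utf8 ...
literal_differs = b"\xe9 " in latin1_bytes and b"\xc3\xa9 " in utf8_bytes
# ... but the character reference is byte-for-byte the same in both.
reference_same = b"&#233;" in latin1_bytes and b"&#233;" in utf8_bytes

# Both files parse to the same character data.
latin1_text = ET.fromstring(latin1_bytes).text
utf8_text = ET.fromstring(utf8_bytes).text
print(literal_differs, reference_same, latin1_text == utf8_text)
```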

So if you want to force your processor to use &# syntax as much as
possible, you need to specify an encoding that includes as few
characters as possible.

The default utf8 encoding includes all of Unicode, so nothing has to be
written as a character reference.
Some processors let you use iso-8859-1 on output, in which case anything
above xFF will have to be output using &# notation.
Some let you use us-ascii, in which case everything above 127 will be
output that way.
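
The effect of the output encoding can be sketched with Python's
xml.etree.ElementTree serialiser, used here only as a stand-in for an
XSLT processor's output stage (in a stylesheet the corresponding knob
is the encoding attribute of xsl:output, where the processor supports
the requested encoding). The element name and text are made up for the
example.

```python
import xml.etree.ElementTree as ET

elem = ET.Element("p")
elem.text = "caf\u00e9 \u0142"  # e acute (xE9) and l with stroke (x142)

# Serialise the same tree with three different output encodings.
as_utf8 = ET.tostring(elem, encoding="utf-8")
as_latin1 = ET.tostring(elem, encoding="iso-8859-1")
as_ascii = ET.tostring(elem, encoding="us-ascii")

# utf8 can store both characters, latin1 only e acute, ascii neither;
# whatever the encoding cannot store comes out as a &# reference.
for name, data in (("utf-8", as_utf8), ("iso-8859-1", as_latin1), ("us-ascii", as_ascii)):
    print(name, data)
```

The narrower the encoding, the more of the text is forced into
character references, while the parsed content stays the same.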

Note however that if your XML file uses any of these characters in
element names, such encodings cannot be used: you cannot use <&#233;>
as an element, so the text encoding used must include all characters
used in element and attribute names.
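
A quick check of that restriction, sketched with Python's
xml.etree.ElementTree (my choice of parser, for illustration only): a
literal e acute is accepted as an element name, but a character
reference in the same position is rejected as not well-formed.

```python
import xml.etree.ElementTree as ET

# A literal e acute is a legal name character, so this parses.
literal_name = ET.fromstring("<\u00e9>text</\u00e9>")

# A character reference is only allowed in content and attribute
# values, never inside a name, so this is not well-formed XML.
try:
    ET.fromstring("<&#233;>text</&#233;>")
    reference_name_accepted = True
except ET.ParseError:
    reference_name_accepted = False

print(literal_name.tag == "\u00e9", reference_name_accepted)
```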

David





 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

