cknell@xxxxxxxxxx wrote:
If you are using the UTF-8 encoding, for example, the ó character is represented by ó
Actually, the encoding doesn't matter--what matters is the character
set, which is always Unicode for XML.
That is, the character ó (Latin small letter o with acute) is that
character in the Unicode character set regardless of the encoding.
Characters are an abstraction. A character set is nothing more than an
arbitrary mapping of abstract characters to unique numbers by which
those characters can be referenced. In Unicode, each character also has
a unique name that can be used instead of the character code to refer to
the character (although not all processors know how to resolve these names).
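
To make the number/name mapping concrete, here is a minimal sketch in
Python, used only as a convenient stand-in (it is not an XML or XSLT
processor). It looks up the code and the Unicode name of the character
and builds the numeric character reference an XML document could use
for it. The variable names are illustrative assumptions.

```python
import unicodedata

ch = "\u00f3"  # the character ó, written by its code so the source file's own encoding is irrelevant

code_point = ord(ch)                   # the number the character set assigns to the character
name = unicodedata.name(ch)            # the unique Unicode name of the character
same = unicodedata.lookup(name)        # resolve the name back to the character

# XML numeric character references use the code, in decimal or hexadecimal.
char_ref_dec = f"&#{code_point};"
char_ref_hex = f"&#x{code_point:X};"

print(code_point, name, same == ch, char_ref_dec, char_ref_hex)
```

None of these values depends on how the character is later written to
disk.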
The encoding simply determines how the characters are written to disk as
sequences of bytes. For example, in UTF-8 encoding this character is
written as two bytes, 0xC3 0xB3 (because its code, 243, is greater than
127, the point above which UTF-8 uses two or more bytes per character),
and in the UTF-16 encoding it is also written as two bytes (0x00 0xF3 in
big-endian order) because UTF-16 uses two bytes for each of the roughly
65K code points of the Unicode Basic Multilingual Plane. In a single-byte
encoding such as ISO-8859-1 it would be the single byte 0xF3. In all
cases the character (the abstraction) is the same: lowercase o with
acute.
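
The byte sequences can be checked directly. The following is a small
sketch in Python (again only a stand-in for illustration, with variable
names of my own choosing) that encodes the same character three ways.

```python
ch = "\u00f3"  # ó, Latin small letter o with acute

utf8_bytes = ch.encode("utf-8")         # variable-length encoding
utf16_bytes = ch.encode("utf-16-be")    # big-endian, no byte order mark
latin1_bytes = ch.encode("iso-8859-1")  # single-byte encoding

for label, data in [("UTF-8", utf8_bytes),
                    ("UTF-16BE", utf16_bytes),
                    ("ISO-8859-1", latin1_bytes)]:
    # Show each encoding's bytes in hex along with the byte count.
    print(label, data.hex(" "), len(data))
```

The bytes differ from one encoding to the next, but decoding each
sequence with its own encoding gives back the same character.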
To read an XML file, the XML processor must first read the sequence of
bytes on disk and then interpret that byte sequence as a sequence of
characters. Therefore, it must know the encoding because the same
sequence of bytes may result in different characters (or be invalid)
depending on the encoding it is interpreted as.
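
As a rough illustration of why the processor needs to know the encoding,
the sketch below (Python, using the standard library's ElementTree
parser as an example processor; the element name "a" and the variable
names are made up for the example) interprets the same bytes under
different encodings and then parses two tiny documents whose encoding
declarations match their bytes.

```python
import xml.etree.ElementTree as ET

data = b"\xc3\xb3"

as_utf8 = data.decode("utf-8")         # one character: ó
as_latin1 = data.decode("iso-8859-1")  # two characters: Ã followed by ³

# A lone 0xF3 byte is a complete character in ISO-8859-1 but is not a
# valid sequence in UTF-8.
try:
    b"\xf3".decode("utf-8")
    lone_f3_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_f3_is_valid_utf8 = False

# The same character stored in two encodings; each document declares
# the encoding its bytes are actually written in.
doc_utf8 = b'<?xml version="1.0" encoding="UTF-8"?><a>\xc3\xb3</a>'
doc_latin1 = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xf3</a>'

text_utf8 = ET.fromstring(doc_utf8).text
text_latin1 = ET.fromstring(doc_latin1).text

print(as_utf8, as_latin1, lone_f3_is_valid_utf8, text_utf8 == text_latin1)
```

Once parsed, both documents contain the same character even though the
bytes on disk differ.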
Cheers,
Eliot
--
W. Eliot Kimber
Innodata Isogen
eliot@xxxxxxxxxx
www.isogen.com
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list