Re: [xsl] Using accented characters in XML

Subject: Re: [xsl] Using accented characters in XML
From: Mike Brown <mike@xxxxxxxx>
Date: Fri, 18 May 2001 13:12:11 -0600 (MDT)

Alex Black wrote:
> The reason it's failing, is because that character need to be encoded as an
> 'entity' - I think I have an entities list around here somewhere.
> 
> anyway, that character should be encoded as &#201;

That's not an entity, nor an entity reference. It is a character reference.

> _watch_out_ with your entities, though - I was trying to use &nbsp; (the
> ever present space in html) and my xslt processor barfed on it. I think
> sablot has trouble with named entities. I'm not sure if that's a global
> problem with xslt processors.

&nbsp; is an entity reference. It presumes that an entity named nbsp has
been declared. XML has only 5 predefined entities, so without further
declarations you can only reference those 5: lt, gt, amp, quot and apos.
If you want more, you have to declare them in a DTD. XSL is XML, so this
applies equally to stylesheets, not just to "data" XML documents. You could
think of it like this: the entity is the replacement text, and the entity
reference marks where you want that text to go. It is the XML parser that
makes this substitution, before your XSL processor ever sees it.
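
The difference can be checked with any conforming parser. Below is a
minimal sketch using Python's standard xml.etree.ElementTree module (my
choice for illustration, not something from the original question); the
helper name parses() and the sample documents are made up for the example.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the parser accepts xml_text as well-formed."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A character reference needs no declaration: &#201; always means U+00C9.
char_ref = ET.fromstring("<p>&#201;ditez</p>")

# An entity reference to an undeclared entity is rejected by the parser.
undeclared = "<p>a&nbsp;b</p>"

# Declaring the entity in the DTD (here, the internal subset) makes the
# same reference legal; the parser substitutes the replacement text.
declared = '<!DOCTYPE p [<!ENTITY nbsp "&#160;">]><p>a&nbsp;b</p>'

print(repr(char_ref.text))
print(parses(undeclared), parses(declared))
print(repr(ET.fromstring(declared).text))
```

The same holds for a stylesheet, since the parser reads it before the XSLT
processor does.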

The complete set of standard character entities as used in HTML and other 
SGML applications, provided in the form of convenient declarations ripe for 
inclusion in a DTD, can be found at
http://www.oasis-open.org/cover/xml-ISOents.txt

To answer the original question, it is an encoding issue that will be solved
if he makes sure that his XML document properly declares its actual
encoding. I think he's leaving off the encoding declaration and it is
defaulting to UTF-8, when in fact the file is iso-8859-1 encoded. As the 
bytes for Éditez are read in, this is what happens:

É  d  i  t  e  z
C9 64 69 74 65 7A  <== actual bytes in the file
\   / |  |  |  |
 \ /  |  |  |  |   <== when interpreted as utf-8...
  |   |  |  |  | 
  |   i  t  e  z   <== are these characters.
  |_______________ <== The first 2 bytes are an invalid utf-8 sequence.
                       C9 is a lead byte, so the 2nd byte would have to
                       be between 80 and BF, and the pair would then
                       represent a Unicode code point between U+0240 and
                       U+027F (some unassigned code points and some
                       obscure Latin characters, not what was intended)
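
The byte-level picture can be reproduced in a few lines. This is only a
sketch in Python (an assumption on my part; any language with explicit
byte/character conversion would do), and the variable names are
illustrative.

```python
# "Éditez" as it sits in an iso-8859-1 file: C9 64 69 74 65 7A
raw = "\u00c9ditez".encode("iso-8859-1")

def decode_strict_utf8(data):
    """Decode as UTF-8, returning None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None

# C9 is a 2-byte lead byte (110xxxxx). With a valid continuation byte
# (80..BF) the pair would cover this range of code points:
lead = raw[0]
low = (lead & 0x1F) << 6
high = low | 0x3F

print(raw.hex(" "))
print(decode_strict_utf8(raw))                  # strict decoding fails
print(raw.decode("utf-8", errors="replace"))    # lenient: replacement char
print(f"U+{low:04X}..U+{high:04X}")
```

A lenient decoder substitutes a replacement character instead of failing,
which is one way a "?" can end up where the É should be.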

Perhaps the XML parser chose to substitute a "?" for the invalid utf-8
sequence, whereas it should have kicked out a fatal error. It's also
possible that he fed the parser a character stream (instead of bytes) in
which the substitution had already been made. If it is the former, putting
<?xml version="1.0" encoding="iso-8859-1"?> at the top of his XML document
will fix the problem.
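
To see the effect of the declaration, here is a small sketch, again using
Python's xml.etree.ElementTree as a stand-in for whatever parser he is
actually using; parse_text() is a hypothetical helper written for this
example.

```python
import xml.etree.ElementTree as ET

# The same iso-8859-1 bytes, with and without an encoding declaration.
body = "<p>\u00c9ditez</p>".encode("iso-8859-1")
declared = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body

def parse_text(data):
    """Parse raw bytes and return the root element's text, or None on error."""
    try:
        return ET.fromstring(data).text
    except ET.ParseError:
        return None

# Without a declaration the parser assumes utf-8 and rejects the bytes.
print(parse_text(body))
# With the declaration, byte C9 is read as the character U+00C9.
print(repr(parse_text(declared)))
```

Note that this only helps when the parser is handed the raw bytes; if the
input was already decoded into characters upstream, the damage is done
before the declaration is ever read.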

   - Mike
_____________________________________________________________________________
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

