Re: Certain chars break transformation process. Guru wanted!

Subject: Re: Certain chars break transformation process. Guru wanted!
From: Mike Brown <mike@xxxxxxxx>
Date: Wed, 23 Aug 2000 15:45:14 -0600 (MDT)

Braun Online wrote:
> Where have I gone wrong?  Whenever I try to perform an XSLT transformation 
> on an XML file which has the registered symbol (®), the XSLT programs read 
> the symbol as two characters (? and ?)

Your XML file is using an encoding that is different from what the XML
parser (which feeds info about the document to the XSLT processor) thinks
it has. Is there an encoding="..." specification in the <?xml ...?> line
at the beginning of the file? What does it say? What created the file? Was
it a simple text editor that didn't give you the option of selecting an
encoding/character set to use?

Most likely what you see as the circle-R in your editor is stored on disk
as a single byte, 0xAE, which is how that character is represented in
iso-8859-1 and cp1252. The XML parser is following rules outlined in the
XML spec for deciding what character set was used to encode that XML, and
is decoding the bytes accordingly. It is probably deciding utf-8 is the
character set that was used.
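
To make the byte-level picture concrete, here is a small Python sketch (Python is my choice for illustration; nothing in the original question says which tools are involved). It shows the single 0xAE byte, why that byte cannot be read as utf-8, and one plausible way a two-character result can appear: utf-8 output being viewed as iso-8859-1. That last step is an assumption about the symptom, not something confirmed in this thread.

```python
# U+00AE REGISTERED SIGN in the encodings discussed above.
REGISTERED = "\u00ae"

latin1_bytes = REGISTERED.encode("iso-8859-1")   # one byte: 0xAE
cp1252_bytes = REGISTERED.encode("cp1252")       # the same single byte
utf8_bytes = REGISTERED.encode("utf-8")          # two bytes: 0xC2 0xAE

# A lone 0xAE byte is not valid utf-8, so a strict utf-8 decoder rejects it.
try:
    latin1_bytes.decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

# Assumed cause of the symptom: utf-8 bytes displayed as iso-8859-1
# turn into two characters, U+00C2 followed by U+00AE.
misread = utf8_bytes.decode("iso-8859-1")
```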

You must change the XML file to declare the correct encoding, as it is an
error for a document to declare the wrong encoding. However, you should
note that XML parsers are only required to support utf-8 and utf-16, so it
may be necessary to check your parser's documentation to see whether it
supports the actual encoding of the document.
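
As a rough check of the effect of the declaration, the sketch below feeds the same iso-8859-1 bytes to a parser with and without an encoding declaration. It assumes Python's bundled xml.etree.ElementTree (backed by expat, which does handle iso-8859-1); the <name> element and its content are made up for the example, and your own parser may behave differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical document body; the element name and text are invented.
body = "<name>Acme\u00ae</name>"

# Same bytes on disk (iso-8859-1), with and without a matching declaration.
declared = ('<?xml version="1.0" encoding="iso-8859-1"?>' + body).encode("iso-8859-1")
undeclared = ('<?xml version="1.0"?>' + body).encode("iso-8859-1")

# With the correct declaration, the parser decodes 0xAE as U+00AE.
declared_text = ET.fromstring(declared).text

# With no declaration, the parser assumes utf-8 and rejects the lone 0xAE byte.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
```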

Aside from that, choose one:

- Change the encoding of the XML file using a tool like Free Recode
  so that it matches what the file declares its encoding to be

- Change the XML file to use character references instead of literal
  characters, for characters that are outside the printable ASCII range
  (0x20-0x7E) ... for example, &#xAE; (Note that such references are
  for the ISO/IEC 10646-1 universal character set, commonly though not
  accurately thought of as Unicode) ... this way your XML file will be
  entirely ASCII bytes, and since ASCII is a subset of UTF-8, it will
  be fine if the parser interprets it as UTF-8.
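
Both options can be sketched in a few lines of Python, assuming the file really is iso-8859-1 on disk. The sample content is hypothetical, and this stands in for a recoding tool rather than showing how any particular one works. Note that Python's xmlcharrefreplace handler writes decimal references (&#174;), which mean the same thing as the hexadecimal &#xAE; above.

```python
import xml.etree.ElementTree as ET

# Hypothetical file content, stored as iso-8859-1 with no declaration.
latin1_doc = b"<name>Acme\xae</name>"

# Decode using the encoding the bytes are actually in.
text = latin1_doc.decode("iso-8859-1")

# Option 1: re-encode so the bytes match what the parser will assume (utf-8).
as_utf8 = text.encode("utf-8")

# Option 2: keep the file pure ASCII by writing non-ASCII characters
# as numeric character references.
as_ascii = text.encode("ascii", "xmlcharrefreplace")

# Either way, a parser that assumes utf-8 now recovers U+00AE.
from_utf8 = ET.fromstring(as_utf8).text
from_ascii = ET.fromstring(as_ascii).text
```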

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at         My XML/XSL resources:
webb.net in Denver, Colorado, USA           http://www.skew.org/xml/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

