Re: [xsl] xml invalid characters

Subject: Re: [xsl] xml invalid characters
From: Mike Brown <mike@xxxxxxxx>
Date: Fri, 22 Mar 2002 16:08:11 -0700 (MST)

stevenson wrote:
> How can I avoid these problem. The data is from the database, and the
> character crashing it is £

You probably have an encoding problem. I assume that you're having trouble
with the British currency symbol for a Pound? At least, that's what it looks
like on my screen.

Quick lesson:

The POUND SIGN is character number A3 (hex) in Unicode. "U+00A3" is how you
can write it unambiguously in prose.

Encoding provides a way of representing that A3 as bytes.

iso-8859-1:  A3
     utf-8:  C2 A3
    utf-16:  00 A3 (big endian)
             A3 00 (little endian)

utf-8 and utf-16 can represent any Unicode character, but other encodings are
more limited; many of them use one byte per character and so can represent at
most 256 characters.
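
If you want to see those bytes for yourself, here is a minimal sketch in
Python (my choice for illustration only; the helper name hex_bytes is made
up). It prints the bytes that U+00A3 becomes in each encoding:

```python
def hex_bytes(text, encoding):
    """Encode text and return its bytes as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

pound = "\u00a3"  # POUND SIGN, U+00A3

# "utf-16-be" / "utf-16-le" give the bare code units; Python's plain
# "utf-16" codec would also prepend a byte order mark.
for name in ("iso-8859-1", "utf-8", "utf-16-be", "utf-16-le"):
    print(f"{name:>10}:  {hex_bytes(pound, name)}")
```

The big-endian form puts the high byte first (00 A3); the little-endian form
puts the low byte first (A3 00).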

If a character cannot be represented in a particular encoding, you write it as
a sequence of characters that can be represented in any encoding (spaces added
for clarity):

   & # x A 3 ;    or    & # 1 6 3 ;

For example, us-ascii does not have POUND SIGN (this may be the source of your 
problem; it's hard to say, without knowing all the stages of processing of 
your data, and the role Cold Fusion plays in it). So you'd have to use this 
escaped format.

             &  #  x  A  3  ;
  us-ascii:  26 23 78 41 33 3B

And this escaped format (a "character reference") also works just as well in 
other encodings:

iso-8859-1:  26 23 78 41 33 3B
     utf-8:  26 23 78 41 33 3B
    utf-16:  00 26 00 23 00 78 00 41 00 33 00 3B (big endian)
    utf-16:  26 00 23 00 78 00 41 00 33 00 3B 00 (little endian)
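
A rough Python sketch of the same idea, assuming you control the point where
text is turned into bytes. The variable names are mine; the
"xmlcharrefreplace" error handler is part of Python's standard codecs and
emits the decimal form of the reference:

```python
pound = "\u00a3"  # POUND SIGN, U+00A3

# us-ascii has no POUND SIGN, so a plain encode fails.
try:
    pound.encode("ascii")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

# Ask the codec to substitute a character reference instead (decimal form).
decimal_ref = pound.encode("ascii", errors="xmlcharrefreplace")

# Build the hex form by hand; it is pure ASCII, so any
# ASCII-compatible encoding produces the same bytes for it.
hex_ref = f"&#x{ord(pound):X};"
hex_ref_bytes = " ".join(f"{b:02X}" for b in hex_ref.encode("ascii"))

print(direct_ok)
print(decimal_ref)
print(hex_ref, "->", hex_ref_bytes)
```

Keep in mind that a character reference is only meaningful to something that
parses the result as XML (or HTML); anything else just sees six ASCII
characters.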

Now check your XML document. When you look at the document in a text editor, 
it might say 

<?xml version="1.0" encoding="utf-8"?>
                    ^^^^^^^^^^^^^^^^

This encoding declaration is an assertion made by the document as to how its
bytes map to Unicode characters. It is just a hint for the XML parser to use
when reading the document; it is not secret code that causes anything about
the document's *actual* encoding to change. 

If this declaration is missing, UTF-8 or UTF-16 is assumed
(UTF-16 if the document begins with the bytes FF FE or FE FF, UTF-8 otherwise).
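
That rule can be sketched in a few lines of Python. This is a simplification,
not what a real parser does: guess_xml_encoding is a hypothetical helper, it
only looks for the two byte order marks and an ASCII-readable declaration,
and it ignores cases such as UTF-16 without a byte order mark.

```python
import re

# Matches an encoding declaration inside an ASCII-readable XML declaration.
_DECL = re.compile(
    rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)

def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding an XML parser would assume (simplified)."""
    # A byte order mark of FF FE or FE FF signals UTF-16.
    if data.startswith(b"\xff\xfe") or data.startswith(b"\xfe\xff"):
        return "utf-16"
    # Otherwise, believe the encoding declaration if there is one.
    match = _DECL.match(data)
    if match:
        return match.group(1).decode("ascii").lower()
    # No byte order mark and no declaration: UTF-8 is assumed.
    return "utf-8"

print(guess_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(guess_xml_encoding(b"<a/>"))
```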

It is your responsibility to ensure that the encoding declaration is an
accurate reflection of the document's *actual* encoding.
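
To illustrate what happens when the declaration and the bytes disagree, here
is a small Python sketch using the standard library's ElementTree parser. The
element name "price" and the helper try_parse are invented for the example;
your parser's error message will differ.

```python
import xml.etree.ElementTree as ET

# The POUND SIGN stored as the single byte A3, i.e. actually iso-8859-1.
body = "<price>\u00a3</price>".encode("iso-8859-1")

wrong = b'<?xml version="1.0" encoding="utf-8"?>' + body
right = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body

def try_parse(doc: bytes) -> str:
    """Return the root element's text, or a short note if parsing fails."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError as exc:
        return f"parse error: {exc}"

# A lone A3 byte is not valid UTF-8, so the first document is rejected.
print(try_parse(wrong))
# With an accurate declaration, the same bytes parse to U+00A3.
print(try_parse(right))
```

The bytes are identical in both documents; only the assertion about them
changed.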

As you can guess, this is where most people run into problems. They are
passing "text" around in their software without paying attention to whether
and how it has been encoded. So, in order to diagnose encoding-related
problems, you must trace the processes that your data passes through, and
determine how it is encoded/decoded at each step.
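
As a hedged illustration of why each step matters, the Python sketch below
simulates a two-stage pipeline in which one stage writes UTF-8 and the next
wrongly reads it as iso-8859-1. The stage names are hypothetical; I don't
know what your actual pipeline looks like.

```python
pound = "\u00a3"  # POUND SIGN, U+00A3

# Stage 1 (say, the database driver) hands over UTF-8 bytes: C2 A3.
stored = pound.encode("utf-8")

# Stage 2 wrongly assumes iso-8859-1, so each byte becomes one character.
misread = stored.decode("iso-8859-1")

# Stage 3 writes the damaged text out as UTF-8 again: C3 82 C2 A3.
rewritten = misread.encode("utf-8")

# The opposite mistake fails loudly instead of silently:
# a lone A3 byte is not valid UTF-8.
try:
    pound.encode("iso-8859-1").decode("utf-8")
    reverse_ok = True
except UnicodeDecodeError:
    reverse_ok = False

print(len(misread), [f"U+{ord(c):04X}" for c in misread])
print(" ".join(f"{b:02X}" for b in rewritten))
print(reverse_ok)
```

Printing the bytes (not the characters) at each hand-off, as done here, is
usually the quickest way to find the step that got it wrong.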

Also, you didn't say what your problem has to do with XSLT. This is the 
xsl-list. If you have general xml processing questions, ask them on xml-dev.

If you're using XSLT, then you usually only need to be concerned about two
things:

 - the source and stylesheet XML documents must have accurate encoding
   declarations

 - the output encoding, as controlled by <xsl:output encoding="..."/>,
   should be what you want (there is a FAQ regarding invoking MSXML
   from scripts, where the output becomes UTF-16, depending on how
   you capture it)

Good luck.

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

