Subject: Re: [xsl] xml invalid characters
From: Mike Brown <mike@xxxxxxxx>
Date: Fri, 22 Mar 2002 16:08:11 -0700 (MST)
stevenson wrote:
> How can I avoid these problem. The data is from the database, and the
> character crashing it is £
You probably have an encoding problem. I assume that you're having trouble
with the British currency symbol for a Pound? At least, that's what it looks
like on my screen.
Quick lesson:
The POUND SIGN is character number A3 (hex) in Unicode. "U+00A3" is how you
can write it unambiguously in prose.
Encoding provides a way of representing that A3 as bytes.
iso-8859-1:  A3
     utf-8:  C2 A3
    utf-16:  00 A3 (big endian)
             A3 00 (little endian)
utf-8 and utf-16 can represent any Unicode character, but other encodings are 
more limited; single-byte encodings can represent at most 256 characters.
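If you want to check these byte sequences yourself, here is a minimal Python
sketch. The helper name hex_bytes is just something I made up for the
illustration; the codec names are the ones Python's standard library uses.

```python
def hex_bytes(text: str, encoding: str) -> str:
    """Encode text and return the resulting bytes as spaced uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


pound = "\u00a3"  # POUND SIGN, U+00A3

# utf-16-be / utf-16-le pick the byte order explicitly and add no byte order mark.
for name in ("iso-8859-1", "utf-8", "utf-16-be", "utf-16-le"):
    print(f"{name:>10}:  {hex_bytes(pound, name)}")
```

Same character every time; only the bytes differ.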
If a character cannot be represented in a particular encoding, you write it as
a sequence of characters that can be represented in any encoding (spaces added
for clarity):
   & # x A 3 ;    or    & # 1 6 3 ;
For example, us-ascii does not have POUND SIGN (this may be the source of your 
problem; it's hard to say, without knowing all the stages of processing of 
your data, and the role Cold Fusion plays in it). So you'd have to use this 
escaped format.
             &  #  x  A  3  ;
  us-ascii:  26 23 78 41 33 3B
And this escaped format (a "character reference") also works just as well in 
other encodings:
iso-8859-1:  26 23 78 41 33 3B
     utf-8:  26 23 78 41 33 3B
    utf-16:  00 26 00 23 00 78 00 41 00 33 00 3B (big endian)
    utf-16:  26 00 23 00 78 00 41 00 33 00 3B 00 (little endian)
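As a rough illustration of the character reference idea, the Python sketch
below shows that us-ascii cannot encode U+00A3 directly, that the reference
consists only of ASCII characters, and that an XML parser turns it back into
the real character. The <price> element and the hex_bytes helper are made up
for the example. Note that Python's xmlcharrefreplace handler emits the
decimal form of the reference.

```python
import xml.etree.ElementTree as ET


def hex_bytes(data: bytes) -> str:
    """Return bytes as spaced uppercase hex."""
    return " ".join(f"{b:02X}" for b in data)


pound = "\u00a3"  # POUND SIGN, U+00A3

# us-ascii has no byte for U+00A3, so a plain encode fails.
try:
    pound.encode("ascii")
except UnicodeEncodeError:
    print("us-ascii cannot represent U+00A3 directly")

# Let the codec substitute a (decimal) character reference instead.
escaped = pound.encode("ascii", errors="xmlcharrefreplace")
print(escaped)

# The hex form of the reference, as bytes in two encodings.
reference = "&#xA3;"
print(hex_bytes(reference.encode("ascii")))
print(hex_bytes(reference.encode("utf-16-be")))

# A parser resolves the reference back to the character it stands for.
element = ET.fromstring(b"<price>&#xA3;10</price>")
print(element.text == "\u00a310")
```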
Now check your XML document. When you look at the document in a text editor, 
it might say 
<?xml version="1.0" encoding="utf-8"?>
                    ^^^^^^^^^^^^^^^^
This encoding declaration is an assertion made by the document as to how its
bytes map to Unicode characters. It is just a hint for the XML parser to use
when reading the document; it is not secret code that causes anything about
the document's *actual* encoding to change. 
If this declaration is missing, the parser assumes UTF-8 or UTF-16
(UTF-8 unless the document begins with the bytes FF FE or FE FF).
It is your responsibility to ensure that the encoding declaration is an
accurate reflection of the document's *actual* encoding.
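Here is a small Python sketch of what happens when the declaration and the
actual bytes disagree. It uses the standard library's ElementTree parser; the
<price> document and the parses_ok helper are invented for the example, and
other parsers will word their errors differently.

```python
import xml.etree.ElementTree as ET


def parses_ok(data: bytes) -> bool:
    """Return True if the bytes parse as well-formed XML, False otherwise."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


body = "<price>\u00a3</price>"

# Declared iso-8859-1, actually iso-8859-1: the declaration is accurate.
good = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body.encode("iso-8859-1")

# Declared utf-8, actually iso-8859-1: a lone A3 byte is not valid UTF-8.
bad = b'<?xml version="1.0" encoding="utf-8"?>' + body.encode("iso-8859-1")

print(parses_ok(good))
print(parses_ok(bad))
```

The second document is byte-for-byte the same after the declaration; only
the claim about its encoding is wrong, and that is enough to make the parser
reject it.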
As you can guess, this is where most people run into problems. They are
passing "text" around in their software without paying attention to whether
and how it has been encoded. So, in order to diagnose encoding-related
problems, you must trace the processes that your data passes through, and
determine how it is encoded/decoded at each step.
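To give a feel for what a mismatch at one step looks like, here is a Python
sketch of the two typical failure modes. It is only an illustration; I don't
know which of these, if either, is happening in your pipeline.

```python
pound = "\u00a3"  # POUND SIGN, U+00A3

# Step A writes utf-8, step B reads the bytes as iso-8859-1:
# no error, but one character silently becomes two (U+00C2 then U+00A3).
utf8_bytes = pound.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")
print(len(misread), [f"U+{ord(c):04X}" for c in misread])

# Step A writes iso-8859-1, step B reads the bytes as utf-8:
# the lone A3 byte is not valid UTF-8, so decoding fails outright.
latin1_bytes = pound.encode("iso-8859-1")
try:
    latin1_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```

The first case is the nastier one, because nothing complains until someone
looks at the output.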
Also, you didn't say what your problem has to do with XSLT. This is the 
xsl-list. If you have general xml processing questions, ask them on xml-dev.
If you're using XSLT, then you usually only need to be concerned about two
things:
 - the source and stylesheet XML documents must have accurate encoding 
   declarations
 - the output encoding, as controlled by <xsl:output encoding="..."/>,
   should be what you wanted (there is a FAQ regarding invoking MSXML
   from scripts, where the output becomes UTF-16, depending on how
   you capture it)
Good luck.
   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/
 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list