Re: [xsl] output encoding="iso-8859-1"

Subject: Re: [xsl] output encoding="iso-8859-1"
From: Mike Brown <mike@xxxxxxxx>
Date: Mon, 4 Jun 2001 20:04:27 -0600 (MDT)
Daniel Florian wrote:
> <?xml version="1.0" encoding="utf-8"?>
> <?xml-stylesheet type="text/xsl" href="Untitled2.xsl"?>
> <start>
> á °
> </start>

Everyone else's answers weren't to my satisfaction, so I'm jumping in on 
this one even though it's a few days old.

Your email was iso-8859-1 encoded. In other words, "á" (Latin small letter
a with acute) is byte 0xE1 and "°" (degree sign) is byte 0xB0. I'm
guessing that your original file is iso-8859-1 encoded, too.

Your XML is misdeclaring its encoding. It is an error to say it is utf-8
encoded when it is actually iso-8859-1. The bytes 0xE1 0x20 0xB0 are not a
valid UTF-8 sequence, so the document shouldn't even be parseable XML, but
apparently your parser doesn't care.
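
To make the mismatch concrete, here is a small Python sketch (Python is
only used for illustration; it is not part of the toolchain discussed in
this thread). It encodes the three characters as an iso-8859-1 editor
would, then tries to read the bytes back as strict utf-8. The variable
names are made up for the example.

```python
# The three characters from the quoted XML: U+00E1, U+0020, U+00B0.
text = "\u00e1 \u00b0"

# What an iso-8859-1 editor writes to disk: one byte per character.
raw = text.encode("iso-8859-1")
print(raw.hex(" "))  # e1 20 b0

# What a strict reader does when told those bytes are utf-8.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError as err:
    strict_ok = False
    print("not valid utf-8:", err.reason)
```

A strict decoder refuses the input, which is what a conforming XML parser
would be expected to do with the misdeclared file.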

&#6192; = &#x1830; which is equivalent to the bytes 0xE1 0xA0 0xB0 in 
utf-8. I'd say your parser is being very liberal with its interpretation
of the bytes.
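
One plausible way a parser could arrive at that character, sketched in
Python: treat the three bytes as a single 3-byte utf-8 sequence and mask
off the payload bits without checking that the trailing bytes really are
continuation bytes. The function below is hypothetical and written only
for this illustration; it is an assumption about the parser's behavior,
not a description of its actual code.

```python
def lax_decode_3byte(b: bytes) -> str:
    """Hypothetical lax decoder: reads 3 bytes as one utf-8 sequence
    without verifying that the trailing bytes start with bits 10."""
    lead, mid, last = b
    code_point = ((lead & 0x0F) << 12) | ((mid & 0x3F) << 6) | (last & 0x3F)
    return chr(code_point)

# The iso-8859-1 bytes for "á", space, "°".
ch = lax_decode_3byte(b"\xe1\x20\xb0")
print(ord(ch), hex(ord(ch)))        # 6192 0x1830

# The well-formed utf-8 encoding of the same character, for comparison.
print(ch.encode("utf-8").hex(" "))  # e1 a0 b0
```

Because 0x20 and 0xA0 have the same low six bits, a decoder that skips
the validity check gets the same code point from either byte.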

> What character reference is the &#6192?  This is supposed to be ISO-8859-1
> isn't it?

The 7 characters "&" "#" "6" "1" "9" "2" ";" are encoded in the output 
as their 7 respective iso-8859-1 bytes, as per your xsl:output 
instruction, yes. What "&#6192;" means, however, in the context of an XML 
or HTML document, is the single character known as MONGOLIAN LETTER SA.
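
The same distinction between bytes and characters can be shown with a
short Python sketch. Python's "xmlcharrefreplace" error handler stands in
here for what an XSLT serializer does when the output encoding has no
byte for a character; the variable names are invented for the example.

```python
import html

sa = "\u1830"  # the single character in the result tree

# iso-8859-1 has no byte for this character, so the serializer falls
# back to a numeric character reference made of plain ASCII characters.
out = sa.encode("iso-8859-1", errors="xmlcharrefreplace")
print(out, len(out))  # b'&#6192;' 7

# Reading it back: 7 bytes, 7 characters, then 1 character once the
# reference is expanded by an HTML/XML-aware reader.
restored = html.unescape(out.decode("iso-8859-1"))
print(len(restored))  # 1
```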

>  Then how come I can't seem to find the character code for 6192

Maybe because you weren't looking at The Unicode Standard, or the Letter
Database, or the standard that is referenced by both the XML and HTML
specs: ISO/IEC 10646-1.
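
If a Python interpreter is at hand, its bundled copy of the Unicode
character database is a quick way to look up a decimal character number.
This is just a convenience sketch, not a replacement for the standards
themselves.

```python
import unicodedata

ch = chr(6192)  # decimal number taken from the character reference
print(f"U+{ord(ch):04X}", unicodedata.name(ch))  # U+1830 MONGOLIAN LETTER SA
```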

> And also, what happened to the 2 distinct characters from the
> source xml?

Your 3 characters (including the space in between them) became 3 bytes in
the encoding supported by the editor that made the file. When read back in
by an XML parser under the assumption that utf-8 was the character map
used, and taking into account the fact that your parser is apparently very
forgiving of the illegal byte sequence, the 3 bytes together imply 1
abstract character -- that Mongolian character that you probably won't
find in any font. When this character is copied to the result tree in your
XSL transformation, it retains its identity as a single character. When
the result tree is serialized as iso-8859-1 bytes in HTML syntax, the
character has no iso-8859-1 byte of its own, so it cannot be represented
as anything other than "&#6192;" or "&#x1830;".

   - Mike
mike j. brown, software engineer in denver, colorado, USA
