Subject: Re: [xsl] output encoding="iso-8859-1"
From: Mike Brown <mike@xxxxxxxx>
Date: Mon, 4 Jun 2001 20:04:27 -0600 (MDT)
Daniel Florian wrote:
> <?xml version="1.0" encoding="utf-8"?>
> <?xml-stylesheet type="text/xsl" href="Untitled2.xsl"?>
> <start>
> á °
> </start>

Everyone else's answers weren't to my satisfaction, so I'm jumping in on
this one even though it's a few days old.

Your email was iso-8859-1 encoded. In other words, "á" (Latin small
letter a with acute) is byte 0xE1 and "°" (degree sign) is byte 0xB0.
I'm guessing that your original file is iso-8859-1 encoded, too.

Your XML is misdeclaring its encoding. It is an error to say it is utf-8
encoded when it is actually iso-8859-1. The bytes 0xE1 0x20 0xB0 work out
to an invalid UTF-8 sequence and it shouldn't even be parseable XML, but
apparently your parser doesn't care.

&#6192; = &#x1830;, which in well-formed utf-8 is the bytes
0xE1 0xA0 0xB0 (note 0xA0 where your file has 0x20). I'd say your parser
is being very liberal with its interpretation of the bytes.

> What character reference is the &#6192;? This is supposed to be ISO-8859-1
> isn't it?

The 7 characters "&" "#" "6" "1" "9" "2" ";" are encoded in the output as
their 7 respective iso-8859-1 bytes, as per your xsl:output instruction,
yes. What "&#6192;" means, however, in the context of an XML or HTML
document, is the single character known as MONGOLIAN LETTER SA.

> Then how come I can't seem to find the character code for 6192

Maybe because you weren't looking at The Unicode Standard at unicode.org,
or the Letter Database at http://www.eki.ee/letter/, or at the standard
that is referenced by both the XML and HTML specs: ISO/IEC 10646-1.

> And also, what happened to the 2 distinct characters from the
> source xml?

Your 3 characters (including the space in between them) became 3 bytes in
the encoding supported by the editor that made the file.
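
To make the byte arithmetic concrete, here is a rough Python sketch. It
is only an illustration: the function names are made up for this example,
and lenient_three_byte() is a guess at what a forgiving parser might be
doing (masking out the payload bits without checking that the trailing
bytes are real continuation bytes), not a description of any particular
parser.

```python
import unicodedata
from typing import Optional

# The three characters from the source file, written out the way an
# iso-8859-1 editor would store them: 0xE1 (a-acute), 0x20 (space),
# 0xB0 (degree sign).
raw = "á °".encode("iso-8859-1")


def strict_utf8(data: bytes) -> Optional[str]:
    """Decode as UTF-8 the way a conforming parser should; None on error."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def lenient_three_byte(data: bytes) -> str:
    """Hypothetical forgiving decoder (an assumption, for illustration).

    Treats exactly three bytes as one 1110xxxx 10xxxxxx 10xxxxxx sequence
    and keeps only the payload bits, without verifying the 10 prefix.
    """
    b0, b1, b2 = data
    code_point = ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)
    return chr(code_point)


# A strict decoder rejects the sequence: 0x20 is not a continuation byte.
print(strict_utf8(raw))

# The forgiving reading collapses the three bytes into one character.
ch = lenient_three_byte(raw)
print(hex(ord(ch)), ord(ch), unicodedata.name(ch))

# Correctly encoded, that character would have been these UTF-8 bytes.
print(ch.encode("utf-8"))

# Serializing it as iso-8859-1 leaves no option but a character reference.
print(ch.encode("iso-8859-1", errors="xmlcharrefreplace"))
```

The last line is the same thing the serializer does for
xsl:output encoding="iso-8859-1": the character has no byte of its own in
that encoding, so it is written as a numeric character reference.
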
When read back in by an XML parser under the assumption that utf-8 was
the character map used, and taking into account the fact that your parser
is apparently very forgiving of the illegal byte sequence, the 3 bytes
together imply 1 abstract character -- that Mongolian character that you
probably won't find in any font.

When this character is copied to the result tree in your XSL
transformation, it retains its identity as a single character. When the
result tree is serialized as iso-8859-1 bytes and the HTML syntax, it is
impossible to represent this character as anything other than "&#6192;"
or "&#x1830;"

   - Mike
_____________________________________________________________________________
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list