Subject: RE: [xsl] output encoding="iso-8859-1"
From: Daniel Florian <DFlorian@xxxxxxxxxxxxx>
Date: Tue, 5 Jun 2001 08:38:05 -0400
Thanks very much for the detailed answer, Mike; this is all starting to make sense. A couple of points made this stuff easier to understand. Forgive me if this is stating the obvious, but it took me a while to synthesize this... the info is scattered all over the place.

1) The "&#xxx;" notation in XML and HTML files is a character reference, which refers to the decimal value (or, in the "&#xHHH;" form, the hexadecimal value) of the character in the Unicode character set. This is entirely separate from the encoding scheme that the document declares. If the declared encoding is ISO-8859-1, these character references still refer to Unicode character values.

2) The encoding declaration is supposed to state the actual byte encoding of the doc. That's all.

3) It is non-trivial to manage content with extended characters across a number of different applications and operating systems... Clearly, in my case, strange stuff happened to the bytes during the "cut and paste" process, and I am also not sure the apps I was using to view the content were able to make sense of the UTF-8 multibyte characters anyway. Rather than assume this will work, you really need to consider each application individually.

My conclusion for now is that the safest way to manage content with international characters is to use the character references discussed in #1. Unfortunately, this won't result in a WYSIWYG editing system, but that's a small price to pay for increased portability of the content across all kinds of editors and OSes.
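To make point 1 concrete, here is a minimal sketch in Python 3 (my choice of language, not something used in this thread). It assumes the two characters from the sample file, and uses only the standard `str.encode` error handler `xmlcharrefreplace` and `html.unescape`. It shows the same characters as two different byte encodings, and as character references that do not depend on either one.

```python
import html

text = "\u00e1 \u00b0"  # a-acute, space, degree sign

# The same three characters in two different byte encodings.
latin1_bytes = text.encode("iso-8859-1")  # one byte per character
utf8_bytes = text.encode("utf-8")         # two bytes each for the non-ASCII ones

# Character references name Unicode code points, so the result is the same
# whatever encoding the document goes on to declare.
ascii_safe = text.encode("ascii", errors="xmlcharrefreplace")

# Decimal and hexadecimal references to the same code point are equivalent.
same_char = html.unescape("&#225;") == html.unescape("&#xE1;") == "\u00e1"

print(latin1_bytes, utf8_bytes, ascii_safe, same_char)
```

Because `ascii_safe` contains only ASCII bytes, it survives being saved or pasted as ISO-8859-1, UTF-8, or plain ASCII, which is the portability argument made above.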
I'm sure I'll hear if some of this isn't accurate.

Thanks,
-Dan

-----Original Message-----
From: Mike Brown [mailto:mike@xxxxxxxx]
Sent: Monday, June 04, 2001 10:04 PM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] output encoding="iso-8859-1"

Daniel Florian wrote:
> <?xml version="1.0" encoding="utf-8"?>
> <?xml-stylesheet type="text/xsl" href="Untitled2.xsl"?>
> <start>
> á °
> </start>

Everyone else's answers weren't to my satisfaction, so I'm jumping in on this one even though it's a few days old.

Your email was iso-8859-1 encoded. In other words, "á" (Latin small letter a with acute) is byte 0xE1 and "°" (degree sign) is byte 0xB0. I'm guessing that your original file is iso-8859-1 encoded, too.

Your XML is misdeclaring its encoding. It is an error to say it is utf-8 encoded when it is actually iso-8859-1. The bytes 0xE1 0x20 0xB0 work out to an invalid UTF-8 sequence and it shouldn't even be parseable XML, but apparently your parser doesn't care.

&#6192; = &#x1830; which is equivalent to the bytes 0xE1 0xA0 0xB0 in utf-8. I'd say your parser is being very liberal with its interpretation of the bytes.

> What character reference is the &#6192;? This is supposed to be ISO-8859-1
> isn't it?

The 7 characters "&" "#" "6" "1" "9" "2" ";" are encoded in the output as their 7 respective iso-8859-1 bytes, as per your xsl:output instruction, yes. What "&#6192;" means, however, in the context of an XML or HTML document, is the single character known as MONGOLIAN LETTER SA.

> Then how come I can't seem to find the character code for 6192

Maybe because you weren't looking at The Unicode Standard at unicode.org, or the Letter Database at http://www.eki.ee/letter/, or at the standard that is referenced by both the XML and HTML specs: ISO/IEC 10646-1.

> And also, what happened to the 2 distinct characters from the
> source xml?

Your 3 characters (including the space in between them) became 3 bytes in the encoding supported by the editor that made the file.
When read back in by an XML parser under the assumption that utf-8 was the character map used, and taking into account the fact that your parser is apparently very forgiving of the illegal byte sequence, the 3 bytes together imply 1 abstract character -- that Mongolian character that you probably won't find in any font.

When this character is copied to the result tree in your XSL transformation, it retains its identity as a single character. When the result tree is serialized as iso-8859-1 bytes and the HTML syntax, it is impossible to represent this character as anything other than "&#6192;" or "&#x1830;".

- Mike
_____________________________________________________________________________
mike j. brown, software engineer at | xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA   | personal: http://hyperreal.org/~mike/

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
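The round trip described in the reply can be sketched in Python 3 (again my choice of language). The strict decode is standard behaviour; `lenient_three_byte` is a hypothetical stand-in for whatever the forgiving parser did, which the thread does not establish. It is only a guess that happens to reproduce the observed output: it takes the payload bits of each byte without checking that bytes 2 and 3 are real continuation bytes.

```python
import unicodedata

# The sample text as an iso-8859-1 editor would save it: bytes E1 20 B0.
raw = "\u00e1 \u00b0".encode("iso-8859-1")

# A conforming UTF-8 decoder rejects this sequence, because 0x20 cannot
# follow the lead byte 0xE1 as a continuation byte.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

def lenient_three_byte(b: bytes) -> str:
    """Hypothetical forgiving decode: read b as one 3-byte UTF-8 sequence,
    masking out the payload bits without validating the continuation bytes."""
    code_point = ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F)
    return chr(code_point)

ch = lenient_three_byte(raw)

# iso-8859-1 has no byte for this character, so serializing to that encoding
# has to fall back to a character reference.
serialized = ch.encode("iso-8859-1", errors="xmlcharrefreplace")

print(strict_ok, hex(ord(ch)), unicodedata.name(ch), serialized)
```

Under that assumption the three input bytes collapse into the single code point U+1830, whose well-formed UTF-8 form is E1 A0 B0, and the iso-8859-1 output can only carry it as a numeric reference.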