Subject: Re: [xsl] 8bit ascii encoding
From: Mike Brown <mike@xxxxxxxx>
Date: Fri, 23 Aug 2002 09:43:31 -0600 (MDT)

David Carlisle wrote:
> > If my chars are two bytes each then Im using utf-16, but utf-8 can
> > consist of 1-5bytes per char... I think I need to read some more.
>
> grrr.. Your char(acters) don't have any bytes they are just characters
> (aka unicode code points in the range hex 0 - 10FFFF).

(I'll just supplement what you said, for Andrew's benefit)

i.e., a character is abstract. It is "the idea of the Latin script's letter e with acute accent", not an actual glyph of an e with an accent that looks like a short forward slash, and not a particular byte (or series of bytes). Unicode just gives these ideas names and numbers, e.g. (hex) E9 = LATIN SMALL LETTER E WITH ACUTE.

Encoding schemes like us-ascii, iso-8859-1, utf-8, utf-16 are what give you representations of these numbers (and hence of the unambiguous idea of the characters) as bits/bytes in specific sequences. It is when these bytes are decoded and interpreted to produce visual representations of characters on your computer's display device that you run into the majority of your problems.

Also, I'd point out that in Java the 'char' datatype in theory represents a Unicode character by code point (just a number in the above range), but in fact it is implemented as a utf-16 code value (and endianness depends on the underlying platform, I believe). In C I believe you have a choice of what char means, but typically it's a similar situation (someone C-literate can clarify) ... so saying 'char' can be misleading sometimes.

> utf8
> encode the character as a sequence of 1-5 bytes, using a simple and
> funky bit slicing mechanism that has the following properties.

This little conversion chart can help you visualize it without getting into the nitty-gritty of bit slicing: http://skew.org/xml/cumped/

A very common problem that people run into is that they look at UTF-8 encoded HTML through an editor, browser, or terminal window that is unaware that the encoding is UTF-8.
The software is making the assumption that the bytes it is interpreting are iso-8859-1 or windows-1252 encoded (or some such; in any case it's just wrong). The user sees all the ASCII characters just fine, but an "accented" character or anything else in Unicode above code point 127 shows up as two or more characters in the editor/browser/terminal.

In the case of an editor/terminal you just need to get a smarter editor or learn to live with the fact that é (Unicode xE9) is going to look like Ã© because the UTF-8 bytes for that are C3 A9. For a browser looking at HTML you need to make sure there is an appropriate META tag in the document head, or else make sure the document is being served with the right header, Content-Type: text/html;charset=utf-8. And then you also have to make sure the browser has been configured to honor this info; so many pages have misdeclared encodings that the browser makers have had to let the user force the assumed encoding.

> > At the moment, Im using an xml output method with ascii encoding, and
> > telling IE the encoding is utf-8 (in the meta),
>
> So you are sitting on a time bomb

Nah. In theory that shouldn't present a problem. If the output really is ascii (one 8-bit byte per character, and the high bit is always 0), the browser can safely (though wrongly) assume any encoding that's an ascii superset, which is pretty much anything except utf-16. Technically, though, it is a misdeclaration, and XML 1.0 says that's a fatal error.

I suspect his real problem is that his output is not really ascii, it's utf-16, and that it's the recurring FAQ about MSXML when you buffer the output in a string rather than a (document?) object. Either that or it's utf-8 and his browser is ignoring the meta.

-Mike

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
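
For anyone who wants to see the bytes for themselves, here is a minimal sketch in Python. It is not from the original thread, and the helper names (hex_bytes, misread_as_latin1) are made up for illustration. It takes the one abstract character discussed above (code point hex E9), shows the different byte sequences each encoding produces for it, and then reproduces the "two characters instead of one" symptom by decoding UTF-8 bytes as if they were iso-8859-1.

```python
# Illustrative sketch only; helper names are hypothetical, not from the thread.

E_ACUTE = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, code point hex E9


def hex_bytes(text: str, encoding: str) -> str:
    """Return the bytes of `text` in `encoding` as space-separated hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))


def misread_as_latin1(text: str) -> str:
    """Encode as UTF-8, then decode those same bytes as iso-8859-1.

    This mimics an editor or browser that wrongly assumes iso-8859-1.
    """
    return text.encode("utf-8").decode("iso-8859-1")


if __name__ == "__main__":
    # One abstract character, one code point...
    print(f"code point: {ord(E_ACUTE):X}")

    # ...but a different byte sequence under each encoding.
    for enc in ("iso-8859-1", "utf-8", "utf-16-be", "utf-16-le"):
        print(f"{enc:>10}: {hex_bytes(E_ACUTE, enc)}")

    # us-ascii has no representation for this character at all.
    try:
        E_ACUTE.encode("us-ascii")
    except UnicodeEncodeError:
        print("  us-ascii: cannot encode")

    # The wrong guess turns one character into two (they display as Ã©).
    garbled = misread_as_latin1(E_ACUTE)
    print([f"U+{ord(c):04X}" for c in garbled])

    # Pure ASCII text yields the same bytes in any ASCII-superset encoding,
    # which is why declaring utf-8 for real ascii output is harmless.
    same = hex_bytes("abc", "us-ascii") == hex_bytes("abc", "utf-8")
    print(f"ascii bytes identical under utf-8: {same}")
```

Code points are printed instead of the garbled characters themselves so the output does not depend on how the terminal decodes it, which is exactly the trap described above.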