Subject: Re: [xsl] 8bit ascii encoding
From: David Carlisle <davidc@xxxxxxxxx>
Date: Fri, 23 Aug 2002 15:01:42 +0100
> ha! no wonder I get confused...

It's best to read about encodings on an encoding faq page rather than my
notoriously truncated and badly typed emails, but assuming you still have
faith in the latter....

> If my chars are two bytes each then Im using utf-16, but utf-8 can
> consist of 1-5bytes per char... I think I need to read some more. grrr..

Your char(acters) don't have any bytes, they are just characters (aka
unicode code points in the range hex 0 - 10FFFF). To get those characters
into a machine you need to encode them using some encoding scheme.

Typical schemes (with a revisionist view of history) are

ascii
  encode the first 128 characters of unicode using 7 bits, pad to 8 bits
  by setting the high bit to 0. Forget about all other characters.

latin1 (iso-8859-1)
  encode the first 256 characters of unicode using 8 bits. Forget about
  all other characters.

latin2,3,... 8bit greek, cyrillic, microsoft windows 8bit code pages, etc etc.
  take a subset of 256 unicode characters in some specified order, encode
  them using 8 bits, forget about all other characters.

Encodings with names starting utf are special in that they encode the
whole of unicode rather than a subset.

utf8
  encode the character as a sequence of 1-4 bytes (older definitions
  allowed longer sequences, but nothing up to 10FFFF needs more than
  four), using a simple and funky bit slicing mechanism that has the
  following properties.

  characters below 128 get encoded as a single byte (so match the ascii
  encoding)

  No multi-byte sequence uses bytes below 128, so you always know you are
  part of a multi-byte sequence, as the top bit is set.

  These properties mean that for example a simple search for "<p>" in a
  utf8 file in a "legacy" 8 bit editor or search tool will find (or not)
  the characters "<p>". It will never stumble across some bytes in a
  multi-byte sequence that just happen to look like that.
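The two utf8 properties above can be checked with a small sketch. This is only an illustration in Python, not part of the original thread; the helper names `is_ascii_compatible`, `all_high_bit_set` and `find_bytes`, and the sample strings, are made up for the example.

```python
# Illustrative sketch: check the utf8 properties described above.
# All names and sample strings here are hypothetical.

def is_ascii_compatible(ch):
    """True if a character below 128 encodes to the same single byte as ascii."""
    return ord(ch) < 128 and ch.encode("utf-8") == ch.encode("ascii")

def all_high_bit_set(ch):
    """True if every byte of the utf8 encoding has its top bit set."""
    return all(b >= 0x80 for b in ch.encode("utf-8"))

def find_bytes(text, needle):
    """Byte-level search, as a legacy 8 bit tool would do it.

    Returns the byte offset of the first match, or -1 if there is none.
    """
    return text.encode("utf-8").find(needle.encode("utf-8"))

for ch in ["<", "p", "\u00e9", "\u20ac", "\U00010400"]:
    encoded = ch.encode("utf-8")
    hex_bytes = " ".join(f"{b:02x}" for b in encoded)
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s): {hex_bytes}")

# A byte search for "<p>" only matches a real "<p>", because none of the
# bytes in a multi-byte sequence can be mistaken for an ascii character.
sample = "prix: 10 \u20ac <p>caf\u00e9</p>"
print(find_bytes(sample, "<p>"), sample.find("<p>"))
```

The byte offset and the character offset printed on the last line differ because the euro sign before the match takes more than one byte, but the match itself is found either way.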
  As multi-byte utf8 sequences always consist of a lead byte of the form
  11xxxxxx followed by one or more continuation bytes of the form
  10xxxxxx, there are essentially six bits of information per byte (a
  little less in the lead byte, whose top bits also record the length of
  the sequence), so the number of bytes you need depends on how many bits
  were in the original number.

utf16
  For characters with number less than hex 10000 (which was all of
  unicode until unicode 3.1 came out), encode the character in the
  natural way, taking two bytes (two variants depending on whether you
  put the low byte first or last). Characters with numbers greater than
  FFFF are encoded using a pair of 2 byte slots. So utf16 almost always
  takes 2 bytes but can take 4 per character.

utf32
  just encode the number in the natural way taking four bytes per
  number. Simplest to describe, but rather expensive in terms of space.

> At the moment, Im using an xml output method with ascii encoding, and
> telling IE the encoding is utf-8 (in the meta),

So you are sitting on a time bomb.

I suspect that you would be happiest to use iso-8859-1. As above, this
allows you to use all western european characters through to unicode
number 255 in what you probably consider to be the natural encoding.

David

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
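As a rough check of the byte counts described in the message above, the following Python sketch encodes a few sample characters in each scheme. It is an illustration only; the names `ENCODINGS`, `encoded_length` and `length_table` are hypothetical, and the big-endian codec variants are used so that no byte order mark is counted.

```python
# Illustrative sketch: bytes per character in the encodings discussed above.
# Names are hypothetical; "-be" codecs are used to avoid counting a BOM.

ENCODINGS = ["ascii", "iso-8859-1", "utf-8", "utf-16-be", "utf-32-be"]

def encoded_length(ch, encoding):
    """Number of bytes for one character, or None if the encoding
    cannot represent it (the "forget about all other characters" case)."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None

def length_table(chars):
    """Map each character to its list of lengths, in ENCODINGS order."""
    return {ch: [encoded_length(ch, enc) for enc in ENCODINGS] for ch in chars}

# U+0041, U+00E9, U+20AC, and U+10400 (a character above hex FFFF)
table = length_table(["A", "\u00e9", "\u20ac", "\U00010400"])
for ch, lengths in table.items():
    print(f"U+{ord(ch):04X}", dict(zip(ENCODINGS, lengths)))

# The two utf16 variants differ only in byte order.
print("\u20ac".encode("utf-16-be").hex(), "\u20ac".encode("utf-16-le").hex())
```

A `None` entry means the 7 bit or 8 bit encoding simply has no slot for that character, which is why output declared in one encoding but written in another tends to fail only once such a character turns up.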