Re: [xsl] 8bit ascii encoding

Subject: Re: [xsl] 8bit ascii encoding
From: David Carlisle <davidc@xxxxxxxxx>
Date: Fri, 23 Aug 2002 15:01:42 +0100

> ha! no wonder I get confused...

It's best to read about encodings on an encoding faq page rather than my
notoriously truncated and badly typed emails, but assuming you still
have faith in the latter....

> If my chars are two bytes each then Im using utf-16, but utf-8 can
> consist of 1-5bytes per char... I think I need to read some more.

grrr.. Your char(acters) don't have any bytes; they are just characters
(aka unicode code points in the range hex 0 - 10FFFF).

To get those characters into a machine you need to encode them using
some encoding scheme.
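
A minimal sketch of that distinction, in Python (my choice of language
for illustration, not something from the thread): the string holds a
code point, and bytes only appear once an encoding is picked.

```python
# A character is just a number (a Unicode code point); it has no bytes
# until an encoding scheme is chosen.
text = "é"                      # U+00E9, LATIN SMALL LETTER E WITH ACUTE
code_point = ord(text)          # 0xE9, i.e. 233

# The same character pushed through three different encoding schemes.
as_latin1 = text.encode("iso-8859-1")   # one byte
as_utf8 = text.encode("utf-8")          # two bytes
as_utf16 = text.encode("utf-16-be")     # two bytes, high byte first

print(hex(code_point), as_latin1, as_utf8, as_utf16)
```

The byte strings differ even though the character is the same, which is
why "how many bytes is my character" only has an answer per encoding.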

Typical schemes (with a revisionist view of history) are

ascii.
 encode the first 128 characters of unicode using 7 bits, pad to 8 bits
 by setting the high bit to 0. Forget about all other characters.

latin1 (iso-8859-1)
 encode the first 256 characters of unicode using 8 bits.
 Forget about all other characters.

latin2,3,... 8bit greek, cyrillic, microsoft windows 8bit code pages,etc etc.
 take a subset of 256 unicode characters in some specified order.
 encode them using 8 bits, forget about all other characters.
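
The "forget about all other characters" part can be seen with a small
Python sketch. The helper name try_encode is made up for this example;
the codec names are the ones Python ships with.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or None if the encoding lacks a character."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        return None

ascii_a = try_encode("A", "ascii")               # 7 bits, high bit 0
ascii_e_acute = try_encode("é", "ascii")         # None: outside the first 128
latin1_e_acute = try_encode("é", "iso-8859-1")   # code point 233 as one byte
latin1_alpha = try_encode("α", "iso-8859-1")     # None: U+03B1 is above 255
greek_alpha = try_encode("α", "iso-8859-7")      # a slot in the 8-bit Greek table
greek_e_acute = try_encode("é", "iso-8859-7")    # None: not in that subset

print(ascii_a, ascii_e_acute, latin1_e_acute,
      latin1_alpha, greek_alpha, greek_e_acute)
```

Note that latin1 and the 8-bit Greek encoding both use the byte values
above 127, but for different characters, so the bytes alone do not say
which table was meant.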



Encodings with names starting utf are special in that they encode the
whole of unicode rather than a subset.

utf8
  encode the character as a sequence of 1-4 bytes (the original design
  allowed longer sequences, but nothing up to 10FFFF needs more than
  four), using a simple and funky bit slicing mechanism that has the
  following properties.

  characters below 128 get encoded as a single byte (so match the ascii
  encoding)

  No multi-byte sequence uses bytes below 128, so you always know you
  are part of a multi-byte sequence, as the top bit is set.

  These properties mean that for example a simple search for "<p>"
  in a utf8 file in a "legacy" 8 bit editor or search tool will
  find (or not) the characters "<p>" It will never stumble across some
  bytes in a multi-byte sequence that just happen to look like that.
  
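  Multi-byte utf8 sequences are always of the form
  11xxxxxx 10xxxxxx [10xxxxxx [10xxxxxx]]
  where the number of leading 1 bits in the first byte gives the length
  of the sequence. Each following byte carries six bits of information
  (the first byte carries a few fewer), so the number of bytes you need
  depends on how many bits were in the original number.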
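
The bit slicing can be sketched by hand in Python. This is an
illustration only: the function name utf8_encode is invented here, it
does not reject the surrogate range D800-DFFF, and in practice you
would just call str.encode("utf-8").

```python
def utf8_encode(cp):
    """Encode one code point (0..0x10FFFF) as UTF-8 by slicing its bits."""
    if cp < 0x80:
        return bytes([cp])                        # 0xxxxxxx
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),           # 110xxxxx
                      0x80 | (cp & 0x3F)])        # 10xxxxxx
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),          # 1110xxxx
                      0x80 | ((cp >> 6) & 0x3F),  # 10xxxxxx
                      0x80 | (cp & 0x3F)])        # 10xxxxxx
    return bytes([0xF0 | (cp >> 18),              # 11110xxx
                  0x80 | ((cp >> 12) & 0x3F),     # 10xxxxxx
                  0x80 | ((cp >> 6) & 0x3F),      # 10xxxxxx
                  0x80 | (cp & 0x3F)])            # 10xxxxxx

# Compare the hand-rolled encoder with Python's built-in one.
samples = ["p", "é", "€", "\U0001D11E"]
matches_builtin = all(utf8_encode(ord(ch)) == ch.encode("utf-8")
                      for ch in samples)

# Every byte of a multi-byte sequence has its top bit set ...
multi = "é€\U0001D11E".encode("utf-8")
all_high = all(b >= 0x80 for b in multi)

# ... so a byte-level search for "<p>" can only hit a real "<p>".
found = "é<p>€".encode("utf-8").find(b"<p>")

for ch in samples:
    print(ch, [format(b, "08b") for b in utf8_encode(ord(ch))])
```

Printing the bytes in binary makes the 0xxxxxxx / 11xxxxxx / 10xxxxxx
pattern visible for each sample character.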

utf16
  For characters with number less than hex 10000 (which covered every
  assigned character until unicode 3.1 came out), encode the character in
  the natural way, taking two bytes (two variants depending on whether
  you put the low byte first or last). Characters with numbers greater
  than FFFF are encoded using a pair of 2-byte slots (a surrogate pair).
  So utf16 almost always takes 2 bytes but can take 4 per character.
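
A rough Python sketch of the two cases, including the two byte orders.
The function names are invented for this example and no checking is
done for invalid input such as lone surrogate values.

```python
def utf16_units(cp):
    """Return the 16-bit units for one code point (no validity checks)."""
    if cp < 0x10000:
        return [cp]                      # fits in a single 2-byte slot
    offset = cp - 0x10000                # leaves a 20-bit number
    high = 0xD800 | (offset >> 10)       # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits
    return [high, low]

def utf16_bytes(cp, byteorder):
    """Serialise the units with the given byte order ('big' or 'little')."""
    return b"".join(u.to_bytes(2, byteorder) for u in utf16_units(cp))

e_acute_be = utf16_bytes(0xE9, "big")       # high byte first
e_acute_le = utf16_bytes(0xE9, "little")    # low byte first
clef_units = utf16_units(0x1D11E)           # a character above FFFF
clef_be = utf16_bytes(0x1D11E, "big")

print(e_acute_be, e_acute_le, [hex(u) for u in clef_units], clef_be)
```

The values D800-DFFF are reserved for this purpose, which is how a
decoder can tell a pair from two ordinary 2-byte characters.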
 

utf32
  just encode the number in the natural way taking four bytes per
  number. Simplest to describe, but rather expensive in terms of space.
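
To put numbers on "rather expensive", here is a small Python comparison
of byte counts for a few sample characters. The helper utf32_be is made
up for the example and assumes big-endian order with no byte order mark.

```python
import struct

def utf32_be(cp):
    """Encode one code point as four bytes, most significant byte first."""
    return struct.pack(">I", cp)

# Bytes needed per character as (utf8, utf16, utf32).
sizes = {}
for ch in ["p", "é", "€", "\U0001D11E"]:
    sizes[ch] = (len(ch.encode("utf-8")),
                 len(ch.encode("utf-16-be")),
                 len(ch.encode("utf-32-be")))

euro = utf32_be(0x20AC)

for ch, counts in sizes.items():
    print(ch, counts)
```

For mostly-ascii text such as typical XML markup, utf8 is the compact
choice and utf32 costs four times as much.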



> At the moment, Im using an xml output method with ascii encoding, and
> telling IE the encoding is utf-8 (in the meta),

So you are sitting on a time bomb. I suspect that you would be happiest
to use iso-8859-1, as above. This allows you to use all western european
characters through to unicode number 255 in what you probably consider
to be the natural encoding.
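
One way to read the "time bomb" remark, sketched in Python under my own
assumptions about the setup (the sample word and the character
reference are illustrative, not taken from the original stylesheet):
output that really is pure ascii decodes the same under either label,
but the first raw 8-bit byte that sneaks in breaks a utf-8 label.

```python
text = "café"

# Pure ascii output: the accented letter is written as a character
# reference, so every byte is below 128.
ascii_only = "caf&#233;".encode("ascii")

# The same word written out as raw iso-8859-1 bytes.
latin1_bytes = text.encode("iso-8859-1")

# Pure ascii bytes read identically whichever label the browser is given.
same_under_both = ascii_only.decode("utf-8") == ascii_only.decode("iso-8859-1")

# Raw latin1 bytes labelled as utf-8 do not decode at all.
try:
    latin1_bytes.decode("utf-8")
    mislabelled_ok = True
except UnicodeDecodeError:
    mislabelled_ok = False

# With a matching label the round trip is clean.
roundtrip = latin1_bytes.decode("iso-8859-1") == text

print(same_under_both, mislabelled_ok, roundtrip)
```

Whatever encoding is chosen, the declared encoding and the bytes
actually written need to agree.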


David


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

