RE: MSXML and Encoding

Subject: RE: MSXML and Encoding
From: Ian Brockbank <ian@xxxxxxxxxxxxxx>
Date: Wed, 8 Sep 1999 16:19:58 +0100
> UTF-8 characters are between 1 and 3 bytes long, mapping
> approximately as follows (it's a while since I did this, and this is
> from memory, so apologies if I've not got it exactly right, but it's
> similar ).
> 
> UCS-2 char		UTF-8 mapping (binary)
> -------------	----------------------
> 0x0000-0x007f	0nnnnnnn
> 0x0080-0x07ff	110nnnnn 10nnnnnn
> 0x0800-0xffff	1110nnnn 10nnnnnn 10nnnnnn
> 
> where nnnnn... are the bits which build up the UCS-2 value.
> 
> Note:
> You can tell what type of byte you have from the first 1-4 bits
> 0 - single-byte
> 10 - continuation
> 110 - 2-byte
> 1110 - 3-byte

Which means you can easily find the next character boundary -
scan forward to the next byte starting with 0, 110 or 1110 (ie any
byte that does not start with 10).
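
A minimal Python sketch of that boundary search, in case it helps. The
names byte_kind and next_boundary are made up for illustration, and it
only knows about the 1-3 byte forms in the table above:

```python
def byte_kind(b: int) -> str:
    """Classify one byte by its leading bits."""
    if b & 0x80 == 0x00:      # 0nnnnnnn
        return "single"
    if b & 0xC0 == 0x80:      # 10nnnnnn
        return "continuation"
    if b & 0xE0 == 0xC0:      # 110nnnnn
        return "lead2"
    if b & 0xF0 == 0xE0:      # 1110nnnn
        return "lead3"
    return "other"            # not one of the forms discussed here


def next_boundary(data: bytes, i: int) -> int:
    """Return the first index >= i that is not a continuation byte.

    Returns len(data) if only continuation bytes remain.
    """
    while i < len(data) and byte_kind(data[i]) == "continuation":
        i += 1
    return i


# 'a', then a 2-byte character, then a 3-byte character
data = b"a\xc3\xa9\xe2\x82\xac"
print(next_boundary(data, 2))  # 3
```

Index 2 is the continuation byte 0xa9, so the search moves on to the
lead byte 0xe2 at index 3.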

> This means that (eg) é (0xe9 => binary 11101001) is interpreted as
> the start of a 3-byte character in the range 0x9000-0x9fff.

The UTF-8 encoding for é is

	0080-07ff ->
		110nnnnn 10nnnnnn

where nnnnn nnnnnn are 00011 101001 (0xe9 padded to 11 bits) ie

		11000011 10101001

or 0xc3 0xa9, which shows up as Ã© if the bytes are read as Latin-1.
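
The same bit shuffling as a rough Python sketch. The function name
utf8_encode_ucs2 is hypothetical, and it only covers 16-bit values
without rejecting surrogates the way a real encoder would:

```python
def utf8_encode_ucs2(cp: int) -> bytes:
    """Encode one 16-bit value using the 1-3 byte UTF-8 forms."""
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("not a 16-bit value")
    if cp <= 0x7F:
        # 0nnnnnnn
        return bytes([cp])
    if cp <= 0x7FF:
        # 110nnnnn 10nnnnnn
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 1110nnnn 10nnnnnn 10nnnnnn
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


encoded = utf8_encode_ucs2(0xE9)
print(encoded.hex())                            # c3a9
print(encoded == chr(0xE9).encode("utf-8"))     # True
```

The second print compares the hand-rolled result against Python's own
UTF-8 encoder as a sanity check.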

HTH,

Ian
--
Ian Brockbank, Indigo Active Vision Systems, The Edinburgh Technopole,
Bush Loan, Edinburgh EH26 0PJ   Tel: 0131-475-7234  Fax: 0131-475-7201
work: ian@xxxxxxxxxxxxxx           personal: Ian.Brockbank@xxxxxxxxxxx
web: ScottishDance@xxxxxxxxxxx           http://www.scottishdance.net/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

