RE: MSXML and Encoding

Subject: RE: MSXML and Encoding
From: Ian Brockbank <ian@xxxxxxxxxxxxxx>
Date: Wed, 8 Sep 1999 16:03:24 +0100
Hi Steven,

> Special European Characters don't seem to work for UTF-8 (at 
> least for the MSXML parser). I had a look at the W3C doc and tried
> the UTF-16 as they said it should be supported, but at the start
> of parsing it said the encoding is not supported.

Special European characters are part of UTF-8, but not as single bytes.
Only ASCII (0x00-0x7f) is encoded as the character itself; anything
above that, such as é, becomes a multi-byte sequence.
 
> I have read a bit now on the UTF-8 and UTF-16 explanations as 
> my knowledge of them isn't great. Does anybody have a few sentences
> to explain these ? - I am going to look at some stuff at unicode.org
> as well.

For UCS-2 values, UTF-8 characters are between 1 and 3 bytes long,
mapping as follows (the UTF-8 column is written in binary).

UCS-2 char		UTF-8 mapping
-------------		-------------
0x0000-0x007f		0nnnnnnn
0x0080-0x07ff		110nnnnn 10nnnnnn
0x0800-0xffff		1110nnnn 10nnnnnn 10nnnnnn

where nnnnn... are the bits which build up the UCS-2 value.
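
As a rough illustration, the mapping can be sketched in Python. The
function name utf8_encode_bmp is made up for this example, and it only
handles UCS-2 values. The boundaries 0x7f and 0x7ff come from the 7 and
11 payload bits available in the 1-byte and 2-byte forms. Surrogate
values (0xd800-0xdfff) are not checked, so this is a sketch, not a
complete encoder.

```python
def utf8_encode_bmp(cp):
    """Encode one UCS-2 value (0x0000-0xffff) as UTF-8 bytes.

    Hypothetical helper for illustration; surrogates are not rejected.
    """
    if not 0 <= cp <= 0xFFFF:
        raise ValueError("only UCS-2 values are handled in this sketch")
    if cp <= 0x7F:
        # 0nnnnnnn: ASCII is stored as the character itself
        return bytes([cp])
    if cp <= 0x7FF:
        # 110nnnnn 10nnnnnn: 5 + 6 payload bits
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    # 1110nnnn 10nnnnnn 10nnnnnn: 4 + 6 + 6 payload bits
    return bytes([
        0xE0 | (cp >> 12),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

For example, utf8_encode_bmp(0xE9) gives the two bytes 0xc3 0xa9, which
is what Python's own "é".encode("utf-8") produces.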

Note:
You can tell what type of byte you have from its first 1-4 bits:
0 - single-byte
10 - continuation
110 - 2-byte
1110 - 3-byte
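
The bit tests above could be written like this in Python. The name
byte_kind is invented for the example, and it only covers the patterns
listed here (anything else is reported as "other").

```python
def byte_kind(b):
    """Classify one byte value (0-255) by its leading bits."""
    if b & 0x80 == 0x00:   # 0xxxxxxx
        return "single-byte"
    if b & 0xC0 == 0x80:   # 10xxxxxx
        return "continuation"
    if b & 0xE0 == 0xC0:   # 110xxxxx
        return "2-byte lead"
    if b & 0xF0 == 0xE0:   # 1110xxxx
        return "3-byte lead"
    return "other"         # patterns not covered in this note
```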

This means that (e.g.) é stored as a single Latin-1 byte (0xe9, which
is 11101001 in binary) is interpreted as the start of a 3-byte
character in the range 0x9000-0x9fff.
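
A quick way to see this, sketched with Python's built-in codecs. The
helper decodes_as_utf8 and the variable names are made up for the
example, and the input is assumed to be a single ISO-8859-1 byte.

```python
raw = b"\xe9"  # assumed input: é as one ISO-8859-1 byte


def decodes_as_utf8(data):
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone 0xe9 is a 3-byte lead with no continuation bytes, so it is
# not valid UTF-8 on its own.
lone_ok = decodes_as_utf8(raw)

# Read as ISO-8859-1 and re-encoded, the same character takes two bytes.
as_utf8 = raw.decode("iso-8859-1").encode("utf-8")

# With two continuation bytes, 0xe9 starts a character in 0x9000-0x9fff.
three_byte = b"\xe9\x80\x80".decode("utf-8")
```

Here lone_ok is False, as_utf8 is the byte pair 0xc3 0xa9, and
three_byte is the single character 0x9000.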

HTH,

Ian 
--
Ian Brockbank, Indigo Active Vision Systems, The Edinburgh Technopole,
Bush Loan, Edinburgh EH26 0PJ   Tel: 0131-475-7234  Fax: 0131-475-7201
work: ian@xxxxxxxxxxxxxx           personal: Ian.Brockbank@xxxxxxxxxxx
web: ScottishDance@xxxxxxxxxxx           http://www.scottishdance.net/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

