Re: [xsl] CJK UTF-16 test

Subject: Re: [xsl] CJK UTF-16 test
From: Mike Brown <mike@xxxxxxxx>
Date: Wed, 28 Mar 2001 21:35:34 -0700 (MST)
Benjamin Franz wrote:
> XML does NOT support UTF-16 since UTF-16 includes the surrogates

Wow, strike that from the archives, because it's dead wrong.

XML is specified in terms of sequences of allowable ISO/IEC 10646-1
characters, not particular binary-encoded representations of those
characters.

>    [2]   Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
>                   [#x10000-#x10FFFF]

These are characters (code points), not UTF-16 code units.
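
To make that concrete, here is a minimal sketch of production [2] as a test on code point values. The function name is_xml_char is my own invention for illustration; it is not from the XML spec or from any library. Note that it takes a code point (an integer), not bytes or 16-bit code units.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches production [2] Char of XML 1.0.

    is_xml_char is a hypothetical helper name, used here only for illustration.
    """
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # everything below the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # everything above it, up to #xFFFD
        or 0x10000 <= cp <= 0x10FFFF   # the supplementary planes
    )
```

With this, is_xml_char(0x10000) is true and is_xml_char(0xD800) is false, which is the distinction the rest of this message is about.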

In ISO/IEC 10646-1 and Unicode _there is no character_ at code point 0xD800.

And in a UTF-16 encoded document, the bit sequence that I would write in hex
as D800 (big endian) or 00D8 (little) is not a character. The *sequence*
D800 DC00 (big) represents character #x10000, which I write here using the
same notation as the EBNF excerpt you quoted from the XML spec.
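
The arithmetic behind that can be sketched as follows, assuming the standard UTF-16 surrogate ranges (D800-DBFF for the first half, DC00-DFFF for the second). The name decode_surrogate_pair is made up for this example.

```python
def decode_surrogate_pair(high: int, low: int) -> int:
    """Combine two UTF-16 code units into the code point they represent.

    decode_surrogate_pair is a hypothetical helper name for illustration.
    """
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError("first code unit is not a high surrogate")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError("second code unit is not a low surrogate")
    # Each half contributes 10 bits; the result is offset by 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# The pair from the text above maps to #x10000.
assert decode_surrogate_pair(0xD800, 0xDC00) == 0x10000

# Cross-check against Python's built-in codecs, in both byte orders.
assert bytes.fromhex("D800DC00").decode("utf-16-be") == chr(0x10000)
assert bytes.fromhex("00D800DC").decode("utf-16-le") == chr(0x10000)
```

Neither half yields a character on its own; only the pair does.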

If you were to say that an XML document can contain a "character" #xD800 then
you would
  a.) be in violation of the definition of a character as being what
      ISO/IEC 10646-1 (which XML relies on) says it is, and
  b.) have no way of representing that character in a UTF-16 encoded
      document, because by definition, D800 in UTF-16 is the first half
      of a surrogate pair, not a character...
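
Point b.) can be checked with any strict UTF-16 codec. As a rough sketch, Python's built-in "utf-16-be" codec refuses both to encode a lone #xD800 and to decode a D800 that is not followed by a second-half surrogate. The function name below is hypothetical, and this only shows how one codec behaves; it is not a statement about every implementation.

```python
def lone_surrogate_is_rejected() -> bool:
    """Return True if Python's UTF-16 codec rejects a lone D800 both ways.

    lone_surrogate_is_rejected is a hypothetical name for illustration.
    """
    try:
        # Try to encode the would-be "character" #xD800 on its own.
        chr(0xD800).encode("utf-16-be")
        encode_rejected = False
    except UnicodeEncodeError:
        encode_rejected = True

    try:
        # D800 followed by 0041 ("A"), which is not a low surrogate.
        bytes.fromhex("D8000041").decode("utf-16-be")
        decode_rejected = False
    except UnicodeDecodeError:
        decode_rejected = True

    return encode_rejected and decode_rejected
```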


   - Mike
_____________________________________________________________________________
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

