Re: Unicode and XSL (was substring())

Subject: Re: Unicode and XSL (was substring())
From: James Clark <jjc@xxxxxxxxxx>
Date: Sun, 06 Jun 1999 11:28:26 +0700

Paul Prescod wrote:
> 
> David Carlisle wrote:
> >
> > Harder are characters out of the basic plane of unicode. These are a
> > single character in XML eg accessed by a single &#1234; but since
> > they don't fit into 16bits, they take up two slots when the unicode
> > is encoded in utf-16. So the natural thing to do is to count these
> > characters as single characters, but that means string indexing requires
> > walking the string and thus proportional to the index rather than being
> > a constant time array lookup. It also means that indexing and string
> > length give different values if you use a `pure XML' approach or if
> > you escape out to some language that treats strings as an array of 16bit
> > quantities.
> 
> Why are you worrying about the encoding? If your programming language is
> broken in its handling of the platonic ideal concept of characters then
> that is the XSL implementor's problem. There are ways of getting this
> right: you can just use 32 bit characters or you can switch your character
> width or iteration algorithm based on the actual contents of a string. This
> isn't trivial but it is an implementor's problem and should not be
> reflected in XSL.

I basically agree with this.  Counting characters by counting the
16-bit quantities that encode characters in UTF-16 makes about as much
sense as counting characters by counting the 8-bit quantities that
encode characters in UTF-8 (which would mean for example that a dollar
counts as one character, and a pound sterling sign counts as 2
characters).  The counter-argument to this is that the DOM counts using
UTF-16. I would respond by saying that the DOM is not counting
characters but counting 16-bit quantities; there's nothing wrong with
counting 16-bit quantities any more than there is with counting 8-bit
quantities, it just isn't the same thing as counting characters.  The
XML Rec defines what a character is for XML and that is what we should
count.
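
To make the distinction concrete, here is a minimal sketch in Python of the three ways of counting. The function names are hypothetical and chosen only for illustration. It assumes a language whose strings are sequences of Unicode code points, so that the length of the string is already the character count in the XML sense, and it obtains the other two counts by encoding.

```python
# Illustrative sketch only: the function names are hypothetical.
# Assumes Python 3, where a str is a sequence of Unicode code points.

def count_characters(s: str) -> int:
    """Count characters, i.e. Unicode code points."""
    return len(s)

def count_utf16_units(s: str) -> int:
    """Count the 16-bit quantities in the UTF-16 encoding of s."""
    # "utf-16-le" writes no byte order mark, so every 2 bytes is one unit.
    return len(s.encode("utf-16-le")) // 2

def count_utf8_bytes(s: str) -> int:
    """Count the 8-bit quantities in the UTF-8 encoding of s."""
    return len(s.encode("utf-8"))

samples = {
    "dollar sign": "$",              # U+0024
    "pound sterling sign": "\u00a3",  # U+00A3
    "first non-BMP character": "\U00010000",
}

for name, text in samples.items():
    print(name,
          count_characters(text),
          count_utf16_units(text),
          count_utf8_bytes(text))
```

Each sample is one character. The dollar sign is one 8-bit quantity in UTF-8 and the pound sterling sign is two; both are one 16-bit quantity in UTF-16, while the character outside the basic plane is two.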

James



 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

