Re: Unicode and XSL (was substring())

From: James Clark <jjc@xxxxxxxxxx>
Date: Sun, 06 Jun 1999 11:50:54 +0700
David Carlisle wrote:
> 
> combining characters are not necessarily the main problem.
> I'd argue that they ought to count as separate characters as that is
> what they are in the character data of the XML spec.

The problem is that many characters can be represented in Unicode both:

- as a base character and one or more combining characters
- as a single precomposed character

Is "a acute" one character or two? This problem is particularly severe
when documents are using a legacy encoding (i.e. not one based on UCS).
When converting to Unicode, which of the alternative methods for
representing a character in Unicode should a converter choose?
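
To make the "one character or two" question concrete, here is a small
sketch in Python (an illustrative choice, not something the original
discussion used). It builds both representations of "a acute" and shows
that a naive length count and a naive comparison disagree between them.
The variable names are hypothetical.

```python
# Two Unicode representations of "a acute".
precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

# Python strings count code points, so the two forms differ in length.
print(len(precomposed), len(decomposed))

# Without canonicalization the two forms do not compare equal.
print(precomposed == decomposed)
```

Any operation that counts or indexes characters, such as a substring
function, will see the same difference unless the data has been brought
to a single form first.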

There are two issues:

(a) How do you define a canonical form so that there's a single answer
to questions like this?

(b) Where does the canonicalization happen?

Historically the answer to (a) has been that you canonicalize by
decomposing precomposed characters into their base+combining form.  More
recently it has been proposed that canonicalization should compose
base+combining combinations wherever there is a precomposed combination
available in a particular version of Unicode (probably 3.0).
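
The two directions of canonicalization can be sketched with Python's
standard `unicodedata` module. This is an assumption-laden illustration:
the module uses the "NFD" and "NFC" forms from the Unicode normalization
report and whatever Unicode version ships with the interpreter, not
necessarily the version (3.0) mentioned above. The helper names
`decompose` and `compose` are hypothetical.

```python
import unicodedata

def decompose(text: str) -> str:
    """Canonicalize by splitting precomposed characters (form D)."""
    return unicodedata.normalize("NFD", text)

def compose(text: str) -> str:
    """Canonicalize by combining base+combining sequences (form C)."""
    return unicodedata.normalize("NFC", text)

precomposed = "\u00e1"   # a acute as one code point
decomposed = "a\u0301"   # a followed by a combining acute accent

# Either direction gives a single answer, whichever form the input used.
print(decompose(precomposed) == decompose(decomposed))
print(compose(precomposed) == compose(decomposed))
```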

For (b) the problem is that canonicalization is quite an expensive,
complex process.  The cost of requiring all Web clients (including very
lightweight clients like mobile phones and PDAs) always to canonicalize
data themselves would be prohibitive. So the current proposal is that
all data gets canonicalized as early as possible, ideally when it is
produced but in any case before it is sent over the Web.
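
A minimal sketch of the "canonicalize early" idea, again in Python and
under stated assumptions: the producer normalizes once before sending,
so a lightweight client can get by with a plain byte comparison. The
function names `prepare_for_sending` and `client_matches` are
hypothetical, and the choice of the composed form ("NFC") with UTF-8 is
an assumption made for the example.

```python
import unicodedata

def prepare_for_sending(text: str) -> bytes:
    """Producer side: canonicalize once, then encode for the wire."""
    return unicodedata.normalize("NFC", text).encode("utf-8")

def client_matches(received: bytes, expected: bytes) -> bool:
    """Client side: a plain byte comparison, no normalization tables."""
    return received == expected

# The same word produced by two sources using different representations.
a = prepare_for_sending("caf\u00e9")    # precomposed e acute
b = prepare_for_sending("cafe\u0301")   # e + combining acute accent

print(client_matches(a, b))
```

The expensive step runs only where the data is produced; the client
never needs the normalization data at all.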

There is another significant problem that I haven't touched on which is
compatibility characters.
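
As a rough illustration of why compatibility characters are a separate
problem, the Python sketch below uses the "fi" ligature (U+FB01) as an
example chosen for this note. Canonical composition leaves it alone,
while the compatibility form ("NFKC" in Python's `unicodedata` module)
replaces it with the two letters "f" and "i".

```python
import unicodedata

ligature = "\ufb01"   # LATIN SMALL LIGATURE FI, a compatibility character

canonical = unicodedata.normalize("NFC", ligature)
compatible = unicodedata.normalize("NFKC", ligature)

# Canonical normalization keeps the ligature as a single code point.
print(canonical == ligature, len(canonical))

# Compatibility normalization maps it to the plain letters "fi".
print(compatible == "fi", len(compatible))
```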

See:

  http://www.w3.org/TR/WD-charmod
  http://www.unicode.org/unicode/reports/tr15/tr15-10.html

for more background.

James

