Re: Unicode and XSL (was substring())

David Carlisle wrote:
> 
> combining characters are not necessarily the main problem.
> I'd argue that they ought to count as separate characters as that is
> what they are in the character data of the XML spec.

The problem is that many characters can be represented in Unicode either

- as a base character and one or more combining characters
- as a single precomposed character

Is "a acute" one character or two? This problem is particularly severe
when documents are using a legacy encoding (i.e. not one based on UCS).
When converting to Unicode, which of the alternative methods for
representing a character in Unicode should a converter choose?
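
To make the "one character or two" question concrete, here is a small
sketch in Python (my choice of language for illustration; nothing in this
thread depends on it). It uses the standard unicodedata module, and the
variable names are made up for the example. It shows that the two
representations of "a acute" are different code point sequences with
different lengths, which is exactly what trips up functions like
substring():

```python
import unicodedata

# Two ways of writing "a acute" (variable names are illustrative only).
precomposed = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # "a" followed by COMBINING ACUTE ACCENT

# Length counted in code points differs between the two forms.
length_precomposed = len(precomposed)
length_decomposed = len(decomposed)

# A naive comparison treats them as different strings.
naively_equal = precomposed == decomposed

# The decomposed form really is a base character plus a combining character.
decomposed_names = [unicodedata.name(c) for c in decomposed]

print(length_precomposed, length_decomposed, naively_equal)
print(decomposed_names)
```

A converter from a legacy encoding has to pick one of these two outputs,
and any length or substring operation downstream will see the difference.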

There are two issues:

(a) How do you define a canonical form so that there's a single answer
to questions like this?

(b) Where does the canonicalization happen?

Historically the answer to (a) has been that you canonicalize by
decomposing precomposed characters into their base+combining form.  More
recently it has been proposed that canonicalization should compose
base+combining combinations wherever there is a precomposed combination
available in a particular version of Unicode (probably 3.0).
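
The two approaches can be sketched with Python's unicodedata module. I am
assuming here that "decompose" corresponds to what the Unicode
normalization report calls Normalization Form D and "compose" to Form C;
that mapping is my reading, not something stated above. Note also that
Python uses whatever Unicode version it ships with, not specifically 3.0,
and the helper names are hypothetical:

```python
import unicodedata

def decompose(text):
    """Canonicalize by decomposing into base + combining characters (NFD)."""
    return unicodedata.normalize("NFD", text)

def compose(text):
    """Canonicalize by composing wherever a precomposed character exists (NFC)."""
    return unicodedata.normalize("NFC", text)

a_acute_precomposed = "\u00e1"   # single precomposed character
a_acute_decomposed = "a\u0301"   # base + combining acute

# Either policy gives a single answer for both inputs.
nfd_forms = {decompose(a_acute_precomposed), decompose(a_acute_decomposed)}
nfc_forms = {compose(a_acute_precomposed), compose(a_acute_decomposed)}

# Composition only applies where a precomposed character is available:
# there is no precomposed "q acute", so this stays as two code points.
q_acute = compose("q\u0301")

print(nfd_forms == {a_acute_decomposed}, nfc_forms == {a_acute_precomposed})
print(len(q_acute))
```

Whichever form is chosen, the point is that both spellings of "a acute"
collapse to one, so comparisons and lengths become predictable.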

For (b) the problem is that canonicalization is quite an expensive,
complex process.  The cost of requiring all Web clients (including very
lightweight clients like mobile phones and PDAs) always to canonicalize
data themselves would be prohibitive. So the current proposal is that
all data gets canonicalized as early as possible, ideally when it is
produced but in any case before it is sent over the Web.
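
A rough sketch of what "canonicalize early" buys you, again in Python.
The functions produce() and client_equal() are hypothetical names for the
two roles, and the choice of the composed form and UTF-8 is an assumption
made only for the example:

```python
import unicodedata

def produce(text):
    """Hypothetical producer step: canonicalize once, then serialize."""
    return unicodedata.normalize("NFC", text).encode("utf-8")

def client_equal(a, b):
    """Hypothetical lightweight client: a plain byte comparison,
    with no normalization tables needed on the device."""
    return a == b

# Two producers start from different spellings of the same word.
doc1 = produce("caf\u00e9")    # precomposed e acute
doc2 = produce("cafe\u0301")   # e followed by combining acute

# Without early canonicalization the raw bytes would differ.
raw_differs = "caf\u00e9".encode("utf-8") != "cafe\u0301".encode("utf-8")

print(client_equal(doc1, doc2), raw_differs)
```

The expensive step runs once where the data is produced; the client only
ever compares bytes.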

There is another significant problem that I haven't touched on:
compatibility characters.
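
As a brief illustration of why these are a separate problem (my example,
not part of the discussion above): the "fi" ligature at U+FB01 is a
compatibility character. Canonical composition leaves it alone, and only
the compatibility forms of normalization map it to plain "fi". A sketch
with Python's unicodedata module:

```python
import unicodedata

ligature = "\ufb01"   # LATIN SMALL LIGATURE FI, a compatibility character

# Canonical normalization does not touch it.
canonical = unicodedata.normalize("NFC", ligature)

# Compatibility normalization replaces it with the two plain letters.
compat = unicodedata.normalize("NFKC", ligature)

# The character database marks the mapping as a <compat> decomposition.
mapping = unicodedata.decomposition(ligature)

print(canonical == ligature, compat, mapping)
```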

See:

  http://www.w3.org/TR/WD-charmod
  http://www.unicode.org/unicode/reports/tr15/tr15-10.html

for more background.

James

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

Current Thread

Re: Unicode and XSL (was substring()), (continued)
- Paul Prescod - Sat, 05 Jun 1999 20:34:04 -0500
- James Clark - Sun, 06 Jun 1999 11:28:26 +0700
- David Carlisle - Mon, 7 Jun 1999 10:00:50 +0100 (BST)
- David Carlisle - Sun, 6 Jun 1999 15:30:04 +0100 (BST)
- James Clark - Sun, 06 Jun 1999 11:50:54 +0700 <=
- Chuck White - Fri, 04 Jun 1999 10:12:05 -0700
  - Sara Mitchell - Fri, 04 Jun 1999 11:00:21 -0700
  - WorldNet - Fri, 4 Jun 1999 13:41:22 -0500
    - Sebastian Rahtz - Fri, 4 Jun 1999 20:55:00 +0100 (BST)
