Subject: Re: Unicode and XSL (was substring())
From: James Clark <jjc@xxxxxxxxxx>
Date: Sun, 06 Jun 1999 11:50:54 +0700
David Carlisle wrote:
>
> > combining characters are not necessarily the main problem.
> I'd argue that they ought to count as separate characters as that is
> what they are in the character data of the XML spec.

The problem is that many characters can be represented in Unicode both

- as a base character and one or more combining characters
- as a single precomposed character

Is "a acute" one character or two? This problem is particularly severe
when documents are using a legacy encoding (i.e. not one based on UCS).
When converting to Unicode, which of the alternative methods for
representing a character in Unicode should a converter choose?

There are two issues:

(a) How do you define a canonical form so that there's a single answer
to questions like this?

(b) Where does the canonicalization happen?

Historically the answer to (a) has been that you canonicalize by
decomposing precomposed characters into their base+combining form. More
recently it has been proposed that canonicalization should compose
base+combining combinations wherever there is a precomposed combination
available in a particular version of Unicode (probably 3.0).

For (b) the problem is that canonicalization is quite an expensive,
complex process. The cost of requiring all Web clients (including very
lightweight clients like mobile phones and PDAs) always to canonicalize
data themselves would be prohibitive. So the current proposal is that
all data gets canonicalized as early as possible, ideally when it is
produced but in any case before it is sent over the Web.

There is another significant problem that I haven't touched on, which is
compatibility characters.

See:

http://www.w3.org/TR/WD-charmod
http://www.unicode.org/unicode/reports/tr15/tr15-10.html

for more background.

James

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
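
To make the "a acute" question concrete, here is a minimal sketch in Python of the two candidate canonical forms discussed above. It assumes that the decomposing approach corresponds to what the Unicode normalization report calls NFD and the composing approach to NFC, and it uses the standard library's unicodedata module, which follows whatever Unicode version ships with the interpreter rather than a pinned one such as 3.0. The helper names (to_decomposed, to_composed, canonically_equal) are made up for illustration.

```python
import unicodedata

# Two encodings of "a acute" that render identically.
PRECOMPOSED = "\u00e1"   # LATIN SMALL LETTER A WITH ACUTE (one code point)
DECOMPOSED = "a\u0301"   # "a" + COMBINING ACUTE ACCENT (two code points)


def to_decomposed(text: str) -> str:
    """Canonicalize by decomposing into base + combining characters (NFD)."""
    return unicodedata.normalize("NFD", text)


def to_composed(text: str) -> str:
    """Canonicalize by composing to precomposed characters where they exist (NFC)."""
    return unicodedata.normalize("NFC", text)


def canonically_equal(left: str, right: str) -> bool:
    """Compare two strings after bringing both to the same canonical form."""
    return to_composed(left) == to_composed(right)


if __name__ == "__main__":
    # Without canonicalization the two spellings differ in length and content.
    print(len(PRECOMPOSED), len(DECOMPOSED))
    print(PRECOMPOSED == DECOMPOSED)
    # After canonicalization either form gives a single answer.
    print(canonically_equal(PRECOMPOSED, DECOMPOSED))
```

Whichever form is picked, the point of (b) still stands: a producer could run something like to_composed once before the data is sent, so that a lightweight client can compare or count characters without normalizing anything itself.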