Re: [xsl] xsl:sort with msxml english language, danish characters, weird results

From: "W. Eliot Kimber" <ekimber@xxxxxxxxxxxxxxxxxxx>
Date: Mon, 25 Oct 2004 14:28:21 -0500
Bryan Rasmussen wrote:

now I don't suppose that there is a processor anywhere that supports
sorting in pre-print versions of languages, but if there was I guess it
wouldn't matter, because while you can set en-uk you can't set languages
by historical time periods (actually I suppose that English that early
would be an-sa or something, right?) :)

In the Java XSLT world, at least with Saxon, you can implement custom collators to support any collation rules you want, including those of Old English or whatever.
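
As a rough sketch of what such a collator can look like on the Java side, here is one built with the standard java.text.RuleBasedCollator. The class name and the rule string are my own illustrative assumptions (a simplified Danish-style ordering with the three extra letters after "z"). How the collator gets registered with a given XSLT processor is processor-specific and not shown here.

```java
import java.text.ParseException;
import java.text.RuleBasedCollator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical example class; not part of Saxon or any XSLT processor.
final class CustomCollatorSketch {

    // Illustrative tailoring (an assumption, not an official Danish rule set):
    // a-z first, then ae-ligature, o-slash, a-ring. The comma makes upper case
    // differ from lower case only at the tertiary level.
    static final String RULES =
        "< a, A < b, B < c, C < d, D < e, E < f, F < g, G < h, H < i, I"
        + " < j, J < k, K < l, L < m, M < n, N < o, O < p, P < q, Q < r, R"
        + " < s, S < t, T < u, U < v, V < w, W < x, X < y, Y < z, Z"
        + " < \u00E6, \u00C6 < \u00F8, \u00D8 < \u00E5, \u00C5";

    // Build the collator; a malformed rule string is a programming error here.
    static RuleBasedCollator collator() {
        try {
            return new RuleBasedCollator(RULES);
        } catch (ParseException e) {
            throw new IllegalStateException("bad collation rules", e);
        }
    }

    // Return a sorted copy of the input, leaving the original list untouched.
    static List<String> sort(List<String> words) {
        List<String> out = new ArrayList<>(words);
        out.sort(collator()); // Collator implements Comparator<Object>
        return out;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
            "\u00E5ben", "zebra", "\u00E6ble", "abe", "\u00F8l");
        System.out.println(sort(words));
    }
}
```

With these rules the words beginning with the three extra letters sort after "zebra" rather than being folded in near "a" and "o", which is the sort of behaviour you can't get if the processor only exposes a fixed set of locale names.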

As far as I'm concerned, any XSLT processor that does not provide a clear and direct way to integrate arbitrary collators is not very useful (but then almost all of my use of XSLT is to process technical documents with indexes and glossaries in 50+ national languages and sometimes idiosyncratic editorial rules for collation).

A textbook example of why collation has to be custom is Simplified Chinese--it's collated based on the Pin-Yin transliteration of the ideographic characters. For example, the character for "horse" is pronounced "ma" in Mandarin, so it would sort under "M" in the index.

The problem is that there is no single authority for the transliteration of all characters. Many characters have alternative pronunciations, such as "b" or "v" depending on local usage. So there cannot be a single authoritative collation rule for Simplified Chinese: it will always vary based on the local transliteration practice or, sometimes, the opinion of one person or another. You can see this in the Unicode "Unihan" database, which provides lots of information about the Chinese ideographs, including Mandarin and Cantonese transliterations. Many characters have at least two Mandarin transliterations.
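
To make the dependence on local practice concrete, here is a minimal Java sketch that sorts index entries by a transliteration key looked up in a table. The class, the method names, and the tiny three-character table are hypothetical. A real table would come from whatever editorial authority applies. The character U+884C is used because it has two common Mandarin readings ("xing" and "hang"), so the reading the table picks changes where the entry lands.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical example class; the readings table is illustrative only.
final class PinyinSortSketch {

    // Build a toy table; the caller chooses the reading used for U+884C.
    static Map<Character, String> table(String readingFor884C) {
        Map<Character, String> t = new HashMap<>();
        t.put('\u9A6C', "ma");            // "horse"
        t.put('\u674E', "li");            // the surname Li
        t.put('\u884C', readingFor884C);  // "xing" or "hang", by local practice
        return t;
    }

    // Sort key: each character's reading, space-separated.
    // Characters missing from the table are passed through unchanged.
    static String key(String text, Map<Character, String> readings) {
        StringBuilder sb = new StringBuilder();
        for (char c : text.toCharArray()) {
            String r = readings.get(c);
            sb.append(r != null ? r : String.valueOf(c)).append(' ');
        }
        return sb.toString();
    }

    // Return a copy of the entries ordered by their transliteration keys.
    static List<String> sortBy(List<String> entries, Map<Character, String> readings) {
        List<String> out = new ArrayList<>(entries);
        out.sort(Comparator.comparing((String e) -> key(e, readings)));
        return out;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList("\u9A6C", "\u884C", "\u674E");
        System.out.println(sortBy(entries, table("xing")));
        System.out.println(sortBy(entries, table("hang")));
    }
}
```

The same three entries come out in a different order depending on which reading the table supplies, so the collation is only as authoritative as the table behind it.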

I don't use MSXML, but I'm guessing that it relies entirely on Windows' built-in regional settings for collation. That's simply not good enough, at least for technical and academic documents.



W. Eliot Kimber
Professional Services
Innodata Isogen
9390 Research Blvd, #410
Austin, TX 78759
(512) 372-8122

