Re: Bug in 'xsl:sort'. ( XT vs SAXON. )

Subject: Re: Bug in 'xsl:sort'. ( XT vs SAXON. )
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Sat, 05 Aug 2000 19:26:24 +0100

Paul,

>I wish nobody will kill me, but I'm sure that there is
>a bug either in XT or in SAXON. And I wish somebody
>who can read the specs better than me will tell me
>who is right.  XT is latest XT,  Saxon is instant SAXON
>downloaded today.  ( It says : SAXON 5.4 from Michael Kay of ICL )

If you go a little further on in the XSLT Recommendation, it says:

"NOTE: It is possible for two conforming XSLT processors not to sort
exactly the same. Some XSLT processors may not support some languages.
Furthermore, there may be variations possible in the sorting of any
particular language that are not specified by the attributes on xsl:sort,
for example, whether Hiragana or Katakana is sorted first in Japanese.
Future versions of XSLT may provide additional attributes to provide
control over these variations. Implementations may also use
implementation-specific namespaced attributes on xsl:sort for this.

NOTE: It is recommended that implementers consult [UNICODE TR10] for
information on internationalized sorting."

For text sorting, the values should be sorted "lexicographically in the
culturally correct manner for the language specified by lang", but I guess
the question arises in English (as it does in other languages) of whether
'-' comes lexicographically before '0' or not.

If you follow up the UNICODE reference, there is a file that gives the
sort order for just about every character you can think of
[http://www.unicode.org/unicode/reports/tr10/basekeys.txt].  In this file,
the various kinds of hyphen:

00AD ; [*020B.0020.0002.00AD] # SOFT HYPHEN
002D ; [*020C.0020.0002.002D] # HYPHEN-MINUS
FF0D ; [*020C.0020.0003.FF0D] # FULLWIDTH HYPHEN-MINUS; COMPAT
FE63 ; [*020C.0020.000F.FE63] # SMALL HYPHEN-MINUS; COMPAT
2010 ; [*020D.0020.0002.2010] # HYPHEN
2011 ; [*020D.0020.001B.2011] # NON-BREAKING HYPHEN; COMPAT
2012 ; [*020E.0020.0002.2012] # FIGURE DASH
2013 ; [*020F.0020.0002.2013] # EN DASH
FE32 ; [*020F.0020.0016.FE32] # PRESENTATION FORM FOR VERTICAL EN DASH; COMPAT
2014 ; [*0210.0020.0002.2014] # EM DASH
FE58 ; [*0210.0020.000F.FE58] # SMALL EM DASH; COMPAT

come before (i.e. should be sorted before) various forms of 0:

0030 ; [.06B9.0020.0002.0030] # DIGIT ZERO
FF10 ; [.06B9.0020.0003.FF10] # FULLWIDTH DIGIT ZERO; COMPAT
24EA ; [.06B9.0020.0006.24EA] # CIRCLED DIGIT ZERO; COMPAT
2070 ; [.06B9.0020.0014.2070] # SUPERSCRIPT ZERO; COMPAT
2080 ; [.06B9.0020.0015.2080] # SUBSCRIPT ZERO; COMPAT
0660 ; [.06B9.011C.0002.0660] # ARABIC-INDIC DIGIT ZERO
06F0 ; [.06B9.011D.0002.06F0] # EXTENDED ARABIC-INDIC DIGIT ZERO
0966 ; [.06B9.011E.0002.0966] # DEVANAGARI DIGIT ZERO
09E6 ; [.06B9.011F.0002.09E6] # BENGALI DIGIT ZERO
0A66 ; [.06B9.0121.0002.0A66] # GURMUKHI DIGIT ZERO
0AE6 ; [.06B9.0122.0002.0AE6] # GUJARATI DIGIT ZERO
0B66 ; [.06B9.0123.0002.0B66] # ORIYA DIGIT ZERO
0C66 ; [.06B9.0125.0002.0C66] # TELUGU DIGIT ZERO
0CE6 ; [.06B9.0126.0002.0CE6] # KANNADA DIGIT ZERO
0D66 ; [.06B9.0127.0002.0D66] # MALAYALAM DIGIT ZERO
0E50 ; [.06B9.0128.0002.0E50] # THAI DIGIT ZERO
0ED0 ; [.06B9.0129.0002.0ED0] # LAO DIGIT ZERO
0F20 ; [.06B9.012A.0002.0F20] # TIBETAN DIGIT ZERO
0F33 ; [.06B9.012A.0002.0F33] # TIBETAN DIGIT HALF ZERO; COMPAT
3007 ; [.06B9.012B.0002.3007] # IDEOGRAPHIC NUMBER ZERO
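
To make the reading of those entries concrete, here is a rough Python sketch
(mine, not anything XT or SAXON actually does). It assumes the first
four-digit field inside the brackets is the primary weight and compares
strings on that field alone. The names primary_weights and sort_key are made
up for the illustration, and only '-' and '0' are used because the listings
above give no weight for '1'.

```python
import re

# Two lines copied from the listings above (basekeys.txt format).
LINES = """\
002D ; [*020C.0020.0002.002D] # HYPHEN-MINUS
0030 ; [.06B9.0020.0002.0030] # DIGIT ZERO
"""

# Code point, marker ('*' or '.'), then the first weight field.
ENTRY = re.compile(r"^([0-9A-F]{4}) ; \[([*.])([0-9A-F]{4})\.")

def primary_weights(text):
    """Map each listed character to its first (assumed primary) weight."""
    table = {}
    for line in text.splitlines():
        m = ENTRY.match(line)
        if m:
            table[chr(int(m.group(1), 16))] = int(m.group(3), 16)
    return table

WEIGHTS = primary_weights(LINES)

def sort_key(s):
    # Hypothetical key: the sequence of primary weights only.
    # Secondary and tertiary weights are ignored in this sketch.
    return [WEIGHTS[ch] for ch in s]

words = ["00", "0", "-0", "0-"]
ordered = sorted(words, key=sort_key)
print(ordered)
```

On this reading, any string that starts with '-' lands ahead of any string
that starts with '0', simply because 020C is smaller than 06B9.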

This would imply that '-1' should sort before '0', because '-' sorts before
'0'.  However, the section at
[http://www.unicode.org/unicode/reports/tr10/index.html#Alternate
Weighting] describes options for weighting hyphens (and various other
characters) differently, which might contradict this; I can't get my head
around those options right now.
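
As a very loose illustration of why such an option could change the answer,
here is a Python sketch. It rests on two assumptions of mine: that the '*' in
"[*020C..." (as opposed to the '.' in "[.06B9...") marks the characters
those options apply to, and that one option amounts to skipping them when
comparing. The ignore_variable flag is a hypothetical name, and this is not
the TR10 algorithm itself.

```python
# (is_variable, primary weight) for the two characters of interest,
# taken from the basekeys.txt entries for 002D and 0030.
# Assumption: '*' flags a variable element, '.' a regular one.
ELEMENTS = {
    "-": (True, 0x020C),   # 002D HYPHEN-MINUS
    "0": (False, 0x06B9),  # 0030 DIGIT ZERO
}

def sort_key(s, ignore_variable):
    """Primary-weight key; optionally drop variable characters."""
    key = []
    for ch in s:
        variable, primary = ELEMENTS[ch]
        if ignore_variable and variable:
            continue  # treat the hyphen as if it were not there
        key.append(primary)
    return key

words = ["0", "-00"]
as_listed = sorted(words, key=lambda s: sort_key(s, ignore_variable=False))
hyphen_ignored = sorted(words, key=lambda s: sort_key(s, ignore_variable=True))
print(as_listed)
print(hyphen_ignored)
```

With the weights taken as listed, '-00' comes before '0'; with the hyphen
skipped, '-00' is compared as '00' and comes after '0'. Two processors
making different choices here would order the same values differently.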

I don't think that either SAXON or XT is 'right'.  They employ different
sort orders, but from what I can gather, it's fine for them to do so and
still both be compliant.  Eventually the differences between them should
diminish as future versions of XSLT specify additional attributes for
controlling these variations.

Cheers,

Jeni



Dr Jeni Tennison
Epistemics Ltd * Strelley Hall * Nottingham * NG8 6PE
tel: 0115 906 1301 * fax: 0115 906 1304 * email: jeni.tennison@xxxxxxxxxxxxxxxx


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

