Re: Bug in 'xsl:sort'. ( XT vs SAXON. )

Subject: Re: Bug in 'xsl:sort'. ( XT vs SAXON. )
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Sun, 06 Aug 2000 11:34:21 +0100

Paul,

>> If you go a little further on in the XSLT Recommendation, it says:
>> 
>> "NOTE: It is possible for two conforming XSLT processors not to sort
>> exactly the same. Some XSLT processors may not support some languages.
>> Furthermore, there may be variations possible in the sorting of any
>> particular language that are not specified by the attributes on xsl:sort,
>> for example, whether Hiragana or Katakana is sorted first in Japanese.
>
>This is not the case here, right? ( Actually I don't understand 
>why something other than UTF * should be supported 
>by W3C standards, but that's another story ).

Well I thought it might be the case here, that this might be a variation in
the sorting of English (the particular language) not specified by the
attributes on xsl:sort.  For example, one might rationally use the rule
'ignore hyphens' when sorting, thinking that hyphens do not add semantic
information to a term, or 'ignore hyphens only in the middle of words' or
'ignore hyphens when they are not followed by a number' and so on.  I don't
think any of these rules are unreasonable, and in certain situations they
will lead to different results.
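
To make that concrete, here is a rough sketch in Python (not XSLT, and not
a claim about what SAXON or XT actually do) of those three rules applied to
the same list.  The function names and the sample terms are made up for
illustration.

```python
import re

def ignore_all_hyphens(s):
    # Rule 1: hyphens carry no meaning, so drop every one of them.
    return s.replace("-", "")

def ignore_mid_word_hyphens(s):
    # Rule 2: drop a hyphen only when it sits between two word characters.
    return re.sub(r"(?<=\w)-(?=\w)", "", s)

def ignore_hyphens_not_before_digit(s):
    # Rule 3: drop a hyphen unless the next character is a digit.
    return re.sub(r"-(?!\d)", "", s)

# Hypothetical sample terms, chosen so that the rules disagree.
terms = ["-1", "0", "co-op", "cone", "-x"]

# Each rule is used as a sort key; the remaining characters are compared
# by plain code point, which is itself a simplification.
by_all = sorted(terms, key=ignore_all_hyphens)
by_mid = sorted(terms, key=ignore_mid_word_hyphens)
by_digit = sorted(terms, key=ignore_hyphens_not_before_digit)

for name, result in [("all", by_all), ("mid-word", by_mid), ("digit", by_digit)]:
    print(name, result)
```

Each rule puts the five terms in a different order, and none of the three
orders is obviously the wrong one.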

>> Future versions of XSLT may provide additional attributes to provide
>> control over these variations. Implementations may also use
>> implementation-specific namespaced attributes on xsl:sort for this.
>
>This is also not the case, right ?

In that we are not using a future version of XSLT and neither SAXON nor XT
have documented implementation-specific namespaced attributes to determine
sort order, yes.

>> NOTE: It is recommended that implementers consult [UNICODE TR10] for
>> information on internationalized sorting."
>> 
>> The values should be sorted "lexicographically in the culturally correct
>> manner for the language specified by lang" but I guess the question arises
>> in English (as it does in other languages) about whether '-' is
>> lexicographically before '0' or not.
>
>Right. But I'm not sure the question is about 'English'. I think the 
>question really is 'in UTF8' ?

I disagree.  The xsl:sort documentation says: "'text' specifies that the
sort keys should be sorted lexicographically in the culturally correct
manner for the language specified by lang".  I'm assuming that the default
language in Sebastian's files is English.  Thus the sort should be done in
English.

I am no expert on character encoding, but as far as I understand it, the
UTF8 values for ASCII characters all come before the UTF8 values for
accented characters.  If you sorted on UTF8 character value, 'z' would come
before 'á', whereas you would expect 'a' and all its associated
accents to be grouped together.  If you look at the UNICODE basekey file
[http://www.unicode.org/unicode/reports/tr10/basekeys.txt], you can see
that there are groups of characters with all different kinds of UTF8
values.  For example all those zeros that I extracted and sent in my last
mail, come before another set of ones from various languages.

Sorting on UTF8 values is basically a dangerous way to sort characters if
you're dealing with anything bar bare English, and even with just English,
as we have seen, punctuation and spacing still provide problem areas.
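
A small Python sketch of the 'z' before 'á' problem follows.  The word list
is invented, and the accent-stripping key is only a crude stand-in for
proper collation; it is not the algorithm described in TR10.

```python
import unicodedata

# Hypothetical sample words; "\u00e1" is a precomposed a-acute.
words = ["zebra", "\u00e1baco", "apple"]

# Sort on the raw UTF-8 bytes: every ASCII letter sorts before the
# accented letter, so 'z' comes before a-acute.
by_utf8_value = sorted(words, key=lambda s: s.encode("utf-8"))

def base_letters(s):
    # Decompose each character (NFD) and drop the combining marks,
    # so that a-acute is compared as a plain 'a'.
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Compare base letters first, using the original string as a tie-breaker.
by_base_letter = sorted(words, key=lambda s: (base_letters(s), s))

print(by_utf8_value)
print(by_base_letter)
```

The first sort leaves the accented word stranded after 'zebra'; the second
groups it with the other words beginning with 'a'.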

>Why? There is no special encodings or special sorting attributes. 
>Both engines receive the same 'lang' environment (  Or they don't??? ) , 
>why do they employ different sort orders? 

Probably because Mike Kay and James Clark think that different rules apply
to sorting in English, although it's possible that one of the processors is
sorting based on something other than a lang-dependent order.

>I still think something is strange here. They both are sorting UTF8 (?)
>without any special cases mentioned in the W3C paper and the 
>question is :  "in  UTF8(?) what comes first '-' or '0' ?"  - Right?
>Is it legal they are giving different answers to the same question?

No, the question is: "in English, what comes first: '-' or '0'?".  It is
legal for them to give different answers; it's even compliant of them.
It's just not particularly helpful :)

>> Eventually the differences between them should be
>> diminished through the specification of additional attributes.
>
>Pardon, what attributes do you mean ???

From the XSLT Recommendation:

"Future versions of XSLT may provide additional attributes to provide
control over these variations. Implementations may also use
implementation-specific namespaced attributes on xsl:sort for this."

For example, Mike could add an extension attribute to xsl:sort called
saxon:ignore-hyphens.  When the value is 'yes', then hyphens are simply
ignored (and '-1' will sort after '0'); when the value is 'no', then
hyphens are taken into account (and '-1' will sort before '0').
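
The effect of such a switch can be mimicked in a few lines of Python.  This
is only a sketch: saxon:ignore-hyphens does not exist, and sort_terms and
its ignore_hyphens parameter are names I have invented to stand in for it.

```python
def sort_terms(terms, ignore_hyphens="no"):
    # Hypothetical stand-in for an ignore-hyphens switch on a sort.
    if ignore_hyphens == "yes":
        # Compare the terms as if their hyphens were not there.
        return sorted(terms, key=lambda s: s.replace("-", ""))
    # Otherwise hyphens take part in the comparison like any other character.
    return sorted(terms)

print(sort_terms(["-1", "0"], ignore_hyphens="yes"))
print(sort_terms(["0", "-1"], ignore_hyphens="no"))
```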

Or in the next version of XSLT, there might be an xsl2:alternate-weighting
attribute defined on xsl:sort with the values of 'blanked', 'non-ignorable'
and 'shifted', each giving different weightings to collation elements like
hyphens and spaces as described in
[http://www.unicode.org/unicode/reports/tr10/index.html#Alternate Weighting].

>I now think maybe this is the bug in XT ?

It's certainly possible that XT doesn't employ lang-specific lexicographic
sort orders, but I think it's unlikely.

Ideally, XSLT Processors would document the rules they use to sort text;
the differences between them would form the input into the set of
attributes for xsl:sort in the next version of XSLT; and all the XSLT
Processors would then implement the variant sorts.  Then you, as the
stylesheet author, would be able to specify which type of sort you wanted,
and be able to consistently get it across XSLT Processors.  But I don't
think that this is a matter of 'right' and 'wrong' at the moment.

Cheers,

Jeni

Dr Jeni Tennison
Epistemics Ltd * Strelley Hall * Nottingham * NG8 6PE
tel: 0115 906 1301 * fax: 0115 906 1304 * email: jeni.tennison@xxxxxxxxxxxxxxxx


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

