## [xsl] order-by vs xsl:sort

 Subject: [xsl] order-by vs xsl:sort From: David Carlisle Date: Thu, 1 Sep 2005 17:21:24 +0100
```In the interests of keeping the archives of this list accurate...
Back in April under a thread of this name that I started,
Michael said, quoting me:

>> Not really. It models it as a sequence of integers of length M*N and
>> calculates the sort key (and data needed for the result) by
>> calculating
>> the equivalent indexes into the original sequences using mod and idiv.
>>
>> given the general 2 variable case:
>>
>>
>> for \$i in \$is,  \$j in \$js
>> order by f(\$i,\$j)
>> return
>> g(\$i,\$j)
>>
>> for some functions f and g then I think you can always replace this by
>>
>> let \$ci :=count(\$is) return
>> let \$cj :=count(\$js) return
>> for \$n in (0 to \$ci * \$cj)
>> let \$i :=\$is[\$n  mod \$ci)+1]
>> let \$j := \$js[(\$n idiv \$ci) +1]
>> order by f(\$i, \$j)
>> return
>> g(\$i,\$j)
>>
>> and once you have just a single for and order by, converting
>> that to xsl
>> for-each and sort is just syntax.
>>
>
>Nice result. Perhaps I can get rid of those horrible tuples in my
>implementation!

Unfortunately the premise of my statement is wrong. The example is
the general case for a cartesian product, but FLWOR loops actually model
a dependent product where the sequence over which the inner variable is
iterating can depend on the outer variable (and in particular can have
varying length).

This means that you can't pre-calculate the dimensions, so can't use a
simple div and mod formula to generate the tuple indexes from a single
integer.

All is not lost though (I think) but it gets a bit more expensive.

the general two variable case is

for \$i in \$is,
\$j in F(\$i)
order by f(\$i,\$j)
return
g(\$i,\$j)

However I believe this may be rewritten as

let \$one := (
for \$i at \$ip in \$is,
\$j at \$jp in F(\$i)
return
encode(\$ip,\$jp)
)
return

for \$z in \$one
let \$indexes:=decode(\$z)
let \$ip:=\$indexes
let \$i:= \$is[\$ip]
let \$jp:=\$indexes
let \$j:= (F(\$i))[\$jp]
order by f(\$i,\$j)
return
g(\$i,\$j)

where encode is anything that encodes a sequence of integers as a single
item and decode is anything that gets the sequence of integers back.
Originally I had \$one being a sequence of elements with the decimal
values of the integers as content
(<z>1 2</z>,<z>1 3</z>,...
and decode() being a call to tokenize() but that seems rather expensive
to generate all those temporary elements in a sequence just used once.
So currently I encode the list of integers as a string using
codepoints-to-string and back using string-to-codepoints.
This requires a bit of offsetting to avoid banned codepoints like 0, and
(since I use a single character per integers) limits me to a million or
so entries per sequence, but could easily use more characters per
integer.

Main disadvantage of all this is that the expresions F(\$i) above that
generate the sequnces get evaluated multiple times and might be
expensive, especially in the case  that they don't actually depend on
the outer variable. I wonder if I could rely on the XSLT optimising
complier to spot such repeated expressions and take them out of the
loop, or whether an xquery to xslt conversion should do this as part of
the conversion step...

(the above only uses xquery notation but the point of it is to express
FLWOR as xsl:for-each: once you have it rewritten as a loop with only
one for-variable, writing as xslt is trivial.

David

________________________________________________________________________
This e-mail has been scanned for all viruses by Star. The