RE: [xsl] Benefits of using xsl:key

Subject: RE: [xsl] Benefits of using xsl:key
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Tue, 3 Nov 2009 22:27:06 -0000
> By modifying the simple example by Michael M|ller-Hillebrand:
>
> Input.xml
>
> <Doc>
>     <value oid="f37">some text</value>
>     <value oid="f61">some text</value>
>     <value oid="f042">some other text</value> </Doc>
>
> Copy and pasting so we have 3000 lines.

The exact way in which you created this file is probably very important. If
you did it by naive copy-and-paste, then it's likely that the duplicate
values in the file are very close to each other and that the distance
between duplicates is consistently small.

This means that the pattern match="value[.=preceding::value]" is not nearly
as expensive to evaluate as one might expect: it doesn't involve a search
back to the beginning of the file, only to the nearest duplicate value. I
was initially surprised that introducing keys to this example made so much
difference, given that you left this match pattern unchanged; but the
explanation for this would lie in the predictable data distribution.

By contrast, this code

>             <xsl:attribute name="xrefid"
> select="preceding::value[.=current()][last()]/@oid"/>

is looking for the first (i.e. most distant) duplicate, which will in
general be a long search.

I suspect that with most real-life data, evaluating the pattern would be
much more expensive than it is with this synthetic data, and therefore
introducing the key would have less impact.

>I have a feeling that optimization in EE is not for "preceding axis"
problems, but for "standard" transformations of huge input files?

Saxon-EE has many optimizations, but won't optimize this particular
construct. The reason is that the sequence to which the predicate is applied
(preceding::value) is only searched once; it's not worth building an index
to support a search that is only done once.

I'm always on the look-out for new optimization opportunities (though they
have to occur in constructs that are frequent enough to justify the extra
complexity). I think it would be potentially possible to extend the indexing
optimization to this case by recognizing that one can rewrite
preceding::value[X=Y] as //value[X=Y][.<<current()], which would then be
rewritten to use an index. In fact, I suspect that rewriting preceding::X as
//X[.<<current()] might well be a worthwhile rewrite even without the
indexing, as it would take advantage of the automatic indexing of //X which
occurs even with Saxon-HE: though it needs care because the "natural" order
of the results is different.

It's worth noting that your rewrite of

select="preceding::value[.=current()][last()]/@oid"

by

select="key('value-content', .)[1]"

works only because you know (from the semantics of the match pattern for the
template) that the context item cannot be the first in its key-group.

Regards,

Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay

Current Thread