RE: [xsl] Testing by counting or positional predicate

Subject: RE: [xsl] Testing by counting or positional predicate
From: Kay Michael <Michael.Kay@xxxxxxx>
Date: Thu, 11 Jan 2001 15:18:52 -0000
> I guess that this will depend on when the node set is constructed: if
> it's constructed in advance of the test, then the first is probably
> better because it just tests two numbers as opposed to iterating over
> the nodes, unless $length is very small in which case they're probably
> roughly equivalent.
> 
> If the node set is constructed on the fly and it's relatively likely
> that it's at least $length in length, then it's probably better to use
> the second because it prevents having the whole node set constructed
> before its length is tested.  If it's unlikely that the node set will
> be of length $length, then they're probably roughly equivalent.
> 
> Would that be an accurate analysis?

Yes. Almost. But its actually more complicated. For example, it depends on
what else you're going to do with the node-sets. As a general rule, if
you're going to throw it away, then try to avoid materialising it; but if
you're going to use it anyway, then you might as well materialise it.

Say you want to do

<xsl:if test="count(//*) > 200000">
   <xsl:message> too many nodes </xsl:message>
</xsl:if>

Here //* will be automatically delivered sorted and deduplicated, but Saxon
will scan and count all the elements in the document, even if there are ten
million. It won't, however, build the node-set in memory.

If you change this to:

<xsl:if test="not((//*)[200000])">
   <xsl:message> too many nodes </xsl:message>
</xsl:if>

Then it will scan all the elements in the document, testing each one to see
if it is number 200000, and when it finds one that is, it will stop the
scan.

If >200000 is an exception condition, then in the normal case both will
follow the same access path so there shouldn't be a difference.

A different example, we want to display all the nodes if there's less than a
hundred, but only the first fifty otherwise.

<xsl:variable name="nodes" select="//item[color='green']"
<xsl:variable name="chunk" 
    select="$nodes[count($nodes)<100] | $nodes[position() < 50]"/>
<xsl:for-each select="$chunk">
  <xsl:value-of select="."/>
</xsl:for-each>

I have great trouble working out what Saxon will do with this and I might be
wrong!

//item[color='green'] (I think) does not guarantee to deliver nodes in
sorted order. Evaluating the first <xsl:variable> element will create an
"intensional node-set", i.e. the variable is not immediately evaluated, all
that happens is that the expression is "absolutized" which in this case is
done by noting the current document. Evaluating the second variable is the
same. Then the <xsl:for-each> asks for $chunk to be delivered in document
order, and because $chunk is a union expression it asks for both halves to
be delivered in document order, and merges the two streams of nodes. When it
asks for the nodes in $nodes[count($nodes)<100], this starts by doing a
reduction of the predicate, which in this case amounts to a full evaluation
because the predicate doesn't depend on the context node or position. So it
evaluates count($nodes)<100, which requries sorting and deduplicating
$nodes, which causes a scan of the //item elements. The result of this scan
is saved in memory because it has to be sorted (if no sort were required,
we'd save it in memory only on the third time it is scanned: a crude
space-time trade-off). After that evaluating $nodes again works on the
sorted in-memory copy. So, the count($nodes) ensures that all the //item
elements will be scanned, regardless how many there are.

If we rewrote this as
<xsl:variable name="chunk" 
    select="$nodes[not($nodes[100])] | $nodes[position() < 50]"/>

the same would happen, because the filter expression $nodes[100] also
requires the node-set to be sorted and deduplicated.

But if the original variable were
<xsl:variable name="nodes" select="item[color='green']"

things would be different, because this node-set is intrinsically sorted.
This time the count() formulation would actually count the items for which
the predicate is true, while the not() formulation would stop after 100. If
there are less than 100 nodes, it wouldn't make any difference, if there are
more, it could. But it's complicated by the fact that there are three
variable-references to $nodes here, which is enough to trigger
materialization of the node-set even though it doesn't need to be sorted.

The bottom line is: it's complicated, and not easily predictable!

I'm reminded of something I used to teach in relational database courses:
leave it to the optimizer, unless you're desperate. If you're desperate,
make changes empirically, making one change at a time and measuring and
recording the results; if a change makes no difference, undo it. Don't try
to understand the query execution plan unless you're REALLY desperate!

Mike

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


Current Thread