RE: [xsl] Latest XSLTMark benchmark

Subject: RE: [xsl] Latest XSLTMark benchmark
From: "Michael Kay" <mhkay@xxxxxxxxxxxx>
Date: Mon, 2 Apr 2001 11:59:33 +0100
> > I have just had a look at the Saxon driver and think you are correct.
> > This also appears to have been true in the earlier (1_2_1) release as
> > well. It was clearly intended that the loadInput() call should actually
> > load the input as in the other drivers not just open an input stream.
>
> It was there in version 1.1.1, as well. Everyone, including Michael Kay,
> missed it -- the disagreement between the documentation, the code in the
> other drivers, and the Saxon driver. The intention (which may not be
> practical) was always to measure just the XSLT transformation.

I missed it because the intention certainly wasn't clear. In fact, the
documentation refers anyone who wants to write a driver to the supplied
driver for xt, which builds the source document tree inside the loop, not
outside it.

I do think that the benchmark should be measuring parsing plus
transformation plus serialization, because that is the most typical usage
scenario, and because if you don't measure that, a processor that optimizes
parsing or serialization based on knowledge of the stylesheet gets no credit
for it. For example, a processor might be able to do faster serialization if
it knows that neither the source document nor the stylesheet contains any
Unicode characters outside the target encoding, or it might be able to use
stylesheet information to achieve faster parsing (perhaps avoiding entity
expansion in parts of the document that are never accessed, for example, or
perhaps skipping altogether any sections of the document that the
transformation does not require).
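
As a rough illustration of the two measurement styles under discussion, the
sketch below times a loop with the parse either inside or outside the timed
region. It is not XSLTMark code: the transform() function is a hypothetical
stand-in for a real XSLT processor, and SOURCE and run() are names invented
for this example.

```python
import time
import xml.etree.ElementTree as ET

SOURCE = "<doc><item>a</item><item>b</item></doc>"


def transform(tree):
    # Hypothetical stand-in for an XSLT transformation: upper-case item text.
    out = ET.Element("out")
    for item in tree.iter("item"):
        ET.SubElement(out, "ITEM").text = (item.text or "").upper()
    return out


def run(iterations, parse_inside_loop):
    """Return (elapsed_seconds, parse_count, last_output)."""
    parse_count = 0
    tree = None
    if not parse_inside_loop:
        # Parsed once, before the clock starts: only transformation and
        # serialization are measured.
        tree = ET.fromstring(SOURCE)
        parse_count += 1
    output = ""
    start = time.perf_counter()
    for _ in range(iterations):
        if parse_inside_loop:
            # Parsing is part of every timed iteration.
            tree = ET.fromstring(SOURCE)
            parse_count += 1
        result = transform(tree)
        output = ET.tostring(result, encoding="unicode")
    elapsed = time.perf_counter() - start
    return elapsed, parse_count, output
```

With parse_inside_loop=True every iteration pays for parsing, so any saving a
processor makes there shows up in the result; with False it is invisible.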

The most likely situation that affects current processors, however, is
whitespace stripping. There are basically three ways to do whitespace
stripping: do it while building the tree, modify the tree after building it,
or leave the whitespace on the tree but ensure that it has no effect. (There
is a fourth way, which is to decide not to conform to the standard.) The
first approach is by far the most efficient, but by insisting that the tree
is built without any knowledge of the stylesheet you are effectively ruling
it out. In Saxon's case this will force the second approach, which is far
more expensive.
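
The first two approaches can be sketched as follows. This is only an
illustration using Python's standard ElementTree, not Saxon's implementation;
it assumes the stylesheet asks for whitespace-only text to be stripped from
every element, and the names StrippingTarget, strip_while_building and
strip_after_building are invented for the example.

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"  # the four XML whitespace characters

SOURCE = "<doc>\n  <item>a</item>\n  <item> b </item>\n</doc>"


class StrippingTarget:
    """Parser target that drops whitespace-only text before it reaches the tree."""

    def __init__(self):
        self._builder = ET.TreeBuilder()
        self._chunks = []  # the parser may deliver one text node in pieces

    def _flush(self):
        text = "".join(self._chunks)
        self._chunks = []
        if text.strip(XML_WS):  # forward only text with non-whitespace content
            self._builder.data(text)

    def start(self, tag, attrib):
        self._flush()
        return self._builder.start(tag, attrib)

    def end(self, tag):
        self._flush()
        return self._builder.end(tag)

    def data(self, text):
        self._chunks.append(text)

    def close(self):
        return self._builder.close()


def strip_while_building(xml_text):
    # Approach 1: whitespace-only text never becomes part of the tree.
    parser = ET.XMLParser(target=StrippingTarget())
    parser.feed(xml_text)
    return parser.close()


def strip_after_building(xml_text):
    # Approach 2: build the full tree, then make a second pass to prune it.
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.text is not None and not elem.text.strip(XML_WS):
            elem.text = None
        if elem.tail is not None and not elem.tail.strip(XML_WS):
            elem.tail = None
    return root
```

Both functions yield the same tree, but the second one first stores the
whitespace and then walks every node again to remove it, and it is that extra
pass that a stylesheet-blind tree build forces.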

There is a trade-off between time taken to build the tree and time taken to
do the transformation. Saxon is trying to minimize the sum of these two
activities. The Saxon "tinytree" model deliberately reduces the time spent
building the tree at the expense of the time spent navigating it, based on
the observation that the average number of visits made by a stylesheet to
each source node is often about one. Your proposal will encourage
implementors to spend more time building the tree in order to spend less
time transforming it, which is not necessarily the right design approach for
"real life".

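The trade-off can be made concrete with a simple cost model. The linear form
and all of the symbols below are assumptions introduced for illustration, not
measurements or anything taken from Saxon itself.

```latex
% Assumed linear cost model:
%   n   = number of source nodes
%   b_i = build cost per node for tree design i
%   c_i = navigation cost per node visit for tree design i
%   v   = average number of visits per source node
T_i = n\,(b_i + v\,c_i), \qquad i \in \{1, 2\}

% Design 1 builds cheaply but navigates slowly: b_1 < b_2,\ c_1 > c_2.
T_1 < T_2 \iff b_1 + v\,c_1 < b_2 + v\,c_2 \iff v < \frac{b_2 - b_1}{c_1 - c_2}

% A benchmark that times the transformation alone observes only
T_i^{\mathrm{transform}} = n\,v\,c_i
```

Under this model, with v close to one, design 1 wins overall whenever its
saving in build cost exceeds its penalty in navigation cost, yet a
transformation-only measurement always ranks design 2 higher because
c_2 < c_1.
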
The current approach is also open to "cheating". For example, it would be
quite legitimate for a processor to cache the index structures used to
implement keys, so that if the same stylesheet is applied twice to the same
source document, the indexes do not need to be rebuilt. Implementing such a
cache will boost a processor's rating in this benchmark far more than the
technique actually warrants in real life.
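
A minimal sketch of such a cache, assuming a key that matches elements by tag
name and indexes them on one attribute value. KeyIndexCache and its methods
are hypothetical names for this example and do not describe any particular
processor.

```python
import xml.etree.ElementTree as ET


class KeyIndexCache:
    """Remembers key indexes per (document, key definition) pair."""

    def __init__(self):
        self._indexes = {}
        self.builds = 0  # how many times an index was actually built

    def get_index(self, root, match_tag, use_attr):
        cache_key = (id(root), match_tag, use_attr)
        entry = self._indexes.get(cache_key)
        if entry is None:
            # Cache miss: scan the whole document once to build the index.
            index = {}
            for elem in root.iter(match_tag):
                value = elem.get(use_attr)
                if value is not None:
                    index.setdefault(value, []).append(elem)
            self.builds += 1
            # Hold a reference to root so its id() cannot be reused
            # by another document while the entry is cached.
            entry = (root, index)
            self._indexes[cache_key] = entry
        return entry[1]
```

In a benchmark loop that applies one stylesheet to one in-memory document many
times, only the first iteration pays for the scan; every later iteration gets
the index for free, which is exactly the inflated benefit described above.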

Mike Kay
Software AG
