Re: [xsl] optimization of complex XPath

Subject: Re: [xsl] optimization of complex XPath
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Fri, 19 Nov 2010 09:25:50 +0100
The O(n^2) behaviour is due to iterating over the list of links (the
for expression) and then, for each of them, scanning the whole document
tree again to evaluate //num/@cite.
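
As a rough intermediate step, in XQuery one could at least hoist the
node-set out of the loop with a let, so the tree is only walked once;
whether the $x = $cites comparison is then indexed is still up to the
processor. A sketch:

let $cites := //num/@cite
for $x in //link[@area='decisions']/@cite
return
  if ($x = $cites)
  then 'good'
  else concat('bad cite value ', $x, '&#x0A;')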

The following stylesheet should do it better, using xsl:key:

<?xml version="1.0"?>
<xsl:stylesheet version="2.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:strip-space elements="*"/>
  <xsl:key name="citekey" match="num" use="@cite"/>
  <xsl:template match="link">
    <xsl:value-of
        select="(key('citekey',@cite),'ok',concat(@cite,' is missing'))[2]"/>
    <xsl:text>&#xA;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
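
The select expression relies on the fact that key() returns an empty
sequence when there is no matching num: the sequence then collapses and
item [2] becomes the "missing" message; otherwise item [2] is 'ok'. If
only links pointing into the "decisions" area should be checked, as in
your original expression, the match pattern could presumably be narrowed
to

  <xsl:template match="link[@area='decisions']">

leaving the built-in rules to skip the other links.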

I have tested it only with a very primitive XML document:

<doc>
  <foo>
    <num area="decisions" cite="1"/>
    <bar>
      <link area="decisions" cite="1"/>
      <link area="decisions" cite="2"/>
    </bar>
  </foo>
</doc>
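
For that input the stylesheet should produce something like:

ok
2 is missing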

-W


On 19 November 2010 02:20, Graydon <graydon@xxxxxxxxx> wrote:
>
> A bit of background before I get to the question:
>
> I have about 2 GiB of XML which contains links.  These links are
> constructed as <link area="value" cite="numvalue"/>, and reference num
> elements with matching cite values, <num cite="numvalue"/>.  (The area
> value is associated with an ancestor of the num element.)  The guarantee
> is that an area and cite pair is unique across the entire content set.
>
> The roughly-a-third of the content represented by that 2 GiB needs to be
> checked for broken links. (The other 2/3 needs to be converted from SGML,
> and _then_ it will need to be checked for broken links.)
>
> The currently converted content is all in the area "decisions", so I can
> only check links that point _to_ the area "decisions"; that is, the
> internal links of the already-converted content.  That means I have about
> 100,000 of the 250,000 links to check on this pass, but I will eventually
> have to validate all the links in the whole (5+ GiB) content set.
>
> I am currently using the XPath expression:
>
> for $x in (//link[@area='decisions']/@cite) return
>    (if ($x = (//num/@cite))
>     then 'good'
>     else concat('bad cite value ',$x,'&#x000a;')
>    )
>
> to check the links. (I don't need to qualify the num element in
> //num/@cite with [ancestor::*[1]/@area='decisions'] because all the
> currently converted content has an @area value of "decisions".)
>
> This works, and dumps out a text list of bad links (and one "good"
> per good link, so I can count them) but it is slow in ways that make me
> think it's at least O(N^2) and probably worse.  Performance is
> reasonable on small subsets of the content but very slow on the whole of
> the available XML.  Since I have to check the whole thing as a unit to
> properly test the ~two fifths of the links that refer to the decisions
> area, and since I will eventually have to run the check on two or three
> times as much content, O(N^2) is a problem; I can't say it takes a week
> to run the link check.  (Well, I can _say_ it, but it won't meet with a
> pleased reception.)
>
> How can I make this check faster?
>
> I can run this as an XQuery via eXist, via XPath 2.0 using Saxon 9.2, or
> (if really necessary) some other means, and would appreciate any
> suggestions from the list's collective wisdom.
>
> Thanks!
> Graydon
