Subject: Re: [xsl] optimization of complex XPath|
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Fri, 19 Nov 2010 09:04:30 +0000
Michael Kay Saxonica
A bit of background before I get to the question:
I have about 2 GiB of XML which contains links. These links are constructed as<link area="value" cite="numvalue"/>, and reference num elements with matching cite values,<num cite="numvalue"/>. (The area value is associated with an ancestor of the num element.) The guarantee is that an area and cite pair is unique across the entire content set.
The roughly-a-third of the content represented by that 2 GiB needs to be checked for broken links. (The other 2/3 needs to be converted from SGML, and _then_ it will need to be checked for broken links.)
The currently converted content is all in the area "decisions", I can only check links that point _to_ the area "decisions"; the internal links of the already-converted content. That makes I have about 100,000 out of the 250,000 links to check on this pass, but I will eventually have to validate all the links in the whole (5+ GiB) content set.
I am currently using the XPath expression:
for $x in (//link[@area='decisions']/@cite) return (if ($x = (//num/@cite)) then 'good' else concat('bad cite value ',$x,'
to check the links. (I don't need to qualify the num element in //num/@cite with [ancestor::*/@area='decisions'] because all the currently converted content has an @area value of "decisions".)
This works, and dumps out out a text list of bad links (and one "good" per good link, so I can count them) but it is slow in ways that make me think it's at least O(N^2) and probably worse. Performance is reasonable on small subsets of the content but very slow on the whole of the available XML. Since I have to check the whole thing as a unit to properly test the ~two fifths of the links that refer to the decisions area, and since I will eventually have to run the check on two or three times as much content, O(N^2) is a problem; I can't say it takes a week to run the link check. (Well, I can _say_ it, but it won't meet with a pleased reception.)
How can I make this check faster?
I can run this as an XQuery via eXist, via XPath 2.0 using Saxon 9.2, or (if really necessary) some other means, and would appreciate any suggestions from the list's collective wisdom.