Subject: Re: [xsl] optimization of complex XPath
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Fri, 19 Nov 2010 10:24:00 +0100
On 19 November 2010 10:04, Michael Kay <mike@xxxxxxxxxxxx> wrote:
> I think Saxon-EE will automatically optimize this join for you and run it
> in O(n log n) time.
>
> Or you could hand-optimize it by running in XSLT and using keys.
>
> Both these techniques will use more memory, something which is probably in
> short supply; if you need to optimize memory consumption as well as
> execution time then you could consider some kind of filter/sort/merge
> process, which might be a lot more complex to write.
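
A key-based check along those lines might look roughly like the sketch
below (untested; element and attribute names are taken from Graydon's
post, and it assumes the whole content set is processed as a single
document, since key() indexes per document):

<?xml version="1.0" encoding="UTF-8"?>
<!-- One pass builds an index of the num targets; each link lookup is then
     (roughly) constant time instead of a scan of all //num/@cite. -->
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- index every num element by its cite value -->
  <xsl:key name="num-by-cite" match="num" use="@cite"/>

  <xsl:template match="/">
    <xsl:variable name="links" select="//link[@area = 'decisions']"/>
    <!-- report each broken link -->
    <xsl:for-each select="$links[not(key('num-by-cite', @cite))]">
      <xsl:value-of select="concat('bad cite value ', @cite, '&#x0a;')"/>
    </xsl:for-each>
    <!-- one summary count instead of one 'good' line per good link -->
    <xsl:value-of
        select="concat(count($links[key('num-by-cite', @cite)]), ' good&#x0a;')"/>
  </xsl:template>

</xsl:stylesheet>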

What's the fastest and memory-cheapest way to extract these links and
targets to two different plain text files (or even a single one)?
Post-processing with utilities such as grep, sort -u and comm wouldn't
be "complex".
-W

>
> Michael Kay
> Saxonica
>
> On 19/11/2010 01:20, Graydon wrote:
>>
>> A bit of background before I get to the question:
>>
>> I have about 2 GiB of XML which contains links.  These links are
>> constructed as <link area="value" cite="numvalue"/>, and reference num
>> elements with matching cite values, <num cite="numvalue"/>.  (The area
>> value is associated with an ancestor of the num element.)  The guarantee
>> is that an area and cite pair is unique across the entire content set.
>>
>> The roughly one-third of the content represented by that 2 GiB needs to be
>> checked for broken links. (The other 2/3 needs to be converted from SGML,
>> and _then_ it will need to be checked for broken links.)
>>
>> The currently converted content is all in the area "decisions", so I can
>> only check links that point _to_ the area "decisions"; that is, the
>> internal links of the already-converted content.  That means I have about
>> 100,000 out of the 250,000 links to check on this pass, but I will
>> eventually have to validate all the links in the whole (5+ GiB) content set.
>>
>> I am currently using the XPath expression:
>>
>> for $x in (//link[@area='decisions']/@cite) return
>>     (if ($x = (//num/@cite))
>>      then 'good'
>>      else concat('bad cite value ',$x,'&#x000a;')
>>     )
>>
>> to check the links. (I don't need to qualify the num element in
>> //num/@cite with [ancestor::*[1]/@area='decisions'] because all the
>> currently converted content has an @area value of "decisions".)
>>
>> This works, and dumps out a text list of bad links (and one "good"
>> per good link, so I can count them) but it is slow in ways that make me
>> think it's at least O(N^2) and probably worse.  Performance is
>> reasonable on small subsets of the content but very slow on the whole of
>> the available XML.  Since I have to check the whole thing as a unit to
>> properly test the ~two fifths of the links that refer to the decisions
>> area, and since I will eventually have to run the check on two or three
>> times as much content, O(N^2) is a problem; I can't say it takes a week
>> to run the link check.  (Well, I can _say_ it, but it won't meet with a
>> pleased reception.)
>>
>> How can I make this check faster?
>>
>> I can run this as an XQuery via eXist, via XPath 2.0 using Saxon 9.2, or
>> (if really necessary) some other means, and would appreciate any
>> suggestions from the list's collective wisdom.
>>
>> Thanks!
>> Graydon
