Subject: Re: [xsl] optimization of complex XPath
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Fri, 19 Nov 2010 10:24:00 +0100
On 19 November 2010 10:04, Michael Kay <mike@xxxxxxxxxxxx> wrote:
> I think Saxon-EE will automatically optimize this join for you and run
> it in O(n log n) time.
>
> Or you could hand-optimize it by running in XSLT and using keys.
>
> Both these techniques will use more memory, something which is probably
> in short supply; if you need to optimize memory consumption as well as
> execution time then you could consider some kind of filter/sort/merge
> process, which might be a lot more complex to write.

What's the fastest and memory-cheapest way to extract these links and
targets to two different plain text files (or even a single one)?
Post-processing with utilities such as grep, sort -u and comm wouldn't
be "complex".
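For illustration, a minimal extraction stylesheet along these lines
might look as follows. This is an untested sketch: the element and
attribute names are taken from Graydon's description below, and it
writes one link cite value per line; the target values could be pulled
out the same way by selecting //num/@cite instead.

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- emit each cite value of a link into the "decisions" area,
           one per line -->
      <xsl:template match="/">
        <xsl:for-each select="//link[@area='decisions']/@cite">
          <xsl:value-of select="."/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>

With the link cites in one file and the target cites in another,
something like "sort -u links.txt > l; sort -u cites.txt > c;
comm -23 l c" would list exactly the broken cite values.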
-W

> Michael Kay
> Saxonica
>
> On 19/11/2010 01:20, Graydon wrote:
>>
>> A bit of background before I get to the question:
>>
>> I have about 2 GiB of XML which contains links. These links are
>> constructed as <link area="value" cite="numvalue"/>, and reference num
>> elements with matching cite values, <num cite="numvalue"/>. (The area
>> value is associated with an ancestor of the num element.) The
>> guarantee is that an area and cite pair is unique across the entire
>> content set.
>>
>> The roughly one-third of the content represented by that 2 GiB needs
>> to be checked for broken links. (The other 2/3 needs to be converted
>> from SGML, and _then_ it will need to be checked for broken links.)
>>
>> The currently converted content is all in the area "decisions", so I
>> can only check links that point _to_ the area "decisions": the
>> internal links of the already-converted content. That means I have
>> about 100,000 out of the 250,000 links to check on this pass, but I
>> will eventually have to validate all the links in the whole (5+ GiB)
>> content set.
>>
>> I am currently using the XPath expression:
>>
>>   for $x in (//link[@area='decisions']/@cite) return
>>     (if ($x = (//num/@cite))
>>      then 'good'
>>      else concat('bad cite value ', $x, '
>> ')
>>     )
>>
>> to check the links. (I don't need to qualify the num element in
>> //num/@cite with [ancestor::*[1]/@area='decisions'] because all the
>> currently converted content has an @area value of "decisions".)
>>
>> This works, and dumps out a text list of bad links (and one "good"
>> per good link, so I can count them), but it is slow in ways that make
>> me think it's at least O(N^2) and probably worse. Performance is
>> reasonable on small subsets of the content but very slow on the whole
>> of the available XML. Since I have to check the whole thing as a unit
>> to properly test the ~two fifths of the links that refer to the
>> decisions area, and since I will eventually have to run the check on
>> two or three times as much content, O(N^2) is a problem; I can't say
>> it takes a week to run the link check. (Well, I can _say_ it, but it
>> won't meet with a pleased reception.)
>>
>> How can I make this check faster?
>>
>> I can run this as an XQuery via eXist, via XPath 2.0 using Saxon 9.2,
>> or (if really necessary) some other means, and would appreciate any
>> suggestions from the list's collective wisdom.
>>
>> Thanks!
>> Graydon
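For reference, a minimal sketch of the key-based check Michael
suggests might look like this. It is untested, uses the names from
Graydon's description, and assumes the content is processed as one
document:

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- index every num element by its cite value -->
      <xsl:key name="target" match="num" use="@cite"/>
      <xsl:template match="/">
        <xsl:for-each select="//link[@area='decisions']">
          <xsl:choose>
            <xsl:when test="key('target', @cite)">
              <xsl:text>good&#10;</xsl:text>
            </xsl:when>
            <xsl:otherwise>
              <xsl:value-of
                  select="concat('bad cite value ', @cite, '&#10;')"/>
            </xsl:otherwise>
          </xsl:choose>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>

The key turns each lookup into a hash probe, so the pass over the
links should be roughly linear rather than rescanning //num/@cite for
every link, at the cost of holding the index in memory.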