Re: [xsl] efficient traversal of combined collections in XSLT 3.0

Subject: Re: [xsl] efficient traversal of combined collections in XSLT 3.0
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Sat, 24 Nov 2012 15:27:24 +0000
The way we do this in maintaining the XSLT/XQuery specs (admittedly much smaller than your 4GB) is to maintain a derived document containing a list of valid link targets. This is regenerated when the base documents change, which is less frequently than the list is used. The list of valid anchors is much smaller than the base documents, so it can be loaded more quickly, and uses less memory.
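
A minimal sketch of how such a derived list might be used for the link check described in the quoted message below. The file name anchors.xml, the parameter name anchors-uri, and the <anchor id="area|cite"/> shape are assumptions made for illustration, not the layout used for the XSLT/XQuery specs; the element names, the key definition, and the collection URI are taken from the quoted stylesheet.

```xml
<xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <!-- Hypothetical derived list of production targets, assumed to contain
        <anchor id="area|cite"/> elements at any depth -->
   <xsl:param name="anchors-uri" select="'anchors.xml'"/>
   <xsl:variable name="anchors" select="doc($anchors-uri)"/>
   <!-- Only the newly converted content is loaded in full -->
   <xsl:variable name="content"
     select="collection('file:///home/graydon/stages/APFF?recurse=yes;select=*.xml')"/>
   <!-- Index over the small derived list -->
   <xsl:key name="anchor" match="anchor" use="@id"/>
   <!-- Same key as in the quoted stylesheet, used here for "content" only -->
   <xsl:key name="places" match="*[num[@cite]]"
     use="concat(ancestor-or-self::*[@area][1]/@area,'|',num[1]/@cite)"/>
   <xsl:template match="/">
     <bucket>
       <xsl:for-each select="$content//link, $content//target[not(reference-text/link)]">
         <xsl:variable name="k" select="concat(@area,'|',@cite)"/>
         <!-- A link is good if its target is in the derived list
              or somewhere in the content being checked -->
         <xsl:variable name="found"
           select="exists(key('anchor', $k, $anchors)) or exists($content/key('places', $k))"/>
         <xsl:element name="{if ($found) then 'good' else 'bad'}">
           <uri>
             <xsl:sequence select="base-uri(.)"/>
           </uri>
           <xsl:sequence select="."/>
         </xsl:element>
       </xsl:for-each>
     </bucket>
   </xsl:template>
</xsl:stylesheet>
```

With this arrangement the 4 GB of base documents are never loaded during the check itself; only the derived list and the documents being checked are.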

Also, generating the list of anchors is an operation that can be streamed; hopefully the resulting list is small enough that it can be held in memory for look-up purposes.
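
One way to build the list without holding everything in memory, sketched under assumptions: run a small stylesheet over each base document separately, so that only one document is in memory at a time, and collect the outputs into a single wrapper document. This is not streaming in the XSLT 3.0 sense and is not written to be streamable. The <anchors> and <anchor id="..."/> names are hypothetical; the key expression is the one from the quoted stylesheet.

```xml
<xsl:stylesheet version="2.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <!-- Applied to ONE base document per transformation -->
   <xsl:template match="/">
     <!-- Record the source so stale entries can be traced back to a file -->
     <anchors source="{base-uri(.)}">
       <!-- Every element that carries a citable number is a valid target -->
       <xsl:for-each select="//*[num[@cite]]">
         <anchor id="{concat(ancestor-or-self::*[@area][1]/@area,'|',num[1]/@cite)}"/>
       </xsl:for-each>
     </anchors>
   </xsl:template>
</xsl:stylesheet>
```

Only the documents that have changed since the last run would need to be reprocessed, which fits the point above that the list is regenerated less often than it is used.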

Michael Kay
Saxonica

On 24/11/2012 13:53, Graydon wrote:
So I have about 4.0 GB of "production" content: XML that's already in use, that can have deliverables generated from it, and that various groups of editors may change.

I have "content", some content (generally about .2 or .25 GB) that is being converted from SGML and which, before it is added to "production", needs to be checked to see if the links in it work.

Links use a combination of @area (the name of a scope within which the numbers are unique) and @cite (the number); this is for legislation, so the numbers can get complicated, but the basic scheme is pretty simple. (Targets are one direction in a bi-directional relationship, so a link in a fancy hat; they usually contain links, and we only need to check them if they _don't_ contain a link.)

The slightly tricky bit is that I want to check the links in "content" to see if they match something in "content" _or_ in "production"; XSLT 3.0's version of key() will accept an arbitrary top node, so (using the Saxon 9.4 that ships with the current oXygen, 14.1) I can declare the stylesheet to be version 3.0, combine "production" and "content" into "searchSpace", and define a key on that.

<xsl:stylesheet exclude-result-prefixes="xs" version="3.0"
   xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:variable name="content" select="collection('file:///home/graydon/stages/APFF?recurse=yes;select=*.xml')"/>
   <xsl:variable name="production"
     select="collection('file:///home/graydon/stages/production/2012-11-13?recurse=yes;select=*.xml;on-error=ignore')"/>
   <xsl:variable name="searchSpace" select="($content,$production)"/>
   <xsl:key match="*[num[@cite]]" name="places" use="concat(ancestor-or-self::*[@area][1]/@area,'|',num[1]/@cite)"/>
   <xsl:template match="/">
     <bucket>
       <xsl:for-each select="$content//link,$content//target[not(reference-text/link)]">
         <xsl:choose>
           <xsl:when test="key('places',concat(current()/@area,'|',current()/@cite),$searchSpace)">
             <good>
               <uri>
                 <xsl:sequence select="base-uri(.)"/>
               </uri>
               <xsl:sequence select="."/>
             </good>
           </xsl:when>
           <xsl:otherwise>
             <bad>
               <uri>
                 <xsl:sequence select="base-uri(.)"/>
               </uri>
               <xsl:sequence select="."/>
             </bad>
           </xsl:otherwise>
         </xsl:choose>
       </xsl:for-each>
     </bucket>
   </xsl:template>
</xsl:stylesheet>

This works well on content-sized chunks of input (.25 GB or so) and I get an answer in about 15 seconds.

It doesn't work on the full data set; 16 GB of RAM isn't enough to do this to 4 GB of data. Various wheels are in motion to get more RAM.

So maybe everything will be fine, but I can't help looking at that code and going "this is a really naive search; there has to be a more efficient way to do this."

So, O XSLT List, what's the more efficient way to do this?

Thanks!

-- Graydon
