RE: [xsl] Processing Memory-Hungry Data Sets with XSLT 2

Subject: RE: [xsl] Processing Memory-Hungry Data Sets with XSLT 2
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Wed, 12 Mar 2008 00:05:04 -0000
Almost any performance question is processor-specific to some extent.
That said, different processors are likely to use similar implementation
techniques much of the time.

Given your description of the problem, I would be looking for unnecessary
temporary trees and copy operations. With Saxon it's usually the case that
tree-construction (xsl:variable with content and no "as" attribute) is done
eagerly, whereas sequence construction (xsl:variable with a select
attribute) is done lazily.
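
To illustrate the distinction, here is a minimal sketch. The topicref
element and its href attribute are assumptions taken from your description
of DITA maps, not from your actual stylesheet, and the variable names are
made up for the example.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical template: "topicref" and "@href" are illustrative names. -->
  <xsl:template match="topicref">

    <!-- Tree construction: content and no "as" attribute. The variable
         holds a new document node containing a COPY of the topic. -->
    <xsl:variable name="topic-copy">
      <xsl:copy-of select="document(@href)/*"/>
    </xsl:variable>

    <!-- Sequence construction: a select attribute. The variable refers
         to the original nodes; no temporary tree is built. -->
    <xsl:variable name="topic" select="document(@href)/*"/>

    <!-- Content with an "as" attribute also yields a sequence of the
         original nodes rather than a temporary tree. -->
    <xsl:variable name="topic-seq" as="element()*">
      <xsl:sequence select="document(@href)/*"/>
    </xsl:variable>

    <xsl:apply-templates select="$topic"/>
  </xsl:template>

</xsl:stylesheet>
```

If the stylesheet has variables of the first kind holding whole topics or
maps, switching them to the second or third form is the first thing I would
try. Whether that is actually where the memory goes in your case is
something only measurement will tell.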

But with performance the devil is always in the detail, and sometimes it can
be in quite surprising places in the detail.

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: Eliot Kimber [mailto:ekimber@xxxxxxxxxxxx] 
> Sent: 11 March 2008 19:51
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] Processing Memory-Hungry Data Sets with XSLT 2
> 
> I'm implementing some DITA processing that is applied against 
> a large tree of maps and topics referenced from the maps in 
> order to generate HTML from the maps and the topics. There 
> are 10s of 1000s of maps and topics.
> 
> I have two processors: one is essentially an identity 
> transform that processes the map tree and copies it to the 
> output with a little bit of modification. The other is the 
> XML-to-HTML transform. It is still essentially a one-to-one 
> file-to-file transform but the result files are HTML instead 
> of copies. The process essentially does a top-down process of 
> the tree of maps, which consist of either links to submaps or 
> links to topics. Submaps are loaded and their topic links 
> processed. Links to topics result in loading the target 
> topics and processing them normally to generate HTML output. 
> This obviously results in a lot of source and target 
> documents in memory. The business logic is very simple; it's 
> just a lot of data.
> 
> Using Saxon 9, the first script can process my entire corpus, 
> but the second one (the HTML generator) fails about halfway 
> through with an out-of-memory failure using the largest VM I 
> can request under OS X (2 GB).
> 
> I tried using Saxon's extension discard-document() method but 
> that appeared to have no effect (I didn't really expect it to 
> since I don't think anything referenced ever gets unreferenced).
> 
> My question is, are there any XSLT 2 techniques that might 
> help avoid this type of memory usage issue that are generic 
> (as opposed to Saxon specific)? I can think of several 
> multi-pass approaches involving the creation of intermediate 
> files that would work but time is short so I'm trying to keep 
> this as simple as I can and still have it work, so I was 
> hoping there might be some clever way to make an otherwise 
> naive top-down process more memory efficient.
> 
> If the only answer is Saxon-specific then I'll move my 
> question to the Saxon list.
> 
> Thanks,
> 
> Eliot
> --
> Eliot Kimber
> Senior Solutions Architect
> "Bringing Strategy, Content, and Technology Together"
> Main: 610.631.6770
> www.reallysi.com
> www.rsuitecms.com
