Subject: RE: [xsl] optimization for very large, flat documents
From: Pieter Reint Siegers Kort <pieter.siegers@xxxxxxxxxxx>
Date: Tue, 18 Jan 2005 18:30:12 -0600
Hi Kevin,

It has to do with the way the input source is built in memory, i.e. as a
tree-like structure with relationships; likewise, the result that the XSLT
produces is actually another tree, which is serialized by the application
that saves it to disk. Serialization is not part of the XSL transformation;
on disk, the document is stored in a very different way.

In the end, to work with a big input source or a big stylesheet, either one
must be parsed and rebuilt *completely* in memory to obtain the same
tree-like structure it had before it was saved.

You could use a SAX-like approach, but I'm not sure how to do that - maybe
others can jump in here.
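To give a rough idea of what I mean by SAX-like: something along these lines
(a minimal Python sketch; the element name "entry" is my assumption for
whatever the bibliography actually uses). The point is that the handler sees
each entry as it streams past, without ever building the whole tree:

```python
import io
import xml.sax

class EntryHandler(xml.sax.ContentHandler):
    """Collect the text of each <entry> as it streams by."""
    def __init__(self):
        super().__init__()
        self.in_entry = False
        self.current = []
        self.entries = []

    def startElement(self, name, attrs):
        if name == "entry":        # assumed element name
            self.in_entry = True
            self.current = []

    def characters(self, content):
        if self.in_entry:
            self.current.append(content)

    def endElement(self, name):
        if name == "entry":
            self.in_entry = False
            self.entries.append("".join(self.current).strip())

# Tiny stand-in for the 600 MB bibliography:
sample = b"<bibliography><entry>Knuth 1968</entry><entry>Kay 1993</entry></bibliography>"
handler = EntryHandler()
xml.sax.parse(io.BytesIO(sample), handler)
print(handler.entries)
```

Memory use stays proportional to one entry, not to the whole file - which is
exactly what the tree-building parsers cannot give you.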

But if, as you say, your entries are independent of each other, then another
approach (one I would risk taking) is to read your big XML source in chunks
and process each chunk independently, as if it were a separate XML input
source, saving the results to a common file. Of course, reading in chunks
must not be done by the XML parser and tree builder - it could very well be
an application that opens the file, reads in a chunk, passes it to the XSL
processor, and so on.
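A sketch of that chunking idea in Python, using the standard library's
incremental parser. The transform step here is only a placeholder (it
uppercases the entry text) - in practice that is where you would hand the
chunk to Saxon or another XSLT processor as its own little document. Again,
"entry" is an assumed element name:

```python
import io
import xml.etree.ElementTree as ET

def transform_chunk(chunk_xml):
    # Placeholder: in the real setup, run the stylesheet over this
    # mini-document with your XSLT processor instead.
    elem = ET.fromstring(chunk_xml)
    return elem.text.upper()

def process_in_chunks(source, out):
    # Stream the file; each time an </entry> closes, cut it out,
    # transform it on its own, append to the common output, and
    # free the subtree so memory stays flat.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "entry":    # assumed element name
            chunk = ET.tostring(elem, encoding="unicode")
            out.write(transform_chunk(chunk) + "\n")
            elem.clear()

sample = b"<bibliography><entry>Knuth 1968</entry><entry>Kay 1993</entry></bibliography>"
out = io.StringIO()
process_in_chunks(io.BytesIO(sample), out)
print(out.getvalue())
```

Each chunk is a well-formed document on its own, so the stylesheet never
needs to see - or pay the memory cost of - its siblings.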

A final observation/question: why was that big 600 MB XML file created in
the first place? I would have chosen to fill a database, query from there,
apply the XSLTs, and save the result in whatever format you need. But then
again, you or whoever produced the file must have had reasons to create it
as one single file...

I would wish for a filesystem that could store an input source (whether XML
or XSL) as a direct representation of the tree-like structure; that way we
would be freed from using lots of memory (and gain performance?).

Cheers,
<prs/>

-----Original Message-----
From: Kevin Rodgers [mailto:kevin.rodgers@xxxxxxx] 
Sent: Martes, 18 de Enero de 2005 12:05 p.m.
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: [xsl] optimization for very large, flat documents

I'm trying to process a very large (600 MB) flat XML document, a
bibliography where each of the 400,000 entries is completely independent of
the others.  According to the Saxon web site and mailing list, it'll take
approx. 5-10 times that (3 GB) to hold the document tree in memory, which is
impractical.  The Saxon mailing list also has some tips about how to
accomplish this, but my question is: Why doesn't XSLT provide a way to
specify that a matched node can be processed independently of its
predecessor and successor siblings?  Alternatively, couldn't an XSLT
processor infer that from the complete absence of XPath expressions that
refer to predecessor and successor siblings?

--
Kevin Rodgers
