Subject: Re: [xsl] [xslt performance for big xml files]
From: Liam Quin <liam@xxxxxx>
Date: Sat, 25 Apr 2009 12:40:10 -0400

On Fri, Apr 24, 2009 at 06:25:47PM -0700, Aditya Sakhuja wrote:
> I am looking for some tips on performance tuning for my xslt which
> processes a 30 MB file on average.  I am having some serious issues
> getting the processing completed under 24 hrs.

A 41MByte file here takes maybe 30 seconds to a minute to process
(using Saxon).

> 3> Does xslt processing not fit large xml file processing?
> Should I try looking at other stream-based processing instead, if xslt
> does not scale?

You'd need to know why it's slow.  Although that's not a particularly
large file (as Mike Kay pointed out), you can easily write a
stylesheet that uses a lot of memory (as you can in any language
of course).

One possibility is that you are using all the memory on the system,
so that overall the system is swapping to disk.

Another is that you're using // and/or preceding-sibling a lot in
XPath expressions; libxslt doesn't optimise those very well, and
the obvious implementation of // scans the entire document tree,
making a huge node set of results: not just every element but
(like one of those "how many triangles are there in the figure"
puzzles) every possible sub-tree.  I've seen considerable speedups
with all the processors by turning // into an explicit path such
as /top/next/thingy.
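
As a rough illustration (the element names "catalog", "section",
"item" and the "name" attribute here are invented, not taken from
your data), the explicit-path version looks like this:

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Spelling out the path from the root means the processor
           only walks the steps it needs, instead of the whole-tree
           scan that select="//item" can trigger. -->
      <xsl:template match="/">
        <xsl:apply-templates select="/catalog/section/item"/>
      </xsl:template>

      <xsl:template match="item">
        <xsl:value-of select="@name"/>
        <xsl:text>&#10;</xsl:text>
      </xsl:template>

    </xsl:stylesheet>

The same stylesheet with select="//item" asks the processor to
consider every node in the document instead.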

One of my first SQL queries took several hours to run; a small
fix got it going in under a second, once I understood joins
a little better.

Finally -- the real reason for posting to this thread:
* If you are repeatedly scanning the same XML document and generating
  different things, such as reports or small documents, consider an
  XQuery implementation that uses an index.  For a file this small,
  for example, the free MarkLogic or Qizx/fe engines would work fine,
  I expect; the first is limited to 50MBytes (or was, last I looked)
  and the second to a gigabyte.

* If you read the document once, but tend not to need to look ahead or
  behind very far, you could split the input into smaller XML files,
  perhaps with a single XML document containing any metadata you do need
  to look at, and then run XSLT on each of those smaller files (see the
  first sketch after this list).  If you're on Linux or Solaris
  (Oraclis?), you could use "make" to control it and then easily take
  advantage of multiple CPU cores.

* If you have written templates to do things like Taylor series
  expansions to calculate sine, cosine, and so on, you should consider
  calling external functions, or making an XML file with precomputed
  results (generated e.g. in Perl or Python or even C); see the second
  sketch after this list.

* To restate that more clearly, you don't have to do everything in
  one step.  I've seen processing/conversions that use 10 or even
  20 steps, with small scripts each doing one thing.  If you save
  the intermediate files (at least for testing) you can also do
  simple regression testing using "diff" on the current and
  last-but-one version of the output for a given stage.
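
Here is a minimal sketch of the splitting idea, using XSLT 2.0's
xsl:result-document (so Saxon rather than libxslt); the element names
"records" and "record" are invented stand-ins for whatever repeats
in your input:

    <xsl:stylesheet version="2.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <xsl:template match="/records">
        <!-- Write each record to its own small file; a later pass,
             perhaps driven by make, can then transform the chunks
             independently and in parallel. -->
        <xsl:for-each select="record">
          <xsl:result-document href="chunks/record-{position()}.xml">
            <xsl:copy-of select="."/>
          </xsl:result-document>
        </xsl:for-each>
      </xsl:template>

    </xsl:stylesheet>

With Saxon that would be run along the lines of
    java net.sf.saxon.Transform -s:big.xml -xsl:split.xsl
(the exact flags depend on the Saxon version you have).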
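
And a sketch of the lookup-table idea: suppose a script has already
written a file sin-table.xml of precomputed values (the file name, the
"entry" elements, and the "angle" element in the source are all
invented for this example):

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

      <!-- Index the precomputed table, e.g.
           <table><entry deg="30" value="0.5"/> ... </table>,
           instead of re-deriving the series inside the transform. -->
      <xsl:key name="sin-by-deg" match="entry" use="@deg"/>

      <xsl:template match="angle">
        <xsl:variable name="deg" select="."/>
        <!-- key() searches the document of the current node, so
             switch context to the lookup file first. -->
        <xsl:for-each select="document('sin-table.xml')">
          <xsl:value-of select="key('sin-by-deg', $deg)/@value"/>
        </xsl:for-each>
      </xsl:template>

    </xsl:stylesheet>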
  
The 41MByte file I mentioned above is a biographical dictionary from
1811 or so, processed with three Perl steps (one of which is actually
rather slow, fixing common OCR errors in the input); on a fairly loaded
system it still takes under two minutes to do all the stages, including
checking the XML automatically between stages and making approximately
10,000 HTML output files with XSLT.

Hope this helps, and gives you encouragement!

Liam

-- 
Liam Quin, W3C XML Activity Lead, http://www.w3.org/People/Quin/
http://www.holoweb.net/~liam/ * http://www.fromoldbooks.org/
