Re: [xsl] Transforming large XML docs in small amounts of memory

Subject: Re: [xsl] Transforming large XML docs in small amounts of memory
From: Ronan Klyne <ronan.klyne@xxxxxxxxxxx>
Date: Mon, 30 Apr 2007 11:00:31 +0100
Andrew Welch wrote:
> Much can be done, but your available options all depend on the
> processor and environment you're running, and how flexible you are -
> is it a pure XSLT 1.0/2.0 solution you're after, or can you use
> extensions or modify the processing pipeline?

It's purely XSLT 1.0, using Saxon (on Linux and Windows, if that
matters...), although suggestions to change this would not be shunned.
The input XML is the only real fixed quantity, due to the amount of work
that would be required to change the code generating it, given that it
already 'works'.

> Also you need to let us know:
> 
> - Is the input uniform chunks of data in a single file?  (likely if
> it's a "data-centric" XML file) or does the processing require access
> to the whole input for the whole transform?

The majority of the XSL draws on data from all over the input document,
which I suspect will be constraining. There are substantial sections of
the input document which could be described as uniform, but I would not
say that the term applies to the document as a whole.

> - What is your current memory usage?  What's the limit, what is an
> acceptable bound? etc.

The servers we're using have several GB of memory in them, but my
objective is to increase the potential for concurrency by reducing the
resource requirements of each transform. I think that transforming 150 MB
of data in 400 MB of RAM would be a sensible target (is this realistic?).

> - How are you measuring memory usage?  Is it simply the input XML that
> is using up all available memory, or do other parts of the pipeline
> use a lot of memory too?

I'm measuring it by increasing the maximum amount of memory available to
Java until the transform runs without throwing OutOfMemoryError (which
also solves the immediate problem). The larger transforms (150 MB of
input) are taking ~1 GB of memory to run. I'm not sure how to tell what
proportion of the memory is used for the input DOM, the output DOM, etc.
Which reminds me, I should mention that the output document is under 1 MB.
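
A finer-grained number than trial and error with the heap limit could
come from asking the JVM for its peak heap usage around the transform.
What follows is only a minimal sketch, assuming the transform is driven
through the standard JAXP API with whichever processor is on the
classpath. The class name HeapProbe and the tiny inline stylesheet are
made up for illustration, and summing the per-pool peaks gives an
upper-bound estimate rather than an exact figure, since the pools need
not peak at the same moment.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: reports approximate peak heap use around one transform.
final class HeapProbe {

    // Illustrative stylesheet (not from the original post): counts <b> elements.
    static final String COUNT_B_XSLT =
        "<xsl:stylesheet version='1.0'"
        + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='/'>"
        + "<xsl:value-of select='count(//b)'/>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    // Reset the recorded peak of every memory pool to its current usage.
    static void resetPeaks() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            pool.resetPeakUsage();
        }
    }

    // Sum of the peak usage of all heap pools since the last reset, in bytes.
    // Pools may peak at different times, so treat this as an upper bound.
    static long peakHeapBytes() {
        long total = 0;
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() == MemoryType.HEAP) {
                MemoryUsage peak = pool.getPeakUsage();
                if (peak != null) {
                    total += peak.getUsed();
                }
            }
        }
        return total;
    }

    // Run an XSLT transform over in-memory strings via plain JAXP.
    static String transform(String xslt, String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(xslt)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Build a toy input; a real run would stream the 150 MB file instead.
        StringBuilder xml = new StringBuilder("<a>");
        for (int i = 0; i < 100000; i++) {
            xml.append("<b/>");
        }
        xml.append("</a>");

        resetPeaks();
        String result = transform(COUNT_B_XSLT, xml.toString());
        long peakMb = peakHeapBytes() / (1024 * 1024);

        System.out.println("b elements counted: " + result);
        System.out.println("approximate peak heap (MB): " + peakMb);
    }
}
```

Comparing the peak for an identity-style transform against the peak for
the full stylesheet would give a rough idea of how much of the total is
the input tree alone, though the figure still includes the JVM's own
overhead and any garbage not yet collected.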

	# r

-- 
Ronan Klyne
Business Collaborator Developer
Tel: +44 (0)870 163 2555
ronan.klyne@xxxxxxxxxxx
www.groupbc.com
