RE: [xsl] coping with huge xml-saxon

From: "Passin, Tom" <tpassin@xxxxxxxxxxxx>
Date: Mon, 16 Jun 2003 12:32:05 -0400

> My XML is a business critical file and it needs to be 
> converted to various
> forms like html text or RTF files.
> Its interesting you mentioned that XSLT may not be the right 
> solution. What
> else would you think of  for a business that needs this kind 
> of requirement
> to be solved? Like conversion of XML to multiple files?


So much depends on the nature of the data and what you want to do with
it.  You said your files could be 600 MB in size.  I bet that you are
not planning to produce 600 MB html files, are you?  Browsers would not
be able to handle them, and even if they could, no user would want to
try to read them.

If you were making a large book, you would break it up into separate
chapters.  If you were analyzing log file data, you could break it up
into, say, months or days.  You could break them up with some kind of
text processing or, better (since I guess the source data is in XML),
using SAX.  SAX does not have to store the whole thing in memory, so you
can stream the document right through, producing your smaller pieces.
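As a minimal sketch of that streaming idea, here is a Python SAX handler
(the element name "record" and the in-memory `pieces` list are
illustrative assumptions - in real use you would write each piece out to
its own file as it completes, keeping memory flat regardless of input
size):

```python
import xml.sax
from io import StringIO
from xml.sax.saxutils import escape, quoteattr

class SplitHandler(xml.sax.ContentHandler):
    """Stream a large document and collect each <record> element as a
    standalone piece.  Only one record is buffered at a time, so memory
    use stays flat no matter how big the input is.  (Hypothetical sketch:
    a real run would write each piece to its own file instead of
    appending it to the `pieces` list.)"""

    def __init__(self, record_tag="record"):
        super().__init__()
        self.record_tag = record_tag
        self.buffer = None   # StringIO while inside a record, else None
        self.pieces = []

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.buffer = StringIO()
        if self.buffer is not None:
            attr_text = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
            self.buffer.write(f"<{name}{attr_text}>")

    def characters(self, content):
        if self.buffer is not None:
            self.buffer.write(escape(content))

    def endElement(self, name):
        if self.buffer is not None:
            self.buffer.write(f"</{name}>")
            if name == self.record_tag:
                self.pieces.append(self.buffer.getvalue())
                self.buffer = None

handler = SplitHandler()
xml.sax.parseString(
    b"<log><record id='1'>a</record><record id='2'>b</record></log>",
    handler,
)
print(handler.pieces)
# -> ['<record id="1">a</record>', '<record id="2">b</record>']
```

The same shape works at 600 MB: the parser pushes events through the
handler and never builds a tree for the whole document.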

If your data comes from a spreadsheet or database, it probably contains
a large number of rows representing the records.  There must be some
criteria by which you want to break them up for presentation.  Use SAX
to break them up by those criteria.
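A sketch of breaking rows up by such a criterion, again in Python SAX -
the `<row>` element and its `month` attribute are hypothetical stand-ins
for whatever field your records actually carry:

```python
import xml.sax
from collections import defaultdict

class GroupingHandler(xml.sax.ContentHandler):
    """Route each <row> into a bucket keyed by one of its attributes
    (here a hypothetical month="..." attribute).  Each bucket would
    later become its own output file."""

    def __init__(self, key_attr="month"):
        super().__init__()
        self.key_attr = key_attr
        self.buckets = defaultdict(list)
        self.current_key = None
        self.text = []

    def startElement(self, name, attrs):
        if name == "row":
            self.current_key = attrs.get(self.key_attr, "unknown")
            self.text = []

    def characters(self, content):
        if self.current_key is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == "row":
            self.buckets[self.current_key].append("".join(self.text))
            self.current_key = None

handler = GroupingHandler()
doc = b"""<table>
  <row month="2003-05">alpha</row>
  <row month="2003-06">beta</row>
  <row month="2003-06">gamma</row>
</table>"""
xml.sax.parseString(doc, handler)
print(sorted(handler.buckets))      # -> ['2003-05', '2003-06']
print(handler.buckets["2003-06"])   # -> ['beta', 'gamma']
```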

Once you have the pieces, you can process them separately.  If you need
to combine some of them, or create totals over the entire data set, you
can use xslt to extract the data you need to work with from the separate
files you created with SAX (or whatever).
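The combining step is cheap once the pieces are small.  A sketch of
totalling over the whole data set by parsing each small piece in turn
(the `<sales>`/`<total>` structure is an invented example, and a real
run would read the piece files from disk rather than a list of strings):

```python
import xml.etree.ElementTree as ET

# Hypothetical pieces produced by the SAX split step, one small
# document each.  In practice these would be the files on disk.
pieces = [
    "<sales><total>120</total></sales>",
    "<sales><total>80</total></sales>",
    "<sales><total>200</total></sales>",
]

# Accumulate a grand total across the entire data set without ever
# loading one huge file: each piece is parsed and discarded in turn.
grand_total = sum(int(ET.fromstring(p).findtext("total")) for p in pieces)
print(grand_total)  # -> 400
```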

If you work this out right, you will probably find that you will be able
to drive it with a batch file so that you can repeat it on new data
whenever necessary.  You may also end up creating some xml driver files
that contain the information for an xslt processor to assemble the
pieces.  Plan on using a series of steps with intermediate results
rather than doing everything at once.  With luck, the pieces will be
useful in themselves (like log files by month).
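The series-of-steps idea can be sketched as a tiny driver that threads
intermediate results from one stage to the next - the two step functions
here are trivial stand-ins for the real SAX split and XSLT transform,
not actual tools:

```python
# Run the pipeline as an ordered list of steps, each consuming the
# previous step's output, so the whole thing can be rerun on new data.
def split_step(data):
    # stand-in for the SAX splitting stage
    return data.split("|")

def transform_step(pieces):
    # stand-in for the per-piece XSLT transformation stage
    return [p.upper() for p in pieces]

pipeline = [split_step, transform_step]
result = "jan|feb|mar"
for step in pipeline:
    result = step(result)
print(result)  # -> ['JAN', 'FEB', 'MAR']
```

The real version would invoke Saxon and your SAX splitter from a batch
file or driver script in exactly the same shape.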

Remember that a divide-and-conquer approach, which is basically what I
am describing, can work wonders, not only in reducing the memory load
but also in increasing the speed of processing.  The Fast Fourier
Transform is a classic example of divide-and-conquer.  A brute-force
spectral transformation takes O(n^2) time, which is prohibitive for
large data sets.  The FFT technique reduces the time to O(n log n),
which can make for huge decreases in processing time.

You can probably get similar benefits by dividing the task up
judiciously.

Cheers,

Tom P

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

