Re: [xsl] Incremental transformations with Xalan and performance issues?

Subject: Re: [xsl] Incremental transformations with Xalan and performance issues?
From: Marian Olteanu <mou_softwin@xxxxxxxxx>
Date: Sat, 4 Dec 2004 22:46:42 -0800 (PST)
--- Andrzej Jan Taramina <andrzej@xxxxxxxxxxx> wrote:

> I'm in a situation where I need to parse some large documents, where the 
> first few elements are a preamble with various parameters and the end of the 
> document is a large list of entries.
> 
> Think of a mail merge, where the letter to be sent is defined first in the 
> mail merge xml, followed by numerous recipient entries, something like this:
> 
> <mailmerge>
> 	<letter>
> 		...letter def goes here
> 	</letter>
> 	<recipients>
> 		<recipient>
> 			...recipient data
> 		</recipient>
> 		<recipient>
> 			...recipient data
> 		</recipient>
> 		etc...
> 	</recipients>
> </mailmerge>
> 
> What I was wondering was how Xalan handles the processing of such large 
> documents (say a million recipient entries) when the parser is using SAX?
> 
> More specifically, if I create global variables such as:
> 
> 	<xsl:variable name="letterTemplate" select="/mailmerge/letter"/>
> 
> then later:
> 
> 	<xsl:template match="recipients/recipient">
> 		<!-- process the recipient using $letterTemplate -->
> 	</xsl:template>
> 
> Will the processing be incremental in nature, as SAX events are received by 
> Xalan?  That is, is Xalan smart enough to create the global as soon as it 
> can, followed by processing of each individual recipient as each related SAX 
> event is received?  In that case, having the shared global info early in the 
> document and the large list at the end would probably have beneficial 
> performance implications.
> 
> Or will the whole document have to be instantiated as some sort of internal 
> tree first?
> 
> Hopefully, it's incremental in nature, since otherwise we might blow out 
> memory with such large documents.
> 
> Any insight into the implications of processing such large documents, using 
> globals, xslt stylesheet structure, impact of element ordering in the 
> document and the like would be very much appreciated.
> 
> Thanks!
> 

First of all, in my experience, if you are concerned about performance, stay away from Xalan.
I must admit that I have not looked at XSLT speed since the summer of 2002, when a school project
had me working on an XSLT compiler. There I focused on speed but not on incremental parsing :-D ,
because I didn't really find a good application for it. Testing different processors at that time,
I got the following results:
          AXXEL/1  AXXEL/3  XSLTC  XALAN  MSXML4  MSXML3  SAXON
mo.xsl       1352     3155   2564  61950    2379   10451   3985
sh.xsl        250     1713    ***   6205     655    1787    681
n-s.xsl      1041     1321   1201   4897   1065*    2243   2825

* = wrong output
*** = couldn't compile

Processors:
AXXEL/1 - my project: XSLT compiled to Java source code, output fully suppressed (JVM)
AXXEL/3 - my project: XSLT compiled to Java source code, with output (JVM)
XSLTC - XSLT to Java bytecode, found in Xalan (JVM)
SAXON - SAXON 6.5.2 (JVM)
XALAN - XALAN 2.3.1 (JVM)
MSXML3 - Microsoft MSXML 3.0
MSXML4 - Microsoft MSXML 4.0


Tests:
mo.xsl - an XML2HTML presentation sheet, fairly complex (a lot of templates and a lot of modes).
Artificially run 100 times (the main template runs the stylesheet 100 times, without re-parsing
the input XML).
sh.xsl - an XML2HTML presentation sheet, quite simple. Run internally 100 times, except for MSXML3
and MSXML4 (I don't remember why, but it didn't work), for which the time of a single run was
multiplied by 100.
n-s.xsl (number-string.xsl) - an artificial stylesheet that tests how fast the processor computes
the string value of a node (i.e. how fast it computes string(/) ) and the speed of normalize-space.

For the Java processors, JDK 1.4.0 was used (HotSpot client). The time was measured after the
HotSpot compiler did its job (simulating a server-side environment).
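
For anyone who wants to repeat this kind of measurement, here is a minimal sketch of timing a
stylesheet only after some warm-up runs, using the standard JAXP API. It is not the harness I used
in 2002. The class name WarmTiming, the method names and the run counts are made up for
illustration, and unlike the tests above it re-parses the input XML on every run.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper, not part of Xalan or any other processor.
class WarmTiming {

    /** Average milliseconds per transformation, measured after the warm-up runs. */
    static double averageMillis(String xslt, String xml, int warmup, int measured)
            throws TransformerException {
        // Compile the stylesheet once; whichever processor JAXP finds is the one tested.
        Templates compiled = TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));

        // Warm-up runs give the JIT compiler a chance to do its job; they are not timed.
        for (int i = 0; i < warmup; i++) {
            runOnce(compiled, xml);
        }

        long start = System.nanoTime();
        for (int i = 0; i < measured; i++) {
            runOnce(compiled, xml);
        }
        return (System.nanoTime() - start) / 1e6 / measured;
    }

    /** Runs one transformation and returns the output as a string. */
    static String runOnce(Templates compiled, String xml) throws TransformerException {
        StringWriter out = new StringWriter();
        compiled.newTransformer().transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
        return out.toString();
    }
}
```

Which processor gets measured depends on the TransformerFactory that JAXP picks up, so check
that before comparing numbers.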

I must admit that the tests were performed with mid-2002 software, but as you can see, Xalan is
far worse than anything else tested. MSXML 4.0 (written in C++) works great, and SAXON is very
close behind even though it is written in Java. Xalan was 10 to 15 times slower than SAXON on the
real stylesheets.
What I also found is that Java is not great at I/O during XSLT transformation: file manipulation
and string manipulation are quite slow.

Maybe things have changed in the last 2.5 years, but I doubt that the people at the Apache
foundation have learned how to write fast software. The latest release of Xalan is 2.6 and the
latest release of Saxon is 8.1.1, while the latest release of MSXML is still 4.0. I also bet that
not much has changed in XSLTC.




About the big XML issue: I recommend that you not expect any magic from an XSLT processor (such as
efficient incremental parsing). Instead, keep all your XML files small by dividing the information
across several files, which you can later access using the "document" function. For example, you
could move the mail content into a separate XML file if you don't access it too often. In my
experience, any XML file over 3 to 5 MB is a bad XML file.
Also, don't expect the XSLT processor to free the tree of an external XML file (loaded with the
"document" function) once you no longer hold a reference to it.
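
To illustrate the "divide it into smaller XMLs" advice, here is a rough sketch of a splitter that
streams the big file with StAX (a standard Java API, but newer than the JDK 1.4 used in the tests
above) and hands out small documents of a few recipients each, so the whole input is never held in
memory. The class name RecipientSplitter, the chunk size and the choice of a <recipients> wrapper
for each chunk are my assumptions; only the element name "recipient" comes from the example in the
question.

```java
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLEventFactory;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper: splits the <recipient> entries of a large document into small chunks.
class RecipientSplitter {

    /** Passes one small <recipients> document to sink for every chunkSize recipients. */
    static void split(Reader in, int chunkSize, Consumer<String> sink)
            throws XMLStreamException {
        XMLEventReader reader = XMLInputFactory.newInstance().createXMLEventReader(in);
        XMLOutputFactory outputs = XMLOutputFactory.newInstance();
        XMLEventFactory events = XMLEventFactory.newInstance();
        StringWriter buffer = null;
        XMLEventWriter writer = null;
        int depth = 0; // nesting depth inside the current <recipient>, 0 when outside
        int count = 0; // complete recipients written to the current chunk

        while (reader.hasNext()) {
            XMLEvent ev = reader.nextEvent();
            boolean startsRecipient = depth == 0 && ev.isStartElement()
                    && "recipient".equals(ev.asStartElement().getName().getLocalPart());
            if (depth == 0 && !startsRecipient) {
                continue; // preamble, letter, whitespace between entries: not copied
            }
            if (writer == null) { // first recipient of a new chunk
                buffer = new StringWriter();
                writer = outputs.createXMLEventWriter(buffer);
                writer.add(events.createStartElement("", "", "recipients"));
            }
            writer.add(ev);
            if (ev.isStartElement()) {
                depth++;
            } else if (ev.isEndElement()) {
                depth--;
            }
            if (depth == 0 && ++count == chunkSize) { // a recipient just closed, chunk is full
                finish(writer, events, buffer, sink);
                writer = null;
                count = 0;
            }
        }
        if (writer != null) { // last, partially filled chunk
            finish(writer, events, buffer, sink);
        }
    }

    private static void finish(XMLEventWriter writer, XMLEventFactory events,
                               StringWriter buffer, Consumer<String> sink)
            throws XMLStreamException {
        writer.add(events.createEndElement("", "", "recipients"));
        writer.flush();
        writer.close();
        sink.accept(buffer.toString());
    }
}
```

In real use the sink would write each chunk to its own file, and the letter definition would go
into a separate small file that the stylesheet reads with the "document" function.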


=====
Marian
http://www.utdallas.edu/~mgo031000/


		