Subject: [xsl] RE: Incremental transformations with Xalan and performance issues?
From: "Andrzej Jan Taramina" <andrzej@xxxxxxxxxxx>
Date: Sat, 04 Dec 2004 18:50:59 -0500
Michael:

Thanks for the response. BTW, I use your XSLT book as my primary reference...nice work!

> You might find it better to ask such questions on the xsl-list at
> mulberrytech.com, or if you're really interested only in Xalan, on a
> Xalan-specific forum.

Like many, I suffer from YAL (Yet Another List) syndrome and am hesitant to 
subscribe to any more lists, given how much stuff I already receive.  I knew 
some XSLT heavyweights (like yourself) hang out here, hence the decision to 
post to the xml-dev group.  However, I've now also cross-posted to the xsl 
group.

I also think that as XML adoption continues to accelerate, transformations of extremely 
large documents using XSLT will be more and more a general concern to the community.

> In general, every mainstream XSLT processor today builds a tree
> representation of the input document in memory. I believe Xalan does parsing
> and transformation in parallel, but it still builds the tree. The fact that
> the parser and the transformer communicate using SAX is irrelevant - it just
> means that the transformer and not the parser is building the tree. (This
> isn't totally irrelevant, because the transformer can build a much more
> efficient tree knowing it is read-only. But it's still an in-memory tree.)

I might have to redesign how we handle our XML in that case, to keep each mailmerge 
recipient entry in a separate document, rather than have the whole thing as one 
monolithic document.

Do you happen to know if anyone has tried to build an XSLT engine that does incremental 
transformations on incoming SAX events, without requiring the building of a tree?  That 
kind of approach, where the transform is suited to it, would be much more efficient in 
memory usage and would, I should think, allow transforms of documents of virtually 
unlimited size.  Something to investigate...
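
To make concrete what I mean by "incremental", here is a minimal hand-written 
sketch in Java (not XSLT, and not something any existing engine does for you).  It 
assumes a hypothetical input shape of <letters><recipient><name>...</name></recipient>
</letters> and writes one greeting line per recipient as the SAX events arrive, so 
nothing but the current text buffer is held in memory.  The class name 
StreamingLetters and the element names are made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: emits output directly from SAX events, no tree is built.
class StreamingLetters extends DefaultHandler {
    private final Writer out;                               // output is written as we go
    private final StringBuilder text = new StringBuilder(); // text of the current <name> only
    private boolean inName;

    StreamingLetters(Writer out) {
        this.out = out;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if ("name".equals(qName)) {   // "name" is an assumed element name
            inName = true;
            text.setLength(0);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inName) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if ("name".equals(qName)) {
            inName = false;
            try {
                out.write("Dear " + text + "\n");
            } catch (IOException e) {
                throw new SAXException(e);
            }
        }
    }

    // Convenience wrapper for the demo: parses a string and returns the output.
    static String render(String xml) {
        try {
            StringWriter sink = new StringWriter();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), new StreamingLetters(sink));
            return sink.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.print(render(
            "<letters><recipient><name>Ann</name></recipient>"
            + "<recipient><name>Bob</name></recipient></letters>"));
    }
}
```

The obvious cost is that the "stylesheet" is now Java code, and anything that needs 
to look backwards or forwards in the document has to be buffered by hand.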

> I can't speak for Xalan, but Saxon users are running transformations up to
> 200Mb or so without too much trouble, and at speeds up to 10Mb/sec. It
> requires a little care in configuring the memory allocation, and in writing
> the stylesheet to avoid non-linear constructs, but it's certainly doable.
> Beyond that, it probably gets difficult. 

I'm using Xalan (inside Cocoon), and for this task have not yet figured out a way to use 
Saxon because of some extensions I rely on.  More specifically, I need to get/put stuff 
into the session, and am using something like this (in Xalan):

<xalan:component prefix="javaSession">
	<xalan:script lang="javaclass"
		src="xalan://org.apache.cocoon.environment.Session"/>
</xalan:component>

Then have templates like:

<xsl:template name="javaCall:setSessionAttribute">
	<xsl:param name="attributeName" select="'unknown'" />
	<xsl:param name="attributeValue"/>
	<xsl:param name="session"/>
		
	<xsl:variable name="dummy" 
		select="javaSession:setAttribute( $session, $attributeName, $attributeValue )"/>
</xsl:template>
	
<xsl:template name="javaCall:getSessionAttribute">
	<xsl:param name="attributeName" select="'unknown'" />
	<xsl:param name="session"/>
		
	<xsl:copy-of select="javaSession:getAttribute( $session, $attributeName )"/>
</xsl:template>

The session parameter is a reference to the user's session that is passed in from the 
calling stylesheet with a bit of magic from a custom Cocoon transformer class.

This works fine with Xalan: if you save a tree fragment and then retrieve it, you end up 
with a node list/tree fragment as desired.  With Saxon, however, if I instead use the 
Saxon component definition:

<saxon:script language="java" 
				implements-prefix="javaSession" 
				src="java:org.apache.cocoon.environment.Session"/>

I can save a result fragment, but when I retrieve it, I don't get a node list/tree 
fragment.  I haven't figured out how to correct this yet with Saxon.

If it weren't for this, I could freely switch between the two XSLT engines with a build 
parameter.

> You don't actually say what you mean
> by a "large document". (Personally, I am amazed to see people handling a 200Mb
> database as a single in-memory document, but perhaps I'm just old-fashioned).

I'm not sure yet...the client has not given me any indication of how big the mail merge 
might be.  1M letters would hit the database limit of 2GB for the XML document in 
the table column (CLOB).  100K letters would hit the 200MB level that you mentioned.

I'd rather implement a solution that has no limitations, so given the lack of a true 
"incremental/SAX" based transformer implementation, I'm thinking that I'll need to move 
away from the monolithic document approach and store each recipient's info in a separate 
small document to work around the current XSLT document size limitations.
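
If the monolithic document has to stay for a while, a halfway house might be to split 
it on the fly.  The following is only a sketch using the standard JAXP/TrAX interfaces 
(SAXTransformerFactory and TransformerHandler): a SAX handler feeds each record 
element to its own TransformerHandler, so the engine only ever builds a tree for one 
recipient at a time.  The element name "recipient", the class name PerRecordTransform 
and the tiny stylesheet are assumptions for illustration, and namespace prefix 
mappings are not forwarded.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Templates;
import javax.xml.transform.TransformerConfigurationException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example: runs the stylesheet once per record instead of once per document.
class PerRecordTransform extends DefaultHandler {
    private final SAXTransformerFactory factory;
    private final Templates templates;          // stylesheet compiled once, reused per record
    private final String recordName;
    // Collected here only for the demo; real code would write each result out immediately.
    private final List<String> results = new ArrayList<>();
    private TransformerHandler current;         // non-null while inside a record
    private StringWriter currentOut;
    private int depth;

    PerRecordTransform(String xsl, String recordName) throws TransformerConfigurationException {
        // Assumes the configured TransformerFactory supports SAX (the JDK default does).
        this.factory = (SAXTransformerFactory) TransformerFactory.newInstance();
        this.templates = factory.newTemplates(new StreamSource(new StringReader(xsl)));
        this.recordName = recordName;
    }

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts)
            throws SAXException {
        if (current == null && recordName.equals(local)) {
            try {
                current = factory.newTransformerHandler(templates);
            } catch (TransformerConfigurationException e) {
                throw new SAXException(e);
            }
            currentOut = new StringWriter();
            current.setResult(new StreamResult(currentOut));
            current.startDocument();            // each record becomes its own small document
            depth = 0;
        }
        if (current != null) {
            depth++;
            current.startElement(uri, local, qName, atts);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (current != null) {
            current.characters(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) throws SAXException {
        if (current == null) {
            return;
        }
        current.endElement(uri, local, qName);
        if (--depth == 0) {                     // record closed: finish this small document
            current.endDocument();
            results.add(currentOut.toString());
            current = null;
        }
    }

    static List<String> transformEach(String xml, String xsl, String recordName) {
        try {
            PerRecordTransform handler = new PerRecordTransform(xsl, recordName);
            SAXParserFactory spf = SAXParserFactory.newInstance();
            spf.setNamespaceAware(true);        // needed so local names are reported
            spf.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.results;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xsl = "<xsl:stylesheet version='1.0' "
            + "xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='text'/>"
            + "<xsl:template match='recipient'>Dear <xsl:value-of select='name'/>"
            + "</xsl:template></xsl:stylesheet>";
        String xml = "<letters><recipient><name>Ann</name></recipient>"
            + "<recipient><name>Bob</name></recipient></letters>";
        transformEach(xml, xsl, "recipient").forEach(System.out::println);
    }
}
```

This only helps when each letter depends on its own record alone; anything that 
needs data from elsewhere in the document would have to be passed in as a parameter.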

> If you really need purely serial processing, you might consider STX as an
> alternative. However, the existing STX implementations are far less
> widely-used or mature than the popular XSLT implementations.

That's not an option in our case, since we rely on XSLT so much.


Andrzej Jan Taramina
Chaeron Corporation: Enterprise System Solutions
http://www.chaeron.com
