Subject: Re: [xsl] slow xsltproc XInclude processing w/complex document?
From: Paul DuBois <paul@xxxxxxxxxxxx>
Date: Wed, 7 Jul 2004 13:10:55 -0500
Yesterday, I said:

At 14:58 -0500 7/6/04, Paul DuBois wrote:
I've been running some tests on a document that includes nested
XInclude directives. The document is complex: upwards of 1500 files,
nested to a depth of up to 4 levels.  Total size of content is about 4.8MB.

For simple testing, I'm attempting only to produce a "flattened"
document that just resolves the XIncludes.  Stylesheet looks like
this:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">


<!-- Identity transform, but "flatten" xincludes -->

<xsl:output method="xml" indent="yes"/>
<xsl:preserve-space elements="*"/>

<xsl:template match="xi:include" xmlns:xi="http://www.w3.org/2001/XInclude">
  <xsl:for-each select="document(@href)">
    <xsl:apply-templates/>
  </xsl:for-each>
</xsl:template>

<!-- identity transform -->

<xsl:template match="/ | node() | @* | comment() | processing-instruction()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

</xsl:stylesheet>

Follow-up/status report:


A couple of folks suggested some improvements to the stylesheet, mostly
aimed at eliminating loops and unnecessary node visits.  Thanks all.
The suggestions, however, made no difference at all.  (The resulting
execution times were within a few seconds of those from my original
attempts.)
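
For reference, the kind of change involved looks roughly like this (a
sketch of the idea only, not the exact templates that were suggested):
drop the xsl:for-each and apply templates directly to the included
document's children, and use the plain two-branch identity match:

<!-- sketch: xi:include handler without the for-each loop -->
<xsl:template match="xi:include" xmlns:xi="http://www.w3.org/2001/XInclude">
  <xsl:apply-templates select="document(@href)/node()"/>
</xsl:template>

<!-- sketch: the standard identity transform -->
<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>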

One thing I noticed while watching xsltproc more closely is that the
size of the output file remains zero for a long time and then, BOOM!!!,
I get 4.8 MB on disk in a couple of seconds.  During the time before
xsltproc writes anything, I see its memory use slowly climb.  (Depending
on the machine, it ends up at about 30-50MB.)

My two test machines have 640MB and 1GB RAM, so I don't think that's an
issue.  Given that xsltproc can execute an identity transform on the flattened
file in a few seconds, a very uneducated guess is that it is simply much
less efficient at constructing the document from fragments in XInclude files
than when it can just read the entire document in as a stream.

I did some investigation into Jeni's suggestion of using a SAX-based transform
to resolve the XIncludes.  I think this could be workable: use that transform
as a front end, piping the resulting flattened document into xsltproc to
perform the other transforms.

I'm getting somewhat mixed results here.  I discovered Matt Sergeant's
XML::Filter::XInclude Perl module and tried that.  At first, it didn't
work at all; then I discovered that my input document was specifying
a namespace of xmlns:xi="http://www.w3.org/2003/XInclude" and the module
wants to see xmlns:xi="http://www.w3.org/2001/XInclude" instead.

(Digression: I think I'm confused about which namespace URI to use here.
http://www.w3.org/2003/XInclude indicates that the 2003 form is deprecated
and that the 2001 form should be used instead.  On the other hand, the
source code for libxml2 recognizes both, but refers to the 2001 form as
the deprecated one.  Hmm...)
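
For anyone trying the same approach, a minimal driver for the module is
a sketch along these lines (it just follows the usual XML::SAX wiring;
the file name is a placeholder):

#!/usr/bin/perl
# Sketch: run XML::Filter::XInclude between a SAX parser and a writer
# so the flattened document comes out on stdout.
use strict;
use warnings;
use XML::SAX::ParserFactory;
use XML::Filter::XInclude;
use XML::SAX::Writer;

my $writer = XML::SAX::Writer->new(Output => \*STDOUT);
my $filter = XML::Filter::XInclude->new(Handler => $writer);
my $parser = XML::SAX::ParserFactory->parser(Handler => $filter);
$parser->parse_uri($ARGV[0] || 'book.xml');

The flattened output could then be piped into xsltproc (giving "-" as
the input file so it reads from standard input) to do the real
transforms.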

Once I changed my input document to use the 2001 form that XML::Filter::XInclude
wants to see, it worked partially.  Some investigation revealed that it fails
when the document contains two successive XInclude elements.  That is,
this can work:


<x><xi:include ... /></x>
<x><xi:include ... /></x>

But this will fail:

<xi:include ... />
<xi:include ... />
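
Or, as a minimal complete document of the failing shape (root element
and namespace declaration added for completeness; the file names are
just placeholders):

<?xml version="1.0"?>
<doc xmlns:xi="http://www.w3.org/2001/XInclude">
  <xi:include href="part1.xml"/>
  <xi:include href="part2.xml"/>
</doc>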

A quick look at the module source convinced me that I don't understand
what to patch to make it work. :-)

I also ran across a simple Perl XIncluder by Kip Hampton at:
http://www.xml.com/pub/a/2001/10/10/sax-filters.html

This one shows some promise.  I notice a few quirks
here, as well, but perhaps I am on the road to success.

Thanks again.

