[xsl] [Summary] An efficient XSLT program that searches a large XML document for all occurrences of a string

Subject: [xsl] [Summary] An efficient XSLT program that searches a large XML document for all occurrences of a string
From: "Roger L Costello costello@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 3 May 2024 07:48:36 -0000

Hi Folks,

Thank you for your excellent responses!

I decided to go the XSLT streaming route. Below is my summary of how to do
it.

First, the problem statement:

I have an XSLT program that locates all leaf elements which have the string
value 'DNKK'. My program outputs the element and the name of its parent:

    <xsl:template match="/">
        <results>
            <xsl:for-each select="//*[not(*)][. eq 'DNKK']">
                <result>
                    <xsl:sequence select="."/>
                    <parent><xsl:value-of select="name(..)"/></parent>
                </result>
            </xsl:for-each>
        </results>
    </xsl:template>

The input XML document is large, nearly 5GB.

When I run my XSLT program, SAXON throws an OutOfMemoryError.

The usual remedy for an OutOfMemoryError is to increase the Java heap size
(-Xmx) when invoking Java. I tried that, allocating as much as 10GB of heap
space, and I still got the OutOfMemoryError.

So I went with XSLT streaming. Here's how to do it.

The SAXON documentation [1] says this: "Using the xsl:source-document
instruction, with the attribute streamable="yes". Here the source document is
identified within the stylesheet itself. Typically such a stylesheet will have
a named template as its entry point, and will not have any principal source
document supplied externally."

So, I created an XSLT document containing just a named template. Michael Kay
showed how to reformulate my XSLT code to be streamable. Here's the complete
XSLT program:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

    <xsl:template name="test">
        <xsl:source-document href="Input.xml" streamable="yes">
            <results>
                <xsl:for-each select="//text()[. eq 'DNKK']">
                    <result>
                        <xsl:element name="{name(..)}">DNKK</xsl:element>
                        <parent><xsl:value-of select="name(../..)"/></parent>
                    </result>
                </xsl:for-each>
            </results>
        </xsl:source-document>
    </xsl:template>

</xsl:stylesheet>

I saved that to the file get-records.xsl
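
To illustrate on a small scale, here is a sketch using a tiny hypothetical
Input.xml. The element names airports, airport, icao, and name are invented
for this example and are not taken from my real data:

    <airports>
        <airport>
            <icao>DNKK</icao>
            <name>Kano</name>
        </airport>
    </airports>

For that input, the stylesheet should produce output equivalent to the
following (ignoring the XML declaration and whitespace):

    <results>
        <result>
            <icao>DNKK</icao>
            <parent>airport</parent>
        </result>
    </results>

Note that xsl:element rebuilds the element from its name only, so any
attributes on the original element are not carried over to the output.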

Then I opened a command window and typed this:

java -classpath %CLASSPATH% net.sf.saxon.Transform -it:test -xsl:get-records.xsl -o:results.xml

The SAXON documentation [2] says this about the -it (initial template) flag:

-it[:template-name] Selects the initial named template to be executed.

I ran it and it worked beautifully!
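
A possible variation, offered as an untested sketch: if you want to search for
different strings without editing the stylesheet, the search string could be
declared as a stylesheet parameter. The parameter name "search" is an
assumption made up for this example; the default value reproduces the run
above.

    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    version="3.0">

        <!-- Hypothetical parameter; defaults to the original search string -->
        <xsl:param name="search" select="'DNKK'"/>

        <xsl:template name="test">
            <xsl:source-document href="Input.xml" streamable="yes">
                <results>
                    <!-- Same streamable pattern, comparing against $search -->
                    <xsl:for-each select="//text()[. eq $search]">
                        <result>
                            <xsl:element name="{name(..)}">
                                <xsl:value-of select="$search"/>
                            </xsl:element>
                            <parent><xsl:value-of select="name(../..)"/></parent>
                        </result>
                    </xsl:for-each>
                </results>
            </xsl:source-document>
        </xsl:template>

    </xsl:stylesheet>

With this version, the value would be supplied on the command line as a
name=value pair after the options, for example by appending search=DNKK to the
command shown earlier.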

/Roger

[1] https://www.saxonica.com/html/documentation10/sourcedocs/streaming/xslt-streaming.html
[2] https://www.saxonica.com/documentation12/index.html#!using-xsl/commandline
