Re: [xsl] An efficient XSLT program that searches a large XML document for all occurrences of a string?

Subject: Re: [xsl] An efficient XSLT program that searches a large XML document for all occurrences of a string?
From: "Dimitre Novatchev dnovatchev@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 2 May 2024 23:25:06 -0000
While Martin, Michael Kay, and others have provided valuable advice on
streaming, this is probably a good moment to ask why, and whether, such a
huge document should be created at all and, presumably, continuously
augmented with new data.

Practice has shown that horizontal scaling is much easier to implement
than vertical scaling, and that the latter is quite limited. I believe that
if a large XML document cannot be split (horizontally) into mutually
non-overlapping subtrees that together make up the whole document, then
most likely the document is unnecessarily complex.

Imagine having all the data about the 100B+ stars in the Milky Way put into
a single XML document...

If I were involved in any activity that collects and structures such large
quantities of data, I would plan to split this data and keep it in
smaller, manageable chunks wherever possible.
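
To illustrate what searching such chunks might look like, here is a minimal
sketch, not a tested solution. It assumes the data has already been split
into separate well-formed files in a folder called "chunks" next to the
stylesheet (a hypothetical name), and that no leaf element needs context
from another chunk. The '?select=*.xml' suffix is Saxon's collection-URI
convention rather than part of the XSLT standard, so another processor may
need a different way to list the files.

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: chunk files sit in a 'chunks' folder beside this
       stylesheet. The folder name is hypothetical. -->
  <xsl:param name="chunk-dir" select="resolve-uri('chunks/')"/>

  <xsl:template name="xsl:initial-template">
    <results>
      <!-- List the chunk URIs; '?select=*.xml' is Saxon-specific. -->
      <xsl:for-each select="uri-collection($chunk-dir || '?select=*.xml')">
        <!-- Read one chunk at a time (non-streamed by default). -->
        <xsl:source-document href="{.}">
          <!-- Same search as in Roger's original program. -->
          <xsl:for-each select="//*[not(*)][. eq 'DNKK']">
            <result>
              <xsl:sequence select="."/>
              <parent><xsl:value-of select="name(..)"/></parent>
            </result>
          </xsl:for-each>
        </xsl:source-document>
      </xsl:for-each>
    </results>
  </xsl:template>

</xsl:stylesheet>
```

With Saxon this would be started from the named initial template (the -it
option) instead of from a source document. Only one chunk's tree has to be
built at any moment, though whether each tree is released promptly
afterwards is up to the processor.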

Thanks,
Dimitre

On Thu, May 2, 2024 at 6:07 AM Roger L Costello costello@xxxxxxxxx <
xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> Hi Folks,
>
> I have an XSLT program that locates all leaf elements which have the
> string value 'DNKK'. My program outputs the element and the name of its
> parent:
>
>     <xsl:template match="/">
>         <results>
>             <xsl:for-each select="//*[not(*)][. eq 'DNKK']">
>                 <result>
>                     <xsl:sequence select="."/>
>                     <parent><xsl:value-of select="name(..)"/></parent>
>                 </result>
>             </xsl:for-each>
>         </results>
>     </xsl:template>
>
> The input XML document is large, nearly 5GB.
>
> When I run my program SAXON throws the OutOfMemoryError message shown
> below.
>
> To solve the OutOfMemoryError I could add to my heap space (-Xmx) when I
> invoke Java. But I wonder if there is a way to write my program so that it is
> more efficient (i.e., doesn't require so much memory)?
>
> /Roger
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>         at java.base/java.util.Arrays.copyOf(Arrays.java:3841)
>         at net.sf.saxon.tree.util.FastStringBuffer.ensureCapacity(FastStringBuffer.java:575)
>         at net.sf.saxon.tree.tiny.CompressedWhitespace.uncompress(CompressedWhitespace.java:112)
>         at net.sf.saxon.tree.tiny.WhitespaceTextImpl.appendStringValue(WhitespaceTextImpl.java:82)
>         at net.sf.saxon.tree.tiny.TinyParentNodeImpl.getStringValueCS(TinyParentNodeImpl.java:99)
>         at net.sf.saxon.tree.tiny.TinyTree.getTypedValueOfElement(TinyTree.java:530)
>         at net.sf.saxon.tree.tiny.TinyElementImpl.atomize(TinyElementImpl.java:105)
>         at net.sf.saxon.expr.Atomizer.evaluateItem(Atomizer.java:384)
>         at net.sf.saxon.expr.Atomizer.evaluateItem(Atomizer.java:40)
>         at net.sf.saxon.expr.ValueComparison.effectiveBooleanValue(ValueComparison.java:347)
>         at com.saxonica.ee.bytecode.ByteCodeCandidate.effectiveBooleanValue(ByteCodeCandidate.java:132)
>         at net.sf.saxon.expr.FilterIterator$NonNumeric.matches(FilterIterator.java:177)
>         at net.sf.saxon.expr.FilterIterator.getNextMatchingItem(FilterIterator.java:76)
>         at net.sf.saxon.expr.FilterIterator.next(FilterIterator.java:62)
>         at net.sf.saxon.om.FocusTrackingIterator.next(FocusTrackingIterator.java:75)
>         at net.sf.saxon.expr.FilterIterator.getNextMatchingItem(FilterIterator.java:75)
>         at net.sf.saxon.expr.FilterIterator.next(FilterIterator.java:62)
>         at net.sf.saxon.om.FocusTrackingIterator.next(FocusTrackingIterator.java:75)
>         at net.sf.saxon.om.SequenceIterator.forEachOrFail(SequenceIterator.java:134)
>         at net.sf.saxon.expr.instruct.ForEach.processLeavingTail(ForEach.java:489)
>         at net.sf.saxon.expr.instruct.Instruction.process(Instruction.java:136)
>         at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:346)
>         at net.sf.saxon.expr.instruct.ElementCreator.processLeavingTail(ElementCreator.java:292)
>         at net.sf.saxon.expr.instruct.TemplateRule.applyLeavingTail(TemplateRule.java:374)
>         at net.sf.saxon.trans.Mode.applyTemplates(Mode.java:555)
>         at net.sf.saxon.trans.XsltController.applyTemplates(XsltController.java:659)
>         at net.sf.saxon.s9api.AbstractXsltTransformer.applyTemplatesToSource(AbstractXsltTransformer.java:360)
>         at net.sf.saxon.s9api.Xslt30Transformer.applyTemplates(Xslt30Transformer.java:285)
>         at net.sf.saxon.Transform.processFile(Transform.java:1313)
>         at net.sf.saxon.Transform.doTransform(Transform.java:853)
>         at net.sf.saxon.Transform.main(Transform.java:82)
