Re: [xsl] Seek ways to make my streaming XSLT code run faster (My streaming XSLT program has been running 12 hours and is only a quarter of the way to completion)

Subject: Re: [xsl] Seek ways to make my streaming XSLT code run faster (My streaming XSLT program has been running 12 hours and is only a quarter of the way to completion)
From: "David Birnbaum djbpitt@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 10 Aug 2025 18:19:54 -0000
Dear All,
The OP didn't mention whether the task is imagined as a one-off or as,
say, a service. If it's a one-off, front-loading the indexing by using
an XQuery database doesn't sound like an automatic efficiency win (over
indexing within XSLT; see below), because the code builds an index once
and then uses it once. At the same time, thinking in terms of XQuery
database indexing keeps the focus on indexing, which can often pay off
(massively) with nested loops. Streaming helps (potentially) with large
memory demands, but looping over the same data repeatedly may not be
that sort of task.
Whether XQuery database indexing is more efficient than indexing with
<xsl:key> in the case of a one-off is less clear. 
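For the one-off case, indexing with <xsl:key> can be sketched in a few
lines. This is only a sketch, not the OP's actual code: the
/records/record and VOR_identifier names are borrowed from the quoted
messages below, and the $identifiers parameter is a hypothetical way of
supplying the 1900 values.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                version="3.0">

  <!-- One pass over the 5 million records builds the index;
       each of the 1900 lookups is then effectively constant time. -->
  <xsl:key name="rec-by-id" match="record" use="VOR_identifier"/>

  <!-- Hypothetical: the 1900 identifiers supplied as a parameter. -->
  <xsl:param name="identifiers" as="xs:string*" select="()"/>

  <xsl:template match="/">
    <xsl:variable name="doc" select="."/>
    <hits>
      <xsl:for-each select="$identifiers">
        <!-- key() consults the index instead of rescanning. -->
        <id value="{.}">
          <xsl:copy-of select="key('rec-by-id', ., $doc)"/>
        </id>
      </xsl:for-each>
    </hits>
  </xsl:template>
</xsl:stylesheet>
```

The point is that the records are scanned once to build the index,
after which each identifier lookup is (roughly) constant time instead
of a fresh 5-million-record scan.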
Best,
David

  On Aug 10, 2025, at 11:30 AM, Wendell Piez wapiez@xxxxxxxxxxxxxxx
  <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

  Hello,
  To restate what Liam just said, more blatantly: this is an indexing
  problem. Streaming mode is getting in the way. What Liam is
  suggesting is a two-step solution: use streaming to efficiently
  produce a document that is easier to index, and then index that
  document. (Right, Liam?)
  That can be an effective approach to managing the complexity as well
  as the scale, but the basic fact is that this is still indexing. This
  is why it could make more sense to load it into an XQuery engine to
  benefit from its front-loading of indexing into the XML.
  But it could still be done straightforwardly in XSLT. I suggest Roger
  think hard about Liam's suggestion and break it down like this:
  1. Write XSLT to provide the result you want for a single value (the
  'ABC' in the example) using xsl:key - no streaming. (If scale is an
  impediment at this point, use a reduced sample and test over that.)
  2. Assess efficiency - is there a way to streamline the source data
  to make this more efficient and faster, by simplifying the XML source
  and hence the XSLT? Design a source data format optimized for step 1.
  Demonstrate the improvement with a new XSLT.
  3. Once this works, turn to a new/different XSLT that can produce
  this optimal source from your current (full-size) data set. This
  XSLT might well use streaming. Streaming won't make it faster but it
  will reduce memory use so it doesn't bomb out. Your output should be
  smaller (maybe much smaller) than your input.
  4. Produce this optimal source, test, and return to step 2 as
  necessary - except with your new XSLTs, both the 'digester' and the
  'indexer'.
  5. Scale up to your 1900.
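The streaming 'digester' of steps 3 and 4 might look roughly like the
following. Again a sketch only, assuming the same /records/record and
VOR_identifier shape as elsewhere in the thread; a real reduction would
keep whatever fields step 2 identified as necessary.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

  <!-- Stream the full-size input, so memory use stays flat. -->
  <xsl:mode streamable="yes" on-no-match="shallow-skip"/>

  <xsl:template match="/records">
    <records>
      <xsl:apply-templates select="record"/>
    </records>
  </xsl:template>

  <!-- Reduce each record to the one field the indexing step needs;
       the output should be a fraction of the input's size. -->
  <xsl:template match="record">
    <rec id="{VOR_identifier}"/>
  </xsl:template>
</xsl:stylesheet>
```

The non-streaming indexing stylesheet from step 1 then runs against
this much smaller output.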
  Don't think about iteration, loops, or streams. Just think about how
  to make it easy for XPath to see what it is doing at any given point.
  The two steps (first, homogenize, then index) could be done in a
  single XSLT with phases, or in XProc or the pipelining framework of
  your choice.
  Of course there's a reasonable chance I am oversimplifying and
  getting it wrong - but maybe not by much.
  Cheers, Wendell

  On Sat, Aug 9, 2025 at 7:32 PM Liam R. E. Quin
  liam@xxxxxxxxxxxxxxxx <xsl-list-service@lists.mulberrytech.com>
  wrote:

    On Sat, 2025-08-09 at 23:00 +0000, Liam R. E. Quin
    liam@xxxxxxxxxxxxxxxx wrote:
    > On Sat, 2025-08-09 at 22:25 +0000, Roger L Costello
    > costello@xxxxxxxxx
    > >
    > > I want to iterate over all 1900 identifiers and, for each of
    > > them, iterate over all 5 million records to see which records
    > > contain the identifier. There is a loop within a loop:
    > >
    > > For each of the 1900 identifiers do
    > >   For each of the 5 million records do
    > >     Check record against identifier
    >
    > Outside streaming, you could
    >   apply-templates select="/records/record"
    > and then have a template
    > match="VOR_identifier[ancestor::record[
    >   not(Airport_SID_Primary_Records)
    > ]]"
    >
    > and then process the record in a different mode?

    To be clear, you can't do that in streaming mode. If the document
    is too large and you need to stream, you could have a template that
    matches record, takes a grounded snapshot of it, and processes the
    snapshot in a different (non-streaming) mode.
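A sketch of that grounded-snapshot pattern, as a fragment to drop into
an XSLT 3.0 streaming stylesheet (element names as in the earlier
messages; what you emit from the grounded mode is up to you):

```xml
<xsl:mode streamable="yes" on-no-match="shallow-skip"/>
<xsl:mode name="grounded" streamable="no"/>

<!-- Streaming pass: visit each record once, in document order. -->
<xsl:template match="record">
  <!-- copy-of() grounds the record as an ordinary in-memory tree,
       so the grounded mode can use unrestricted XPath on it. -->
  <xsl:apply-templates select="copy-of()" mode="grounded"/>
</xsl:template>

<!-- Non-streaming pass over one small grounded record at a time. -->
<xsl:template match="record[not(Airport_SID_Primary_Records)]"
              mode="grounded">
  <xsl:copy-of select="VOR_identifier"/>
</xsl:template>
```

Only one record's copy is in memory at a time, so the memory ceiling
stays low even over the 5 million records.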

    Any time you start thinking in terms of loops, I think it's time
    to take a step back, especially in streaming, and ask whether you
    can use template match expressions to do more of the work, and
    also whether you can work back from the result a bit more.

    >
    > Otherwise yes, XQuery.
    >

    --
    Liam Quin, https://www.delightfulcomputing.com/
    Available for XML/Document/Information Architecture/XSLT/
    XSL/XQuery/Web/Text Processing/A11Y training, work & consulting.
    Barefoot Web-slave, antique illustrations:
    http://www.fromoldbooks.org

  --
  ...Wendell Piez... ...wendell -at- nist -dot- gov...
  ...wendellpiez.com... ...pellucidliterature.org...
  ...pausepress.org...
  ...github.com/wendellpiez... ...gitlab.coko.foundation/wendell...
  XSL-List info and archive | EasyUnsubscribe (by email)

