Re: [xsl] Tree Comparing Algorithm

Subject: Re: [xsl] Tree Comparing Algorithm
From: "Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 4 Feb 2020 00:49:31 -0000
I haven't studied it in close detail, but I strongly suspect that the initial
processing of the input files is streamed, but at some stage in the processing
pipeline everything ends up in memory.

Martin's solution uses arrays, and array processing in Saxon is generally not
pipelined in the way that sequence processing (normally) is. For example,
operations such as filtering and mapping on sequences are generally pipelined
(whether or not the input is streamed), while the equivalent operations on
arrays will materialise the array in memory.

For example, if you do (child::*[@x][1]/node-name() = $Q), then whether or not
the child nodes are held in memory or streamed, Saxon will not build the
intermediate sequence child::*[@x] in memory; it will effectively do something
like

for each child::*
   if exists(@x)
      if (first)
          if (node-name() = $Q)
              return true
      else return false;

There's no equivalent of this for array processing right now. A construct like
[child::*/node-name()]?1 = X will materialize the array in memory, even if
child::* is streamed; and this doesn't count as a violation of streamability,
because we're not holding source nodes in memory, we're holding intermediate
computed results in memory.

There's no intrinsic reason for not pipelining operations on arrays, other
than the lesson I learnt many years ago as an undergraduate computer science
student: when you're doing optimisation, focus your efforts on the constructs
that are encountered most frequently. Today everyone is using sequences, and
not many people are using arrays.

Michael Kay
Saxonica

> On 3 Feb 2020, at 20:39, Martin Honnen martin.honnen@xxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On 03.02.2020 21:10, Vasu Chakkera vasucv@xxxxxxxxx wrote:
>> Thanks both. Martin's solution sort of worked, but it only gave me 21
>> children, but I had around 21000 nodes in the xml. I am not sure to what
>> depth the comparison is happening.
>
> It was solely an attempt to try to find some way to recursively process
> two documents with streaming at the same time, not an attempt to
> implement your particular algorithm.
>
> I have tested my code now on a large files, it seems to process lots of
> nodes judging by the output to the console and the length of processing,
> but it doesn't seem to use streaming when I look at the memory
> consumption (600MB  of input needed more than 2GB of memory), even if
> Saxon nowhere shows any -t message that input trees were built.
>
> Michael's comment on the way streaming is implemented in Saxon suggests
> that the whole attempt is futile, even if the code somehow manages to
> get by the streamability analysis.

Current Thread