Subject: Re: [xsl] Tree Comparing Algorithm
From: "Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 4 Feb 2020 00:49:31 -0000

I haven't studied it in close detail, but I strongly suspect that the initial processing of the input files is streamed, but at some stage in the processing pipeline everything ends up in memory.

Martin's solution uses arrays, and array processing in Saxon is generally not pipelined in the way that sequence processing (normally) is. For example, operations such as filtering and mapping on sequences are generally pipelined (whether or not the input is streamed), while the equivalent operations on arrays will materialise the array in memory.

For example, if you do (child::*[@x]/node-name() = $Q), then whether or not the child nodes are held in memory or streamed, Saxon will not build the intermediate sequence child::*[@x] in memory; it will effectively do something like

    for each child::* if exists(@x) if (first) if (node-name() = $Q) return true else return false;

There's no equivalent of this for array processing right now. A construct like

    [child::*/node-name()]?1 = X

will materialize the array in memory, even if child::* is streamed; and this doesn't count as a violation of streamability, because we're not holding source nodes in memory, we're holding intermediate computed results in memory.

There's no intrinsic reason for not pipelining operations on arrays, other than the lesson I learnt many years ago as an undergraduate computer science student: when you're doing optimisation, focus your efforts on the constructs that are encountered most frequently. Today everyone is using sequences, and not many people are using arrays.

Michael Kay
Saxonica

> On 3 Feb 2020, at 20:39, Martin Honnen martin.honnen@xxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> On 03.02.2020 21:10, Vasu Chakkera vasucv@xxxxxxxxx wrote:
>> Thanks both. Martin's solution sort of worked, but it only gave me 21
>> children, but I had around 21000 nodes in the xml. I am not sure to what
>> depth the comparison is happening.
>
> It was solely an attempt to try to find some way to recursively process
> two documents with streaming at the same time, not an attempt to
> implement your particular algorithm.
>
> I have tested my code now on a large files, it seems to process lots of
> nodes judging by the output to the console and the length of processing,
> but it doesn't seem to use streaming when I look at the memory
> consumption (600MB of input needed more than 2GB of memory), even if
> Saxon nowhere shows any -t message that input trees were built.
>
> Michael's comment on the way streaming is implemented in Saxon suggests
> that the whole attempt is futile, even if the code somehow manages to
> get by the streamability analysis.
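
The difference described above between pipelined sequence processing and materialised array processing can be sketched by analogy in Python, using a generator for the pipelined case and a list for the materialised one. This is only an illustration of the general idea, not Saxon's implementation: the names source, pipelined_match and materialised_first_match, and the dict-based stand-in for element nodes, are all hypothetical.

```python
from typing import Dict, Iterable, Iterator, List


def source(n: int, log: List[int]) -> Iterator[Dict[str, object]]:
    """Hypothetical stand-in for a streamed child::* axis.

    Each item that is actually read is recorded in `log`, so the caller
    can see how much of the input was consumed.
    """
    for i in range(n):
        log.append(i)
        # Even-numbered items play the role of elements that have an @x attribute.
        yield {"name": f"e{i}", "x": i % 2 == 0}


def pipelined_match(items: Iterable[Dict[str, object]], q: str) -> bool:
    """Analogue of child::*[@x]/node-name() = $Q on a sequence.

    The filter, the mapping and the comparison are fused into one lazy
    pass; no intermediate collection is built and the pass stops at the
    first match.
    """
    return any(item["name"] == q for item in items if item["x"])


def materialised_first_match(items: Iterable[Dict[str, object]], q: str) -> bool:
    """Analogue of [child::*/node-name()]?1 = X on an array.

    The whole intermediate result is built in memory before the first
    member is looked up, even though only that member is needed.
    """
    names = [item["name"] for item in items]  # reads the entire input
    return names[0] == q
```

Calling pipelined_match(source(1000, log), "e2") reads only the first three items before returning, whereas materialised_first_match(source(1000, log), "e0") reads all 1000 items even though the answer depends only on the first. Note that in the materialised case it is computed names that are held in memory, not the source items themselves, which mirrors the point that this does not count as a violation of streamability.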