Re: [xsl] Tree Comparing Algorithm

Subject: Re: [xsl] Tree Comparing Algorithm
From: "Martin Honnen martin.honnen@xxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 3 Feb 2020 14:57:43 -0000
On 03.02.2020 at 14:47, Vasu Chakkera vasucv@xxxxxxxxx wrote:
> Hi All,
> I am planning to write an XSLT that compares XML trees using streaming.
> The XML trees look something like this:
>
> <root path="" mhash="">
>   <folder path="" mhash="">
>     <folder path="" mhash="">
>       <leaf path="" mhash="">
>       </leaf>
>     </folder>
>   </folder>
> </root>
>
> There will be two such XML files to compare. The two files are
> generated before and after moving a folder from source to destination.
> Source and destination could be on two different operating systems.
>
> This is essentially the serialized Merkle tree output of a folder
> structure. The idea is to run a Merkle tree comparator that picks out
> the nodes that do not match. The rules are as follows.
>
>  1. If the root node hashes of both trees match, then there is no
>     difference in the entire tree (because of how the Merkle tree is
>     generated).
>  2. If the root node hashes do not match, we go to the child container
>     and compare the hash of the child container in both XML files.
>     (The XML folder structure will be identical with respect to the
>     hash, but the folder path may be different because of the Linux
>     and Windows path conventions. Otherwise the folder structure is
>     meant to be the same.)
>  3. If the hashes of a folder in both trees are the same, the entire
>     subtree under that folder is ignored.
>  4. If the hashes of a folder in both trees are not the same, the
>     subtree is traversed further and step 3 is repeated.
>  5. The XSLT keeps writing out the nodes whose hashes do not match
>     in the source and target XML files.
>
>
> So at the end of the processing, a comparator tree should be
> serialized that contains the nodes that have a non-matching leaf node.
> Looking at the serialized tree, we can determine which files got
> messed up during the transfer from source to target.
>
>
>
> I am able to do this using non-streaming XSLT, but with streaming,
> since we need to stream two trees at a time and compare the
> nodes, I am not sure how to proceed.
> I am able to manipulate one XML document with streaming. I tried a few
> tricks, but did not get anywhere (I am not very comfortable copying
> my scribbled code here).
>
> I need streaming because the XML files may be big.
> If someone has done something similar, or can point me to an intelligent
> way to do this, I will be thankful.
>
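For reference, the quoted rules 1 to 5 might be sketched without
streaming roughly as follows. This is only a minimal sketch under
assumptions that are not in the original post: the file names file1.xml
and file2.xml, the function name mf:diff and the source-path/target-path
attributes in the result are made up for illustration, and corresponding
children are assumed to appear in the same order in both files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="#all"
    version="3.0">

    <xsl:output method="xml" indent="yes"/>

    <!-- Assumption: hypothetical locations of the two serialized trees -->
    <xsl:param name="source-uri" as="xs:string" select="'file1.xml'"/>
    <xsl:param name="target-uri" as="xs:string" select="'file2.xml'"/>

    <!-- Hypothetical helper: returns nothing when the two hashes match
         (rules 1 and 3); otherwise returns an element for the mismatch
         and compares the children pairwise by position (rules 2, 4, 5).
         A missing counterpart in the target counts as a mismatch. -->
    <xsl:function name="mf:diff" as="element()?">
        <xsl:param name="a" as="element()"/>
        <xsl:param name="b" as="element()?"/>
        <xsl:if test="not($a/@mhash = $b/@mhash)">
            <xsl:element name="{local-name($a)}">
                <xsl:attribute name="source-path" select="$a/@path"/>
                <xsl:attribute name="target-path" select="$b/@path"/>
                <xsl:for-each select="$a/*">
                    <xsl:variable name="pos" select="position()"/>
                    <xsl:sequence select="mf:diff(., $b/*[$pos])"/>
                </xsl:for-each>
            </xsl:element>
        </xsl:if>
    </xsl:function>

    <xsl:template name="xsl:initial-template">
        <xsl:sequence
            select="mf:diff(doc($source-uri)/*, doc($target-uri)/*)"/>
    </xsl:template>

</xsl:stylesheet>
```

Started with -it, this produces an empty result when the root hashes
match. It loads both documents fully into memory, so it does not address
the streaming requirement.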
I am not sure there is an option within the constraints of the XSLT 3
spec: `xsl:merge` allows you to process more than one merge-source with
streaming, but it always takes a snapshot of the selected items, so any
attempt to process your files recursively would take a snapshot at the
highest selected level and that way not really save memory.


And using two xsl:source-document instructions to process two documents
at the same time, with a function or template for the recursion, seems
difficult given the constraint that streamed nodes need to be passed in
as the first argument.

That made me think about trying to pass in an array in Saxon 9.9 EE,
using the saxon:stream extension function:


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:saxon="http://saxon.sf.net/"
    xmlns:mf="http://example.com/mf"
    exclude-result-prefixes="#all"
    version="3.0">

    <xsl:output method="text"/>

    <!-- Prints both node names, whether the names are equal and whether
         the name attributes are equal, then recurses pairwise into the
         child elements -->
    <xsl:function name="mf:compare" streamability="absorbing">
        <xsl:param name="pair" as="array(element())"/>
        <xsl:value-of select="$pair?1!node-name(),
            $pair?2!node-name(),
            $pair?1!node-name() = $pair?2!node-name(),
            $pair?1!@name = $pair?2!@name"/>
        <xsl:text>&#10;</xsl:text>
        <xsl:sequence select="for-each-pair($pair?1!*, $pair?2!*,
            function($el1, $el2) { mf:compare([$el1, $el2]) })"/>
    </xsl:function>

    <!-- Entry point: passes the two streamed root elements as one array -->
    <xsl:template name="xsl:initial-template">
        <xsl:sequence
            select="mf:compare([saxon:stream(doc('file1.xml')/*),
                saxon:stream(doc('file2.xml')/*)])"/>
    </xsl:template>

</xsl:stylesheet>


Saxon doesn't complain and reports that it is streaming; when run with
the options -t -it against two files, the output looks like this:

Streaming file:/SomePath/file1.xml
URIResolver.resolve href="file1.xml" base="file:/SomePath/sheet2.xsl"
Streaming input document file:/SomePath/file1.xml
Using parser com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser
Streaming file:/SomePath/file2.xml
URIResolver.resolve href="file2.xml" base="file:/SomePath/sheet2.xsl"
Streaming input document file:/SomePath/file2.xml
root root true false
folder folder true true
leaf leaf true true
folder folder true false
leaf leaf true true


I am not sure what will happen to the `saxon:stream` function in future
releases, or whether the whole approach is useful; I am also not sure
that it really processes both files recursively with streaming.
