Re: [xsl] Comparing documents: what of P is a subset of D?

Subject: Re: [xsl] Comparing documents: what of P is a subset of D?
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Fri, 28 Feb 2014 11:57:03 +0100
@Michael: your answer triggered a thought process that outlined the way to
a solution I'm able to implement. I don't know whether this is of any
interest to others, but it's a nice little exercise for a training course,
illustrating modes, keys, and a second input document.

Problem:
Given two XML files conforming to the same XML schema, find all leaf
nodes (text() and @*) in one document ("Patch") that have an identical
value at the same iXPath in the other document ("Data"), where an iXPath
is an XPath built from element and attribute names, with a predicate
[@_ix eq n] wherever @_ix occurs (in repeating elements).

Solution outline:
Process the Patch document, creating a set of nodes <p2v @path @value>
mapping iXPaths to values, with a key based on @path. Then, process
the Data document analogously, looking up iXPaths in the key and
comparing values, where found.

Below is the code, very likely not perfect ;-)

(Note that the output would be much more readable if an iXPath could
be truncated at a point where the subtree is identical in the defined
way.)

Thanks
W

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
	xmlns:xs="http://www.w3.org/2001/XMLSchema"
	xmlns:wl="http://members.inode.at/w.laun"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="text" />
<xsl:strip-space elements = '*' />

<xsl:param name="patchfile" as="xs:string"/>
<xsl:variable name="patch" select="document($patchfile)" />

<xsl:key name = "path2value" match = "p2v" use = "@path"/>

<!-- pass over patch file -->

<xsl:variable name="map" as="document-node()">
  <xsl:document>
    <map>
    <xsl:for-each select = "$patch">
      <xsl:apply-templates select = "*" mode="indexing">
        <xsl:with-param name = "path" select = "''" />
      </xsl:apply-templates>
    </xsl:for-each>
    </map>
  </xsl:document>
</xsl:variable>

<xsl:template match="*" mode="indexing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:apply-templates select = "*|@*|text()" mode="indexing">
    <xsl:with-param name = "path"
                    select = "concat( $path, '/', local-name() )"/>
  </xsl:apply-templates>
</xsl:template>

<xsl:template match="*[@_ix]" mode="indexing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:apply-templates select = "*|@*|text()" mode="indexing">
    <xsl:with-param name = "path"
                    select = "concat( $path, '/', local-name(), '[', @_ix, ']' )"/>
  </xsl:apply-templates>
</xsl:template>

<xsl:template match="@*" mode="indexing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:variable name = "fp" select = "concat( $path, '/', local-name() )"/>
  <p2v path = "{$fp}" value = "{.}"/>
</xsl:template>

<xsl:template match="@_ix" mode="indexing"/>

<xsl:template match="text()" mode="indexing">
  <xsl:param name = "path" as = "xs:string" />
  <p2v path = "{$path}" value = "{.}"/>
</xsl:template>

<!-- Pass over DB data file -->

<xsl:template match = "/">
  <xsl:apply-templates mode="comparing">
    <xsl:with-param name = "path" select = "''" />
  </xsl:apply-templates>
</xsl:template>

<xsl:template match="*" mode="comparing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:apply-templates select = "*|@*|text()" mode="comparing">
    <xsl:with-param name = "path"
                    select = "concat( $path, '/', local-name() )"/>
  </xsl:apply-templates>
</xsl:template>

<xsl:template match="*[@_ix]" mode="comparing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:apply-templates select = "*|@*|text()" mode="comparing">
    <xsl:with-param name = "path"
                    select = "concat( $path, '/', local-name(), '[', @_ix, ']' )"/>
  </xsl:apply-templates>
</xsl:template>

<xsl:template match="@*" mode="comparing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:variable name = "fp" select = "concat( $path, '/', local-name() )"/>
  <xsl:variable name = "pval"
                select = "key( 'path2value', $fp, $map/map )/@value"/>
  <xsl:if test = "$pval eq .">
    <xsl:value-of select = "concat( $fp, ' ... ', $pval)"/><xsl:text>
</xsl:text>
  </xsl:if>
</xsl:template>

<xsl:template match="@_ix" mode="comparing"/>

<xsl:template match="text()" mode="comparing">
  <xsl:param name = "path" as = "xs:string" />
  <xsl:variable name = "pval"
                select = "key( 'path2value', $path, $map/map )/@value"/>
  <xsl:if test = "$pval eq .">
    <xsl:value-of select = "concat( $path, ' ... ', $pval)"/><xsl:text>
</xsl:text>
  </xsl:if>
</xsl:template>

</xsl:stylesheet>
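
For anyone who wants to cross-check the two-pass idea without an XSLT
processor, here is a minimal sketch in Python. It is not a translation of
the stylesheet: the function names (leaves, common_leaves) are made up
for illustration, the sample documents are the D and P examples from this
thread with the elided relb/relc elements left out, and it assumes
documents without namespaces or mixed content. It builds a path-to-value
dictionary from the Patch document (playing the role of the key) and then
walks the Data document, reporting every leaf whose path and value are
both found there.

```python
import xml.etree.ElementTree as ET

# Sample documents adapted from the thread; relb/relc are omitted.
DATA = """<cca>
 <rela _ix='0' fa='0' fb='1'>
    <fc _ix='1' fc_fa='X1' fc_fb='1'/>
    <fc _ix='2' fc_fa='X2' fc_fb='2'/>
 </rela>
 <rela _ix='1' fa='10' fb='11'>
    <fc _ix='1' fc_fa='Y1' fc_fb='11'/>
    <fc _ix='2' fc_fa='Y2' fc_fb='12'/>
 </rela>
 <rela _ix='5' fa='50' fb='51'>
    <fc _ix='1' fc_fa='A1' fc_fb='51'/>
    <fc _ix='2' fc_fa='A2' fc_fb='52'/>
 </rela>
</cca>"""

PATCH = """<cca>
 <rela _ix='1' fa='10'>
    <fc _ix='1' fc_fa='Y1' fc_fb='99'/>
 </rela>
 <rela _ix='5' fa='50' fb='51'>
    <fc _ix='1' fc_fb='51' fc_fc='123'/>
    <fc _ix='2' fc_fa='A2' fc_fb='52' fc_fc='456'/>
 </rela>
</cca>"""


def leaves(elem, path=""):
    """Yield (iXPath, value) for each attribute (except _ix) and each
    non-blank leading text node, in document order."""
    step = path + "/" + elem.tag
    if "_ix" in elem.attrib:
        # Repeating elements are identified by @_ix, not by position.
        step += "[" + elem.attrib["_ix"] + "]"
    for name, value in elem.attrib.items():
        if name != "_ix":
            yield step + "/" + name, value
    # Simplification: only the text before the first child is considered.
    if elem.text and elem.text.strip():
        yield step, elem.text
    for child in elem:
        yield from leaves(child, step)


def common_leaves(data_xml, patch_xml):
    """Return (iXPath, value) pairs of Data that also occur in Patch."""
    # Pass 1: index the Patch document by iXPath.
    patch_map = dict(leaves(ET.fromstring(patch_xml)))
    # Pass 2: walk Data and keep leaves with the same path and value.
    return [(p, v) for p, v in leaves(ET.fromstring(data_xml))
            if patch_map.get(p) == v]


for path, value in common_leaves(DATA, PATCH):
    print(path, "...", value)
```

On the sample data this prints seven lines, in the document order of the
Data document, using the same "path ... value" layout as the stylesheet.
The dictionary lookup makes the comparison roughly linear in the size of
the two documents.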



On 27/02/2014, Michael Kay <mike@xxxxxxxxxxxx> wrote:
> I'm not sure I've completely understood your "equality" relation that
> underpins the intersection. Perhaps it's based on equality of the function
>
> string-join(ancestor-or-self::*/@_ix, '|')
>
> let's call this function $f, and we can use this as a parameter to the rest
> of the solution.
>
> we then need to do
>
> doc('d.xml')//fc[some $e in doc('p.xml') satisfies $f($e) eq $f(.)] !
> path(.)
>
> where path(.) is a function you can write to display the path to the
> selected fc element.
>
> The only remaining problem is that this is O(n*m) where n and m are the
> sizes of D and P. For a more efficient solution, define a key on P.XML that
> indexes each element on the value of the function $f, and replace the
> predicate by a call on key().
>
> The above uses XPath 3.0, but it can probably be expressed in XPath 2.0
> easily enough at the cost of hard-coding the equality function.
>
> Michael Kay
> Saxonica
>
>
> On 27 Feb 2014, at 10:25, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
>
>> <cca><!-- a D XML -->
>>  <rela _ix='0' fa='0' fb='1'>
>>     <fc _ix='1' fc_fa='X1' fc_fb='1'/>
>>     <fc _ix='2' fc_fa='X2' fc_fb='2'/>
>>  </rela>
>>  <rela _ix='1' fa='10' fb='11'>
>>     <fc _ix='1' fc_fa='Y1' fc_fb='11'/>
>>     <fc _ix='2' fc_fa='Y2' fc_fb='12'/>
>>  </rela>
>>  <rela _ix='5' fa='50' fb='51'>
>>     <fc _ix='1' fc_fa='A1' fc_fb='51'/>
>>     <fc _ix='2' fc_fa='A2' fc_fb='52'/>
>>  </rela>
>>  <relb>...</relb>
>>  <relc>...</relc>
>> </cca>
>>
>> <cca><!-- a P XML -->
>>  <rela _ix='1' fa='10'>
>>     <fc _ix='1' fc_fa='Y1' fc_fb='99'/>
>>  </rela>
>>  <rela _ix='5' fa='50' fb='51'>
>>     <fc _ix='1'                 fc_fb='51' fc_fc='123'/>
>>     <fc _ix='2' fc_fa='A2' fc_fb='52' fc_fc='456'/>
>>  </rela>
>> </cca>
>>
>> Expected output:
>>
>> /cca/rela(1)/fa   10
>> /cca/rela(1)/fc(1)/fc_fa   Y1
>> /cca/rela(5)/fa   50
>> /cca/rela(5)/fb   51
>> /cca/rela(5)/fc(1)/fc_fb   51
>> /cca/rela(5)/fc(2)/fc_fa   A2
>> /cca/rela(5)/fc(2)/fc_fb   52
>>
>> Note that parentheses enclose values of @_ix.
>>
>> -W
>>
>> On 27/02/2014, Michael Kay <mike@xxxxxxxxxxxx> wrote:
>>> It would be easier to understand the problem with some example data.
>>>
>>> Michael Kay
>>> Saxonica
>>>
>>> On 27 Feb 2014, at 08:05, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
>>>
>>>> The data model for a set of similarly (but not identically) built XML
>>>> documents is: a collection of arrays of records, which may contain
>>>> (recursively) arrays, records and scalars. (The terms "array" and
>>>> "record" are used in their "classic" meaning as, e.g., in Pascal.)
>>>> Document structures are fairly stable, but they do change over time.
>>>> Array elements are identified (indexed) by @_ix, not by position.
>>>> Record fields can be elements or attributes (when they are scalar).
>>>> Order is undefined, since XPaths plus @_ix values pinpoint each node.
>>>>
>>>> One XML document D contains a full population for such a data set
>>>> (O(1MB)). A second XML document P contains "patches", i.e., each node
>>>> appearing in P is expected to be in D as well.
>>>>
>>>> If S(P) is the sequence of nodes (annotated with their XPaths) in P
>>>> and S(D) the one with nodes from D, how can I determine S(P) intersect
>>>> S(D) (except all @_ix, whose values are bound to be identical)? Of
>>>> course, I don't want the common set of *data items* - I want the XML
>>>> paths of those common data items.
>>>>
>>>> A solution (in XSLT 2.0) should not need individual adaption for each
>>>> kind of data set.
>>>>
>>>> I'm confident that I can create text files for D and P containing one
>>>> line <path> <value> for each node and run diff (after sort).
>>>>
>>>> Any better ideas?
>>>>
>>>> Cheers
>>>> Wolfgang
