Re: [xsl] Comparing documents: what of P is a subset of D?

Subject: Re: [xsl] Comparing documents: what of P is a subset of D?
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Thu, 27 Feb 2014 14:32:31 +0000
I'm not sure I've completely understood your "equality" relation that
underpins the intersection. Perhaps it's based on equality of the function

string-join(ancestor-or-self::*/@_ix, '|')

Let's call this function $f; it can then be passed as a parameter to the rest
of the solution.

We then need to evaluate

doc('d.xml')//fc[some $e in doc('p.xml')//* satisfies $f($e) eq $f(.)] ! path(.)

where path(.) is a function you can write to display the path to the selected
fc element.
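
As a minimal sketch, the pieces above could be bound together in a single
XPath 3.0 expression. It assumes that every element of p.xml is a candidate
for $e, and it uses a hypothetical inline function $path (held in a variable
to avoid confusion with the built-in fn:path) that prints @_ix values in
parentheses, as in the expected output quoted below:

```xpath
let $f := function($n as node()) as xs:string {
            (: '|'-joined @_ix values from the outermost ancestor down to $n :)
            string-join($n/ancestor-or-self::*/@_ix, '|')
          },
    $path := function($n as element()) as xs:string {
            (: hypothetical display helper: /name(ix)/name(ix)/... :)
            string-join(
              ('', $n/ancestor-or-self::*/concat(name(),
                     if (@_ix) then concat('(', @_ix, ')') else '')),
              '/')
          }
return
  doc('d.xml')//fc[some $e in doc('p.xml')//* satisfies $f($e) eq $f(.)]
    ! $path(.)
```

For the sample data this should select the fc elements at rela(1)/fc(1),
rela(5)/fc(1) and rela(5)/fc(2). Reporting the individual attributes that
agree would need a further step.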

The only remaining problem is that this is O(n*m), where n and m are the sizes
of D and P. For a more efficient solution, define a key on p.xml that indexes
each element on the value of the function $f, and replace the predicate by a
call on key().
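
A sketch of the keyed variant in XSLT 2.0 follows. The key name by-ix is an
assumption made for illustration, the equality function is written out in
the use attribute because a key cannot take $f as a parameter, and the
stylesheet is assumed to run with d.xml as its source document and p.xml
next to the stylesheet:

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Index each element by the '|'-joined @_ix values on its
       ancestor-or-self axis (the hard-coded equality function). -->
  <xsl:key name="by-ix" match="*"
           use="string-join(ancestor-or-self::*/@_ix, '|')"/>

  <!-- The patch document; the relative URI resolves against the stylesheet. -->
  <xsl:variable name="p" select="doc('p.xml')"/>

  <xsl:template match="/">
    <!-- Keep the fc elements of D whose key value also occurs in P. -->
    <xsl:for-each select="//fc[key('by-ix',
                            string-join(ancestor-or-self::*/@_ix, '|'), $p)]">
      <!-- Print the path with @_ix values in parentheses. -->
      <xsl:value-of select="string-join(('', ancestor-or-self::*/concat(name(),
                              if (@_ix) then concat('(', @_ix, ')') else '')), '/')"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The third argument of key() restricts the lookup to the tree rooted at $p,
so each fc in D costs one index probe instead of a scan over P.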

The above uses XPath 3.0, but it can probably be expressed in XPath 2.0 easily
enough at the cost of hard-coding the equality function.
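
Under the same assumptions, an XPath 2.0 version might look as follows, with
the string-join call written out twice in place of $f. Inside the quantified
expression the context item is still the fc element from d.xml, so the
right-hand side needs no variable:

```xpath
doc('d.xml')//fc[
  some $e in doc('p.xml')//*
  satisfies string-join($e/ancestor-or-self::*/@_ix, '|')
         eq string-join(ancestor-or-self::*/@_ix, '|')
]
```

XPath 2.0 has no "!" operator, so displaying the path of each selected
element would be left to the calling XSLT, for example in an xsl:for-each.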

Michael Kay
Saxonica


On 27 Feb 2014, at 10:25, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:

> <cca><!-- a D XML -->
>  <rela _ix='0' fa='0' fb='1'>
>     <fc _ix='1' fc_fa='X1' fc_fb='1'/>
>     <fc _ix='2' fc_fa='X2' fc_fb='2'/>
>  </rela>
>  <rela _ix='1' fa='10' fb='11'>
>     <fc _ix='1' fc_fa='Y1' fc_fb='11'/>
>     <fc _ix='2' fc_fa='Y2' fc_fb='12'/>
>  </rela>
>  <rela _ix='5' fa='50' fb='51'>
>     <fc _ix='1' fc_fa='A1' fc_fb='51'/>
>     <fc _ix='2' fc_fa='A2' fc_fb='52'/>
>  </rela>
>  <relb>...</relb>
>  <relc>...</relc>
> </cca>
>
> <cca><!-- a P XML -->
>  <rela _ix='1' fa='10'>
>     <fc _ix='1' fc_fa='Y1' fc_fb='99'/>
>  </rela>
>  <rela _ix='5' fa='50' fb='51'>
>     <fc _ix='1'                 fc_fb='51' fc_fc='123'/>
>     <fc _ix='2' fc_fa='A2' fc_fb='52' fc_fc='456'/>
>  </rela>
> </cca>
>
> Expected output:
>
> /cca/rela(1)/fa   10
> /cca/rela(1)/fc(1)/fc_fa   Y1
> /cca/rela(5)/fa   50
> /cca/rela(5)/fb   51
> /cca/rela(5)/fc(1)/fc_fb   51
> /cca/rela(5)/fc(2)/fc_fa   A2
> /cca/rela(5)/fc(2)/fc_fb   52
>
> Note that parentheses enclose values of @_ix.
>
> -W
>
> On 27/02/2014, Michael Kay <mike@xxxxxxxxxxxx> wrote:
>> It would be easier to understand the problem with some example data.
>>
>> Michael Kay
>> Saxonica
>>
>> On 27 Feb 2014, at 08:05, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
>>
>>> The data model for a set of similarly (but not identically) built XML
>>> documents is: a collection of arrays of records, which may contain
>>> (recursively) arrays, records and scalars. (The terms "array" and
>>> "record" are used in their "classic" meaning as, e.g., in Pascal.)
>>> Document structures are fairly stable, but they do change over time.
>>> Array elements are identified (indexed) by @_ix, not by position.
>>> Record fields can be elements or attributes (when they are scalar).
>>> Order is undefined, since XPaths plus @_ix values pinpoint each node.
>>>
>>> One XML document D contains a full population for such a data set
>>> (O(1MB)). A second XML document P contains "patches", i.e., each node
>>> appearing in P is expected to be in D as well.
>>>
>>> If S(P) is the sequence of nodes (annotated with their XPaths) in P
>>> and S(D) the one with nodes from D, how can I determine S(P) intersect
>>> S(D) (except all @_ix, whose values are bound to be identical)? Of
>>> course, I don't want the common set of *data items* - I want the XML
>>> paths of those common data items.
>>>
>>> A solution (in XSLT 2.0) should not need individual adaptation for each
>>> kind of data set.
>>>
>>> I'm confident that I can create text files for D and P containing one
>>> line <path> <value> for each node and run diff (after sort).
>>>
>>> Any better ideas?
>>>
>>> Cheers
>>> Wolfgang
