Re: [xsl] Ordered union of sequences

Subject: Re: [xsl] Ordered union of sequences
From: "Imsieke, Gerrit, le-tex" <gerrit.imsieke@xxxxxxxxx>
Date: Fri, 09 Apr 2010 00:25:51 +0200
On 08.04.2010 17:50, Michael M|ller-Hillebrand wrote:
Am 08.04.2010 um 16:09 schrieb Michael Kay:

It seems to me that you first want to create (or imagine) a graph: in your
example there are arcs k->o, o->p, p->c, a->b, b->c, etc.

Then you want to look for cycles in this graph. If any cycles exist, there
is no solution to your problem.

Thanks Michael, I guess I have to finally understand the graph stuff. I will turn to the example in the 4th edition p.251 ff. for a starter.


Am 08.04.2010 um 16:47 schrieb Imsieke, Gerrit, le-tex:

[... cool, ready-made example ...]

Does that make sense?

If I include<o/>  at the other position, i.e.,
  <seq><k/><f/><z/><o/></seq>,
I receive "Too many nested function calls. May be due to infinite recursion." as expected.

Gerrit, I am at early stages to understand the algorithm. The function counts the maximum number of preceding siblings on any available path and uses this as a sort key. This is some sort of creating a graph, backtracing to the beginning... which is what Michael suggested, isn't it.

Yes. Firstly, the items (all elements below seq) are grouped by element name.


The groups, i.e., the distinct element names, are sorted by the result of the my:sortkey() function, applied to each group's first item.

my:sortkey() is calculated as follows:
Consider the group of all c elements. There are two c elements in total, the first one being in the first sequence, the second one in the last sequence.
my:sortkey() is initialized with the first c element (and with the empty sequence as a second argument, discussed below).
Then the variable preceding-siblings is calculated. Maybe this variable name isn't chosen wisely, since the variable doesn't contain the first c's preceding siblings. It rather contains the respective first preceding siblings for each seq/c element. That is, p of the first sequence and b of the last sequence. For p and for b, my:sortkey() is calculated, the maximum value of these results is then increased by 1 (since c comes after p or b, its sort key should be higher than p's or b's sort keys).
For the calculation of p's sort key, o's sort key has to be calculated, because o is p's predecessor. For b, a's sort key has to be calculated, which is 1 because the only a element doesn't have a predecessor.
o's sort key is an interesting case: if o is contained at the end of the 3rd sequence, the recursion will calculate z's sort key, then in turn f's, then in turn k's (=1) and c's, then in turn b's and p's, then in turn o's and a's (=1). That's where the circle comes into play. The recursion follows Michael Kay's graph in reverse direction (the nodes in this graph are not the XML snippet elements, but the unique element names "o", "p", "b", ...).
There's a path in the graph that origins from "o" and returns to "o", which amounts to saying that the graph is circular.
In order to detect circularity, the second argument to my:sortkey(), $seen, contains all element names that have been seen during recursion.
When calling the function the first time for c, $seen is empty. When calling it for p and for b, $seen is ("c"). When calling the function for o from p, $seen is ("c", "p"). Then for z it's ("c", "p", "o"), then for f it's ("c", "p", "o", "z"), then for c it's ("c", "p", "o", "z", "f") which will terminate with a message that c's position in the unified sequence cannot be determined.
The fact that actual execution of the template stopped with the message that o's position isn't determinate is purely by chance. Swap the first and the second sequence and the template will complain that d's position isn't determinate. This is understandable: if there are circles in the graph, then every node that is on a circle has an indeterminate position on a fictional linear sequence. Which one will actually be reported depends on document order.


So maybe the information that there's a circle that contains o or d isn't so helpuful for your users. So I enhanced the reporting:

  <xsl:function name="my:sortkey" as="xs:integer">
    <xsl:param name="input" as="element(*)" />
    <xsl:param name="seen" as="xs:string*" />
    <xsl:if test="name($input) = $seen">
      <xsl:message terminate="yes">Element
        <xsl:value-of select="name($input)"/>
        doesn't seem to occur at a determinate
        position in a sequence. The circle is
        <xsl:value-of select="string-join((name($input), $seen), '-')"/>
      </xsl:message>
    </xsl:if>
    <xsl:variable
      name="preceding-siblings"
      select="$input/../../seq/*[
                name() = name($input)
              ]/preceding-sibling::*[1]"
      as="element(*)*" />
    <xsl:sequence select="(
                            max(
                              for $ps in $preceding-siblings
                              return my:sortkey($ps, (name($input), $seen))
                            ) + 1,
                            1
                          )[1]"/>
  </xsl:function>

Explaining this approach took three times as long as writing the code, and re-reading your message tells me that I didn't need to explain it since you already grasped it. But it was fun.

Gerrit


Thanks a lot for the most valuable input!


I need a result and the inconsistency report, so the user can fix the input. I will be looking into stopping the infinite recursion somehow.

- Michael


There is an arbitrary number of sequences, sometimes
containing items
with the same name:

(k, o, p, c, f)
(d, e, f, g)
(k, f, z, o)
(a, b, c, d)

I want to create a master sequence which contains every item once,
preserving the original order.

Current Thread