[xsl] Spotting "cousin marriages" in a tree

Subject: [xsl] Spotting "cousin marriages" in a tree
From: Phil Endecott <spam_from_xslt_list@xxxxxxxxxxxx>
Date: Wed, 28 Jul 2004 22:18:50 +0100
Dear XSLT experts,

I have a flat input like this:

<things>
  <thing id="1"> <child idref="2"/> <child idref="3"/> </thing>
  <thing id="2"/>
  <thing id="3"/>
</things>

I want to match up the ids with the idrefs to build a tree that looks like this:

<tree id="1">
  <tree id="2"/>
  <tree id="3"/>
</tree>

I could do it like this:

<xsl:template match="thing">
  <tree>
    <xsl:for-each select="child">
      <xsl:apply-templates
           select="/things/thing[@id=current()/@idref]"/>
    </xsl:for-each>
  </tree>
</xsl:template>

But that is O(n^2), so I'd instead do it like this:

<xsl:key name="things-by-id" match="thing" use="@id"/>
<xsl:template match="thing">
  <tree>
    <xsl:for-each select="child">
      <xsl:apply-templates select="key('things-by-id',@idref)"/>
    </xsl:for-each>
  </tree>
</xsl:template>

which is O(n log n).

This is all fine. But very rarely my input may not describe a pure tree; it may contain "cousin marriages" or more "incestuous" relationships like this:

<things>
  <thing id="1"> <child idref="2"/> <child idref="3"/> </thing>
  <thing id="2"> <child idref="4"/> </thing>
  <thing id="3"> <child idref="4"/> </thing>
  <thing id="4"/>
</things>

If you apply either of the above templates to this, thing 4 appears in the output twice:

<tree id="1">
  <tree id="2">
    <tree id="4"/>
  </tree>
  <tree id="3">
    <tree id="4">
  </tree>
</tree>

which is very bad in my application. What I need is something like this:

<tree id="1">
  <tree id="2">
    <tree id="4"/>
  </tree>
  <tree id="3">
    <error/>
  </tree>
</tree>

I can't afford for this to become an O(n^2) operation.

The obvious thought is that I need to check at the point of recursion whether the thing I am about to process has already been output. But there is no "preceding-in-the-output" axis or other way I can think of to note which things have already been output.

An alternative is to apply a test to the other input elements to see if other things have the same thing as this one as a child, but that fails because my input may legitimately contain redundant nodes like this:

<things>
  <thing id="1"> <child idref="2"/> <child idref="3"/> </thing>
  <thing id="2"/>
  <thing id="3"/>
  <thing id="9"> <child idref="2"/> </thing>
</things>

In this case I want the same output as the very first example, i.e. thing 9 doesn't appear in the output at all.

I can do a multi-pass process using exsl:node-set(), generating the tree with the duplicates in it and then deleting them if they are duplicates, i.e. something like this:

<xsl:template match="/">
  <xsl:variable name="n">
    <xsl:apply-templates select="things/thing[1]"/>  (as above)
  </xsl:variable>
  <xsl:apply-templates select="exsl:node-set($n)" mode="remove-dupes"/>
</xsl:template>

<xsl:template match="tree" mode="remove-dupes">
  <xsl:choose>
    <xsl:when test="preceding::tree[@id=current/@id]">
      <error/>
    </xsl:when>
    <xsl:otherwise>
      <tree id="{@id}">
        <xsl:apply-templates mode="remove-dupes"/>
      </tree>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

There are two problems with this. First, generating the intermediate result could run forever if the input is seriously malformed (e.g. a loop, child refers to parent); checking as it is generated would avoid this. Second, the test "preceding::tree[@id=current/@id]" is O(n^2). Can I use a key to avoid this? How can a key be applied to an exsl:node-set() result?

XSLT often needs some lateral thinking to come up with a good solution. I hope that someone who's read this far has a trick to solve this for me.

Thanks in advance!

--Phil.

Current Thread