Re: [xsl] finding a minimum set of references

Subject: Re: [xsl] finding a minimum set of references
From: Graydon <graydon@xxxxxxxxx>
Date: Wed, 9 Jan 2013 20:55:57 -0500
On Wed, Jan 09, 2013 at 04:56:44AM -0800, Dimitre Novatchev scripsit:
> This can be done in three steps:
>  1. Get the distinct values of concat(@area,'+',@site) for all tree elements.
>  2. Get a sorted sequence of the corresponding `loc` elements that
> were produced in step 1. The sort is by the count of `tree` elements
> that have that particular loc element as descendant.
>  3. Recursively add to the "minList" the top of the sorted loc
> elements that isn't yet in this "minList". Stop condition -- when
> there is nothing more to add.

This is very similar to what I did, after having woken up at the 3 AM
local going "wait! I can do this!" and making rapid notes.  (Having
learnt what happens should I _not_ make those notes!)

(Much thanks to Wolfgang for saying "use keys!" last night.)

What I did was to go through each tree element's loc element children, and
count how many other such loc elements existed anywhere in the data set. (Keys
were very useful for this, and just let me note that trying to stick the result
of a key lookup that returns a node in a comment via xsl:sequence during
debugging is something I should remember not to do more reliably.)

Once I had that count associated with each loc element (@other), I deleted all
the loc elements but the one with the highest value for @other.

Third step was to make a unique list of the @area/@cite pairs contained in
those surviving loc elements, and consider that to represent the minimum
spanning set of documents that contain all the element parent/child pattern

(There are other steps about "the example of this parent/child pattern is in
this document" and fetching those documents and so on, but those aren't in the
code sample below, which is already plenty long enough.)

> I believe that this would really produce the minimum sequence of
> locations -- due to the fact that we pick locations in decreasing
> order of trees that reference them.

I think what I've got, which I suspect is equivalent in a graph-theoretic
sense, also does this.  I am also really really glad my current clients aren't
going to ask me to _prove_ that.

The input set this was tested on had 19,503 documents in it; the minimum
spanning set of examples that results has 294 documents in it, so about 1.5% of
the total.  If that's not actually minimal, it's close enough.  Total,
end-to-end run time (via Saxon 9.4.something in oXygen), from "make a
collection out of this half-GB pile of documents" to "minimum spanning set of
documents in a pile" is about 65 seconds. (The bit below, run independently on
the big set of tree elements, takes about a second.)

I'm really pleased with this result.  (Some of the previous attempts were
taking more than half an hour of run time to not work....)

> I would be willing to produce the code that implements this algorithm,
> but I would need to be provided with a complete (but not too long)
> source XML document -- so that the result could be easily verified

That's extraordinarily kind of you, but I really did just need the algorithm
hint.  (This time, at least! :)

I'm greatly reassured at the similarity of the suggested methodology, too!

Thanks, Dimitre!

-- Graydon

<xsl:stylesheet exclude-result-prefixes="xs xd" version="2.0" xmlns:xd="";
  xmlns:xs=""; xmlns:xsl="";>
  <xd:doc scope="stylesheet">
      <xd:p><xd:b>Created on:</xd:b> Jan 8, 2013</xd:p>
      <xd:p><xd:b>Author:</xd:b> graydon</xd:p>
  <xsl:key match="//loc" name="locKey" use="concat(@area,'|',@cite)"/>
  <xsl:template match="/">
    <xsl:variable name="getLocCounts">
      <xsl:apply-templates mode="countLoc" select="node()"/>
    <xsl:variable name="findMostOther">
      <xsl:apply-templates mode="mostOther" select="$getLocCounts"/>
    <xsl:variable name="makeDocumentList">
      <xsl:apply-templates mode="docList" select="$findMostOther"/>
    <xsl:sequence select="$makeDocumentList"/>
  <xsl:template match="node()|@*" mode="countLoc mostOther docList prettify">
      <xsl:apply-templates mode="#current" select="node()|@*"/>
  <xsl:template match="loc" mode="countLoc">
      <xsl:attribute name="other" select="count(key('locKey',concat(@area,'|',@cite))[not(. is current())])"/>
      <xsl:apply-templates mode="countLoc" select="node()|@*"/>
  <xsl:template match="locations" mode="mostOther">
      <xsl:for-each-group group-by="@other" select="loc">
        <xsl:sort data-type="number" order="descending" select="@other"/>
        <xsl:if test="position() eq 1">
          <xsl:sequence select="current-group()[1]"/>
  <xsl:template match="container" mode="docList">
      <assembly area="testpub" xml:lang="en">
        <xsl:for-each-group group-by="concat(@area,'|',@cite)" select="tree/locations/loc">
          <xsl:sort select="current-grouping-key()"/>
          <include area="{current-group()[1]/@area}" cite="{current-group()[1]/@cite}"/>

Current Thread