Re: [xsl] finding a minimum set of references

Subject: Re: [xsl] finding a minimum set of references
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Wed, 9 Jan 2013 18:39:40 -0800

I am glad this was useful.

Forgive me for not mentioning keys explicitly -- I can't imagine that
someone wouldn't use keys for this :)


On Wed, Jan 9, 2013 at 5:55 PM, Graydon <graydon@xxxxxxxxx> wrote:
> On Wed, Jan 09, 2013 at 04:56:44AM -0800, Dimitre Novatchev scripsit:
>> This can be done in three steps:
>>  1. Get the distinct values of concat(@area,'+',@site) for all tree elements.
>>  2. Get a sorted sequence of the corresponding `loc` elements that
>> were produced in step 1. The sort is by the count of `tree` elements
>> that have that particular loc element as descendant.
>>  3. Recursively add to the "minList" the top of the sorted loc
>> elements that isn't yet in this "minList". Stop condition -- when
>> there is nothing more to add.
> This is very similar to what I did, after having woken up at the 3 AM
> local going "wait! I can do this!" and making rapid notes.  (Having
> learnt what happens should I _not_ make those notes!)
> (Much thanks to Wolfgang for saying "use keys!" last night.)
> What I did was to go through each tree element's loc element children, and
> count how many other such loc elements existed anywhere in the data set. (Keys
> were very useful for this, and just let me note that trying to stick the result
> of a key lookup that returns a node in a comment via xsl:sequence during
> debugging is something I should remember not to do more reliably.)
> Once I had that count associated with each loc element (@other), I deleted all
> the loc elements but the one with the highest value for @other.
> Third step was to make a unique list of the @area/@cite pairs contained in
> those surviving loc elements, and consider that to represent the minimum
> spanning set of documents that contain all the element parent/child pattern
> examples.
> (There are other steps about "the example of this parent/child pattern is in
> this document" and fetching those documents and so on, but those aren't in the
> code sample below, which is already plenty long enough.)
>> I believe that this would really produce the minimum sequence of
>> locations -- due to the fact that we pick locations in decreasing
>> order of trees that reference them.
> I think what I've got, which I suspect is equivalent in a graph-theoretic
> sense, also does this.  I am also really really glad my current clients aren't
> going to ask me to _prove_ that.
> The input set this was tested on had 19,503 documents in it; the minimum
> spanning set of examples that results has 294 documents in it, so about 1.5% of
> the total.  If that's not actually minimal, it's close enough.  Total,
> end-to-end run time (via Saxon 9.4.something in oXygen), from "make a
> collection out of this half-GB pile of documents" to "minimum spanning set of
> documents in a pile" is about 65 seconds. (The bit below, run independently on
> the big set of tree elements, takes about a second.)
> I'm really pleased with this result.  (Some of the previous attempts were
> taking more than half an hour of run time to not work....)
>> I would be willing to produce the code that implements this algorithm,
>> but I would need to be provided with a complete (but not too long)
>> source XML document -- so that the result could be easily verified
> That's extraordinarily kind of you, but I really did just need the algorithm
> hint.  (This time, at least! :)
> I'm greatly reassured at the similarity of the suggested methodology, too!
> Thanks, Dimitre!
> -- Graydon
> <xsl:stylesheet exclude-result-prefixes="xs xd" version="2.0" xmlns:xd="";
>   xmlns:xs=""; xmlns:xsl="";>
>   <xd:doc scope="stylesheet">
>     <xd:desc>
>       <xd:p><xd:b>Created on:</xd:b> Jan 8, 2013</xd:p>
>       <xd:p><xd:b>Author:</xd:b> graydon</xd:p>
>       <xd:p/>
>     </xd:desc>
>   </xd:doc>
>   <xsl:key match="//loc" name="locKey" use="concat(@area,'|',@cite)"/>
>   <xsl:template match="/">
>     <xsl:variable name="getLocCounts">
>       <xsl:apply-templates mode="countLoc" select="node()"/>
>     </xsl:variable>
>     <xsl:variable name="findMostOther">
>       <xsl:apply-templates mode="mostOther" select="$getLocCounts"/>
>     </xsl:variable>
>     <xsl:variable name="makeDocumentList">
>       <xsl:apply-templates mode="docList" select="$findMostOther"/>
>     </xsl:variable>
>     <xsl:sequence select="$makeDocumentList"/>
>   </xsl:template>
>   <xsl:template match="node()|@*" mode="countLoc mostOther docList prettify">
>     <xsl:copy>
>       <xsl:apply-templates mode="#current" select="node()|@*"/>
>     </xsl:copy>
>   </xsl:template>
>   <xsl:template match="loc" mode="countLoc">
>     <xsl:copy>
>       <xsl:attribute name="other" select="count(key('locKey',concat(@area,'|',@cite))[not(. is current())])"/>
>       <xsl:apply-templates mode="countLoc" select="node()|@*"/>
>     </xsl:copy>
>   </xsl:template>
>   <xsl:template match="locations" mode="mostOther">
>     <xsl:copy>
>       <xsl:for-each-group group-by="@other" select="loc">
>         <xsl:sort data-type="number" order="descending" select="@other"/>
>         <xsl:if test="position() eq 1">
>           <xsl:sequence select="current-group()[1]"/>
>         </xsl:if>
>       </xsl:for-each-group>
>     </xsl:copy>
>   </xsl:template>
>   <xsl:template match="container" mode="docList">
>     <wkna-shared-cms>
>       <assembly area="testpub" xml:lang="en">
>         <xsl:for-each-group group-by="concat(@area,'|',@cite)" select="tree/locations/loc">
>           <xsl:sort select="current-grouping-key()"/>
>           <include area="{current-group()[1]/@area}" cite="{current-group()[1]/@cite}"/>
>         </xsl:for-each-group>
>       </assembly>
>     </wkna-shared-cms>
>   </xsl:template>
> </xsl:stylesheet>

Dimitre Novatchev
Truly great madness cannot be achieved without significant intelligence.
To invent, you need a good imagination and a pile of junk
Never fight an inanimate object
To avoid situations in which you might make mistakes may be the
biggest mistake of all
Quality means doing it right when no one is looking.
You've achieved success in your field when you don't know whether what
you're doing is work or play
Facts do not cease to exist because they are ignored.
Typing monkeys will write all Shakespeare's works in 200yrs.Will they
write all patents, too? :)
I finally figured out the only reason to be alive is to enjoy it.

Current Thread