Subject: Re: [xsl] finding a minimum set of references From: Dimitre Novatchev <dnovatchev@xxxxxxxxx> Date: Wed, 9 Jan 2013 18:39:40 -0800 |
Graydon, I am glad this was useful. Forgive me for not mentioning keys explicitly -- I can't imagine that someone wouldn't use keys for this :) Cheers, Dimitre On Wed, Jan 9, 2013 at 5:55 PM, Graydon <graydon@xxxxxxxxx> wrote: > On Wed, Jan 09, 2013 at 04:56:44AM -0800, Dimitre Novatchev scripsit: >> This can be done in three steps: >> >> 1. Get the distinct values of concat(@area,'+',@site) for all tree elements. >> >> 2. Get a sorted sequence of the corresponding `loc` elements that >> were produced in step 1. The sort is by the count of `tree` elements >> that have that particular loc element as descendant. >> >> 3. Recursively add to the "minList" the top of the sorted loc >> elements that isn't yet in this "minList". Stop condition -- when >> there is nothing more to add. > > This is very similar to what I did, after having woken up at the 3 AM > local going "wait! I can do this!" and making rapid notes. (Having > learnt what happens should I _not_ make those notes!) > > (Much thanks to Wolfgang for saying "use keys!" last night.) > > What I did was to go through each tree element's loc element children, and > count how many other such loc elements existed anywhere in the data set. (Keys > were very useful for this, and just let me note that trying to stick the result > of a key lookup that returns a node in a comment via xsl:sequence during > debugging is something I should remember not to do more reliably.) > > Once I had that count associated with each loc element (@other), I deleted all > the loc elements but the one with the highest value for @other. > > Third step was to make a unique list of the @area/@cite pairs contained in > those surviving loc elements, and consider that to represent the minimum > spanning set of documents that contain all the element parent/child pattern > examples. > > (There are other steps about "the example of this parent/child pattern is in > this document" and fetching those documents and so on, but those aren't in the > code sample below, which is already plenty long enough.) > >> I believe that this would really produce the minimum sequence of >> locations -- due to the fact that we pick locations in decreasing >> order of trees that reference them. > > I think what I've got, which I suspect is equivalent in a graph-theoretic > sense, also does this. I am also really really glad my current clients aren't > going to ask me to _prove_ that. > > The input set this was tested on had 19,503 documents in it; the minimum > spanning set of examples that results has 294 documents in it, so about 1.5% of > the total. If that's not actually minimal, it's close enough. Total, > end-to-end run time (via Saxon 9.4.something in oXygen), from "make a > collection out of this half-GB pile of documents" to "minimum spanning set of > documents in a pile" is about 65 seconds. (The bit below, run independently on > the big set of tree elements, takes about a second.) > > I'm really pleased with this result. (Some of the previous attempts were > taking more than half an hour of run time to not work....) > >> I would be willing to produce the code that implements this algorithm, >> but I would need to be provided with a complete (but not too long) >> source XML document -- so that the result could be easily verified > > That's extraordinarily kind of you, but I really did just need the algorithm > hint. (This time, at least! :) > > I'm greatly reassured at the similarity of the suggested methodology, too! > > Thanks, Dimitre! > > -- Graydon > > > <xsl:stylesheet exclude-result-prefixes="xs xd" version="2.0" xmlns:xd="http://www.oxygenxml.com/ns/doc/xsl" > xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"> > <xd:doc scope="stylesheet"> > <xd:desc> > <xd:p><xd:b>Created on:</xd:b> Jan 8, 2013</xd:p> > <xd:p><xd:b>Author:</xd:b> graydon</xd:p> > <xd:p/> > </xd:desc> > </xd:doc> > <xsl:key match="//loc" name="locKey" use="concat(@area,'|',@cite)"/> > <xsl:template match="/"> > <xsl:variable name="getLocCounts"> > <xsl:apply-templates mode="countLoc" select="node()"/> > </xsl:variable> > <xsl:variable name="findMostOther"> > <xsl:apply-templates mode="mostOther" select="$getLocCounts"/> > </xsl:variable> > <xsl:variable name="makeDocumentList"> > <xsl:apply-templates mode="docList" select="$findMostOther"/> > </xsl:variable> > <xsl:sequence select="$makeDocumentList"/> > </xsl:template> > <xsl:template match="node()|@*" mode="countLoc mostOther docList prettify"> > <xsl:copy> > <xsl:apply-templates mode="#current" select="node()|@*"/> > </xsl:copy> > </xsl:template> > <xsl:template match="loc" mode="countLoc"> > <xsl:copy> > <xsl:attribute name="other" select="count(key('locKey',concat(@area,'|',@cite))[not(. is current())])"/> > <xsl:apply-templates mode="countLoc" select="node()|@*"/> > </xsl:copy> > </xsl:template> > <xsl:template match="locations" mode="mostOther"> > <xsl:copy> > <xsl:for-each-group group-by="@other" select="loc"> > <xsl:sort data-type="number" order="descending" select="@other"/> > <xsl:if test="position() eq 1"> > <xsl:sequence select="current-group()[1]"/> > </xsl:if> > </xsl:for-each-group> > </xsl:copy> > </xsl:template> > <xsl:template match="container" mode="docList"> > <wkna-shared-cms> > <assembly area="testpub" xml:lang="en"> > <xsl:for-each-group group-by="concat(@area,'|',@cite)" select="tree/locations/loc"> > <xsl:sort select="current-grouping-key()"/> > <include area="{current-group()[1]/@area}" cite="{current-group()[1]/@cite}"/> > </xsl:for-each-group> > </assembly> > </wkna-shared-cms> > </xsl:template> > </xsl:stylesheet> > -- Cheers, Dimitre Novatchev --------------------------------------- Truly great madness cannot be achieved without significant intelligence. --------------------------------------- To invent, you need a good imagination and a pile of junk ------------------------------------- Never fight an inanimate object ------------------------------------- To avoid situations in which you might make mistakes may be the biggest mistake of all ------------------------------------ Quality means doing it right when no one is looking. ------------------------------------- You've achieved success in your field when you don't know whether what you're doing is work or play ------------------------------------- Facts do not cease to exist because they are ignored. ------------------------------------- Typing monkeys will write all Shakespeare's works in 200yrs.Will they write all patents, too? :) ------------------------------------- I finally figured out the only reason to be alive is to enjoy it.
Current Thread |
---|
|
<- Previous | Index | Next -> |
---|---|---|
Re: [xsl] finding a minimum set of , Graydon | Thread | Re: [xsl] finding a minimum set of , Graydon |
Re: [xsl] finding a minimum set of , Graydon | Date | Re: [xsl] finding a minimum set of , Graydon |
Month |