RE: [xsl] Calculating groups of repeating elements

Subject: RE: [xsl] Calculating groups of repeating elements
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 11 Dec 2008 19:56:00 -0000
Given that you have 250 places and 75 words, it seems to me that one could
approach this as follows:

(1) Construct all pairs of words that appear together in at least one place,
and find all the places in which this pair occurs. Discard the groups that
appear in only one place. This gives you all the groups of size 2.

(2) For each group of size 2, and the set of places in which this group
appears, construct candidate groups of size 3 by combining the 2-group with
all the other words appearing in those places. Again, discard the 3-groups
that appear in only one place. This gives you all the 3-groups

(3) Continue recursively to construct the 4-groups, 5-groups, etc

(4) Now add a step (0) before step (1) which constructs the 1-groups; the
expansion to 2-groups uses exactly the same logic.

Now finding the 1-groups looks like this:

<xsl:function name="f:level-one-groups" as="element(group)*">
<for-each-group select="//word" group-by=".">
  <xsl:if test="count(current-group()) gt 1">
    <group>
      <word><xsl:value-of select="current-grouping-key()"/></word>
      <xsl:copy-of select="place_number"/>
    </group>
  </xsl:if>
</xsl:for-each-group> 
</xsl:function>

and finding the level-n groups given the level-n-1-groups is:

<xsl:function name="f:level-n-groups" as="element(group)*">
  <xsl:param name="level-n-1-groups" as="element(group)*"/>
      <xsl:variable name="g" select="$level-n-1-groups"/>
      <xsl:for-each select="$g">
        <xsl:variable name="places" 
                      select="//place[place-number = $g/place_number]"/>
        <xsl:variable name="otherWords" 
                      select="$places/words/word[not(. = $g/word)]"/>
        <xsl:variable name="n-gram" 
                      select="$g/word, ."/>
        <xsl:variable name="places-with-n-gram" 
                      select="$places[every $w in n-gram satisfies
./words/word = $w]"/>
        <xsl:if test="count($places-with-n-gram) gt 1">
          <group>
            <xsl:for-each select="$n-gram">
               <word><xsl:value-of select="."/>
            </xsl:for-each>
            <xsl:for-each select="$places-with-n-gram">
               <word><xsl:value-of select="."/>
            </xsl:for-each>
          </group>
        </xsl:if>
     </xsl:for-each>
</xsl:function>

then you just have to do a recursion in which you start by finding the
level-1 groups, then recurse to levels 2, 3, 4 etc until you find there are
no groups at a particular level.

Not tested, of course, and almost certainly capable of improvement.

Michael Kay
http://www.saxonica.com/
          

> -----Original Message-----
> From: Quinn Dombrowski [mailto:qdombrow@xxxxxxxx] 
> Sent: 10 December 2008 20:16
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] Calculating groups of repeating elements
> 
> Hello,
> 
> I'm trying to calculate all of the groups of 2+ elements (in 
> the sample data below, words) that appear together in more 
> than one place. Ideally, I'd like to be able to sort 
> descending both by length of group (5-word group, 4-word 
> groups, etc), and by number of places the groups occur (100 
> places, 99 places, etc.) I also need to be able to list the 
> place numbers where they occur.
> 
> I started doing it manually this way but the number of 
> possible combinations quickly became too big a task:
> 
> <xsl:template match="/">
> <xsl:value-of 
> select="count(atlas/place/place_number[../words/word='Aa']
> intersect atlas/place/place_number[../words/word='C'])"/>
> </template>
> (adding more "intersects" as necessary, and getting rid of 
> the "count" 
> to see the place numbers)
> 
> Here's a sample of the data. Almost every word appears in 
> multiple places, but each appears only once in the index, 
> which I've used in other applications for matching to avoid 
> re-calculating stats for the word over and over. Any help 
> would be wonderful!
> 
> <atlas>
> <place>
> <place_number>1</place_number>
> <words>
> <word>Aa</word>
> <word>C</word>
> <word>Qqq</word>
> </words>
> </place>
> 
> <place>
> <place_number>2</place_number>
> <words>
> <word>Aa</word>
> <word>Bbbb</word>
> <word>C</word>
> <word>W</word>
> <word>Zz</word>
> </words>
> </place>
> 
> <place>
> <place_number>3</place_number>
> <words>
> <word>Aa</word>
> <word>C</word>
> <word>Bb</word>
> <word>Qqq</word>
> <word>Wwww</word>
> <word>Zz</word>
> </words>
> </place>
> 
> [etc]
> 
> <index>
> 
> <index_entry>
> <underlying_word>A</underlying_word>
> <word>A</word>
> <word>Aa</word>
> <word>Aaa</word>
> </index_entry>
> 
> <index_entry>
> <underlying_word>B</underlying_word>
> <word>Bb</word>
> <word>Bbbb</word>
> </index_entry>
> 
> [etc]
> 
> </index>
> </atlas>

Current Thread