Re: [xsl] Calculating groups of repeating elements

Subject: Re: [xsl] Calculating groups of repeating elements
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Thu, 11 Dec 2008 11:01:54 -0500
Hi Quinn,

I'm not sure I follow the requirement perfectly either, but I thought this was interesting enough to give it some thought overnight (I find puzzles can be relaxing), and maybe this idea would complement what Michael has suggested.

It seems to me that if you want to collect groups of 2+ words that appear in 2+ places, a useful first step would be to collect the set of intersections of words occurring in every pairing of places. This would be a large number, n(n-1)/2 for n places, but not the huge exponent of 2 cited by Michael, and hence possibly a more direct route to your goal.

That is, for data:

<atlas>
  <place>
    <place_number>1</place_number>
    <words>
      <word>Aa</word>
      <word>C</word>
      <word>Qqq</word>
    </words>
  </place>
  <place>
    <place_number>2</place_number>
    <words>
      <word>Aa</word>
      <word>Bbbb</word>
      <word>C</word>
      <word>W</word>
      <word>Zz</word>
    </words>
  </place>

  <place>
    <place_number>3</place_number>
    <words>
      <word>Aa</word>
      <word>C</word>
      <word>Bb</word>
      <word>Qqq</word>
      <word>Wwww</word>
      <word>Zz</word>
    </words>
  </place>

</atlas>

this template

<xsl:template match="atlas">
    <collection>
      <xsl:for-each select="place">
        <xsl:variable name="first" select="."/>
        <xsl:for-each select="preceding-sibling::place">
          <xsl:variable name="second" select="."/>
          <common_words>
            <xsl:copy-of select="$first/place_number, $second/place_number"/>
            <words>
              <xsl:copy-of select="$first/words/word[.=$second/words/word]"/>
            </words>
          </common_words>
        </xsl:for-each>
      </xsl:for-each>
    </collection>
</xsl:template>

yields this result:

<?xml version="1.0" encoding="UTF-8"?>
<collection>
   <common_words>
      <place_number>2</place_number>
      <place_number>1</place_number>
      <words>
         <word>Aa</word>
         <word>C</word>
      </words>
   </common_words>
   <common_words>
      <place_number>3</place_number>
      <place_number>1</place_number>
      <words>
         <word>Aa</word>
         <word>C</word>
         <word>Qqq</word>
      </words>
   </common_words>
   <common_words>
      <place_number>3</place_number>
      <place_number>2</place_number>
      <words>
         <word>Aa</word>
         <word>C</word>
         <word>Zz</word>
      </words>
   </common_words>
</collection>

You didn't say how many places you have, so I don't know how large the set will get.

While this isn't quite what you want, the results you want could be derived by grouping these lists further, skipping pairings that contain fewer than two 'word' elements, and collecting together those that have the same sets (and thus represent sets of words that occur in more than two places).
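
As a rough, untested sketch of that second pass (assuming XSLT 2.0, and run over the <collection> result shown above rather than the original atlas), something along these lines might do. The my:word-key function, the urn:example:local namespace and the groups/group/places output elements are made up for illustration, and the key assumes no word contains a '|' character:

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:local"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical helper: builds an order-independent key for a set of
       words by sorting them and joining with '|' (assumed not to occur
       in any word). -->
  <xsl:function name="my:word-key" as="xs:string">
    <xsl:param name="words" as="element(word)*"/>
    <xsl:variable name="sorted" as="xs:string*">
      <xsl:perform-sort select="distinct-values($words/string(.))">
        <xsl:sort select="."/>
      </xsl:perform-sort>
    </xsl:variable>
    <xsl:sequence select="string-join($sorted, '|')"/>
  </xsl:function>

  <xsl:template match="collection">
    <groups>
      <!-- Skip pairings with fewer than two shared words, then group
           the remaining pairings by their set of words. -->
      <xsl:for-each-group select="common_words[count(words/word) ge 2]"
                          group-by="my:word-key(words/word)">
        <!-- Longest word sets first, then those found in the most places. -->
        <xsl:sort select="count(words/word)" order="descending"/>
        <xsl:sort select="count(distinct-values(current-group()/place_number))"
                  order="descending"/>
        <group>
          <!-- The word list of the first pairing stands for the whole group. -->
          <xsl:copy-of select="words"/>
          <places>
            <!-- Each place number once, in ascending numeric order. -->
            <xsl:for-each-group select="current-group()/place_number" group-by=".">
              <xsl:sort select="number(.)"/>
              <xsl:copy-of select="."/>
            </xsl:for-each-group>
          </places>
        </group>
      </xsl:for-each-group>
    </groups>
  </xsl:template>

</xsl:stylesheet>

Note that this only collects pairings whose intersections are identical; a set of words that is merely contained in another pairing's intersection is not merged into it, so a further step would be needed if those cases matter.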

If this approach is unsound, I'm sure a friendly mathematician can explain why. :->

I hope this helps,
Wendell

At 03:15 PM 12/10/2008, you wrote:
Hello,

I'm trying to calculate all of the groups of 2+ elements (in the sample data below, words) that appear together in more than one place. Ideally, I'd like to be able to sort descending both by length of group (5-word group, 4-word groups, etc), and by number of places the groups occur (100 places, 99 places, etc.) I also need to be able to list the place numbers where they occur.

I started doing it manually this way but the number of possible combinations quickly became too big a task:

<xsl:template match="/">
<xsl:value-of select="count(atlas/place/place_number[../words/word='Aa'] intersect atlas/place/place_number[../words/word='C'])"/>
</xsl:template>
(adding more "intersects" as necessary, and getting rid of the "count" to see the place numbers)


Here's a sample of the data. Almost every word appears in multiple places, but each appears only once in the index, which I've used in other applications for matching to avoid re-calculating stats for the word over and over. Any help would be wonderful!

<atlas>
<place>
<place_number>1</place_number>
<words>
<word>Aa</word>
<word>C</word>
<word>Qqq</word>
</words>
</place>

<place>
<place_number>2</place_number>
<words>
<word>Aa</word>
<word>Bbbb</word>
<word>C</word>
<word>W</word>
<word>Zz</word>
</words>
</place>

<place>
<place_number>3</place_number>
<words>
<word>Aa</word>
<word>C</word>
<word>Bb</word>
<word>Qqq</word>
<word>Wwww</word>
<word>Zz</word>
</words>
</place>

[etc]

<index>

<index_entry>
<underlying_word>A</underlying_word>
<word>A</word>
<word>Aa</word>
<word>Aaa</word>
</index_entry>

<index_entry>
<underlying_word>B</underlying_word>
<word>Bb</word>
<word>Bbbb</word>
</index_entry>

[etc]

</index>
</atlas>


======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
