[xsl] Re: grouping and word counting

Subject: [xsl] Re: grouping and word counting
From: "Dimitre Novatchev" <dnovatchev@xxxxxxxxx>
Date: Sat, 19 Jul 2003 18:56:04 +0200

Hi Marina,

One can use the string tokeniser from FXSL (the "str-split-to-words"
template) in order to obtain a list of words from a string and then count
them.

This, combined with the Muenchian method for grouping, gives us the
following solution.
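
Before the full stylesheet, here is a minimal sketch of the Muenchian step
on its own, assuming the same LOG/SENT/MESSAGE structure as in your
source.xml. It only lists each distinct message once, preceded by the
number of times it occurs; normalize-space() is used here just to tidy the
printed text and is not part of the solution below:

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Index every MESSAGE by its text -->
  <xsl:key name="kMsg" match="MESSAGE" use="."/>

  <xsl:template match="/">
    <!-- Keep only the first MESSAGE of each group of equal texts -->
    <xsl:for-each
      select="/*/*/MESSAGE[generate-id()
                          =
                           generate-id(key('kMsg', .)[1])
                          ]">
      <!-- Group size, then the message text -->
      <xsl:value-of select="concat(count(key('kMsg', .)),
                                   ' ',
                                   normalize-space(.),
                                   '&#xA;'
                                   )"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

On the source.xml shown further down this should print:

1 hello Fred
2 nice weather
2 whats the latest news?

The full solution uses the same key and then groups a second time, by count.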

This transformation:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:ext="http://exslt.org/common"
 exclude-result-prefixes="ext">

 <xsl:import href="strSplit-to-Words.xsl"/>

  <xsl:output method="text"/>

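  <!-- Groups MESSAGE elements by their text (Muenchian grouping) -->
  <xsl:key name="kMsg" match="MESSAGE" use="."/>

  <!-- Groups the pass-1 m elements by how many times each message was sent -->
  <xsl:key name="kByCount" match="m" use="@count"/>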

  <xsl:template match="/">
    <xsl:variable name="vPass1">
      <xsl:for-each
        select="/*/*/MESSAGE[generate-id()
                            =
                             generate-id(key('kMsg',
                                             .
                                             )[1]
                                         )
                             ]">
         <xsl:sort select="count(key('kMsg',.))"
                   data-type="number"/>
         <m count="{count(key('kMsg',.))}"
            text="{.}"/>
      </xsl:for-each>
    </xsl:variable>

    <xsl:for-each
    select="ext:node-set($vPass1)/m
                   [generate-id()
                   =
                    generate-id(key('kByCount',
                                     @count
                                    )[1]
                                )
                   ]">
      <!-- Order the output rows by group size -->
      <xsl:sort select="@count"
           data-type="number"/>

      <xsl:variable name="vAllText">
        <xsl:for-each select="key('kByCount', @count)">
          <xsl:value-of select="concat(' ', @text, ' ')"/>
        </xsl:for-each>
      </xsl:variable>

      <xsl:variable name="vrtfWords">
        <xsl:call-template name="str-split-to-words">
          <xsl:with-param name="pStr" select="$vAllText"/>
          <xsl:with-param name="pDelimiters" select="' '"/>
        </xsl:call-template>
      </xsl:variable>

      <xsl:variable name="vAvWords"
       select="(count(ext:node-set($vrtfWords)/word) - 1)
             div
               count(key('kByCount', @count))"/>

      <!-- Columns: group size, number of distinct messages, average words -->
      <xsl:value-of select="concat(@count,
                                   ' ',
                                   count(key('kByCount',
                                             @count
                                            )
                                        ),
                                   ' ',
                                   $vAvWords,
                                   '&#xA;'
                                   )"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>


when applied to your source.xml:

<LOG>
  <SENT>
    <USER> 12345 </USER>
    <LOCATION> 55555 </LOCATION>
    <TARGET> 1 </TARGET>
    <TARGET_LOCATION> 23222 </TARGET_LOCATION>
    <MESSAGE> hello Fred </MESSAGE>
  </SENT>
  <SENT>
    <USER> 77777 </USER>
    <LOCATION> 76666 </LOCATION>
    <TARGET> 3 </TARGET>
    <TARGET_LOCATION> 34444 </TARGET_LOCATION>
    <MESSAGE> nice weather </MESSAGE>
  </SENT>
  <SENT>
    <USER> 77777 </USER>
    <LOCATION> 76666 </LOCATION>
    <TARGET> 4 </TARGET>
    <TARGET_LOCATION> 67777 </TARGET_LOCATION>
    <MESSAGE> nice weather </MESSAGE>
  </SENT>
  <SENT>
    <USER> 33333 </USER>
    <LOCATION> 12666 </LOCATION>
    <TARGET> 8 </TARGET>
    <TARGET_LOCATION> 98765 </TARGET_LOCATION>
    <MESSAGE> whats the latest news? </MESSAGE>
  </SENT>
  <SENT>
    <USER> 33333 </USER>
    <LOCATION> 12666 </LOCATION>
    <TARGET> 9 </TARGET>
    <TARGET_LOCATION> 46578 </TARGET_LOCATION>
    <MESSAGE> whats the latest news? </MESSAGE>
  </SENT>
</LOG>


produces the desired result:

1 1 2
2 2 3
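
The last column is the average over the distinct messages in each group:
for group size 2 the messages have 2 and 4 words, so (2 + 4) div 2 = 3.

As a cross-check that does not need FXSL, here is a sketch of a common
XSLT 1.0 idiom that counts space-separated words by comparing string
lengths with and without the spaces. This is a different technique from the
tokeniser used above, and the template name "count-words" and parameter
"pText" are made up for this illustration:

```xml
<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical helper: number of space-separated words in pText -->
  <xsl:template name="count-words">
    <xsl:param name="pText"/>
    <!-- Trim the ends and collapse runs of whitespace to single spaces -->
    <xsl:variable name="vNorm" select="normalize-space($pText)"/>
    <xsl:choose>
      <xsl:when test="$vNorm = ''">0</xsl:when>
      <xsl:otherwise>
        <!-- words = number of single spaces + 1 -->
        <xsl:value-of
          select="string-length($vNorm)
                - string-length(translate($vNorm, ' ', ''))
                + 1"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template match="/">
    <!-- One line of output per MESSAGE, duplicates included -->
    <xsl:for-each select="/*/*/MESSAGE">
      <xsl:call-template name="count-words">
        <xsl:with-param name="pText" select="."/>
      </xsl:call-template>
      <xsl:text>&#xA;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Applied to the same source.xml, it should print 2, 2, 2, 4 and 4, one
number per line (one line per MESSAGE).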


Hope this helped.


=====
Cheers,

Dimitre Novatchev.
http://fxsl.sourceforge.net/ -- the home of FXSL


"marina" <marina777uk@xxxxxxxxx> wrote in message
news:20030719075801.60127.qmail@xxxxxxxxxxxxxxxxxxxxxxxxxx
> Hi,
>
> I have an XML document that contains messages sent by
> people to one another. Many of these messages in the
> <MESSAGE> tags are repeated as they are sent by one
> person to many others.
>
> XML Snippet:
> --------------------------------------------------
> <LOG>
>    <SENT>
>       <USER> 12345 </USER>
>       <LOCATION> 55555 </LOCATION>
>       <TARGET> 1 </TARGET>
>       <TARGET_LOCATION> 23222 </TARGET_LOCATION>
>       <MESSAGE> hello Fred </MESSAGE>
>    </SENT>
>    <SENT>
>       <USER> 77777 </USER>
>       <LOCATION> 76666 </LOCATION>
>       <TARGET> 3 </TARGET>
>       <TARGET_LOCATION> 34444 </TARGET_LOCATION>
>       <MESSAGE> nice weather </MESSAGE>
>    </SENT>
>    <SENT>
>       <USER> 77777 </USER>
>       <LOCATION> 76666 </LOCATION>
>       <TARGET> 4 </TARGET>
>       <TARGET_LOCATION> 67777 </TARGET_LOCATION>
>       <MESSAGE> nice weather </MESSAGE>
>    </SENT>
>    <SENT>
>       <USER> 33333 </USER>
>       <LOCATION> 12666 </LOCATION>
>       <TARGET> 8 </TARGET>
>       <TARGET_LOCATION> 98765 </TARGET_LOCATION>
>       <MESSAGE> whats the latest news? </MESSAGE>
>    </SENT>
>    <SENT>
>       <USER> 33333 </USER>
>       <LOCATION> 12666 </LOCATION>
>       <TARGET> 9 </TARGET>
>       <TARGET_LOCATION> 46578 </TARGET_LOCATION>
>       <MESSAGE> whats the latest news? </MESSAGE>
>    </SENT>
> </LOG>
> --------------------------------------------------
> What I need to do is:-
>
> 1) Find out how many messages over all were sent to 1,
> 2, 3 etc people.
>
> As a duplicated message will always follow the
> original, i.e. be the next <MESSAGE> tag of the
> following sibling node, I'm thinking that the
> stylesheet would start with the first message and keep
> comparing siblings until it found one that was
> different. Then it would just add the previous number
> of sibling nodes? ( I probably need to use keys?)
>
> 2) For each of the total messages per group size,
> calculate the average number of words. No idea on this
> one I'm afraid!
>
> So the desired output from the snippet above would be:
> -
>
> Group Size Number of Messages Av Number Words
>     1 1 2
>     2 2 3
>  (up to say 20)
>
> Many thanks in advance for any help,
>
> Marina
>
>  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list