Re: [xsl] Collect word count with xslt2.0 on saxon 8

Subject: Re: [xsl] Collect word count with xslt2.0 on saxon 8
From: George Cristian Bina <george@xxxxxxxxxxxxx>
Date: Tue, 16 May 2006 10:04:22 +0300
Hello Karen,

You can get the word count more easily than that. First collect into a variable the text that belongs to an element with topic/topic, excluding the text of any elements inside it that carry the same mark, and then just count the words in that variable.
To collect the text, once we match a topic/topic element we apply templates in a new mode. In that mode we do nothing on elements with topic/topic, so their text content is excluded, while the built-in template rules copy the remaining text.


The following stylesheet shows that:

<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:template match="/">
    <counts>
      <xsl:apply-templates/>
    </counts>
  </xsl:template>
  <xsl:template match="text()"/>
  <xsl:template match="*[contains(@class, 'topic/topic')]">
    <xsl:variable name="text">
      <xsl:apply-templates mode="getText" select="node()"/>
    </xsl:variable>
    <record>
      <text>
        <xsl:value-of select="$text"/>
      </text>
      <count>
        <xsl:value-of
          select="count(tokenize(lower-case($text),'(\s|[,.!:;]|[n][b][s][p][;])+')[string(.)])"
        />
      </count>
    </record>
    <xsl:apply-templates/>
  </xsl:template>

  <xsl:template match="*[contains(@class, 'topic/topic')]"
    mode="getText"/>
</xsl:stylesheet>

On your sample input it gives:

<?xml version="1.0" encoding="UTF-8"?>
<counts>
   <record>
      <text>
     communications and information theory
     top element
     elements can be nested Generalized Markup
    Language defined by ISO 8879.
  </text>
      <count>17</count>
   </record>
   <record>
      <text>
       communications and information theory
       top element
       elements can be nested (for a number of
      technical reasons beyond the scope of this article).
    </text>
      <count>22</count>
   </record>
   <record>
      <text>
         communications and information theory
         top element
         elements can be nested maintain repositories
        of structured documentation for more than a decade, but it is
        not well
      </text>
      <count>25</count>
   </record>
   <record>
      <text> But
          the metrics for XML on the Web  communications and
            information theory
           top element
           elements can be nested measures, or are a
          little polluted by voodoo ideology about good </text>
      <count>29</count>
   </record>
</counts>
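
As a possible alternative to the getText mode above, here is a minimal sketch that selects the same text with a single XPath 2.0 expression: it keeps only the text nodes whose nearest topic/topic ancestor is the element being processed. The variable names topic and ownText are just illustrative, and I have not run this against the full content, so treat it as a sketch rather than a tested replacement.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output indent="yes"/>
  <xsl:template match="/">
    <counts>
      <xsl:apply-templates/>
    </counts>
  </xsl:template>
  <xsl:template match="text()"/>
  <xsl:template match="*[contains(@class, 'topic/topic')]">
    <!-- Remember the current element so the predicate below can refer to it. -->
    <xsl:variable name="topic" select="."/>
    <!-- Keep only descendant text nodes whose nearest topic/topic ancestor
         is this element; text inside nested topic/topic elements is skipped. -->
    <xsl:variable name="ownText"
      select="string-join(.//text()[ancestor::*[contains(@class, 'topic/topic')][1] is $topic], '')"/>
    <record>
      <count>
        <xsl:value-of
          select="count(tokenize(lower-case($ownText),'(\s|[,.!:;]|[n][b][s][p][;])+')[string(.)])"/>
      </count>
    </record>
    <!-- Continue into nested topic/topic elements so each gets its own record. -->
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
```

The mode-based version is probably easier to extend if other attributes need the same treatment, since each exclusion is just one more empty template in that mode.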

Best Regards,
George
---------------------------------------------------------------------
George Cristian Bina
<oXygen/> XML Editor, Schema Editor and XSLT Editor/Debugger
http://www.oxygenxml.com


Karen McAdams wrote:
I have the following structure that I need to collect
word counts for, from each element that has a class
attribute that contains " topic/topic ", without counting its child elements that also contain
the class attribute " topic/topic ".




<root>
<topic class=" topic/topic foo/bar ">
<p> communications and information theory</p>
<title> top element</title>
<relinfo> elements can be nested</relinfo> Generalized Markup Language defined by ISO
8879.
<concept class=" topic/topic foo/bar ">
<p> communications and information
theory</p>
<title> top element</title>
<relinfo> elements can be nested</relinfo>


(for a number of technical reasons beyond
the scope of this article).
<topic class=" topic/topic foo/bar ">
<p> communications and information
theory</p>
<title> top element</title>
<relinfo> elements can be
nested</relinfo> maintain repositories of structured documentation for more than a decade, but it is not
well <concept class=" topic/topic foo/bar
">
But the metrics for XML on the Web
<p> communications and
information theory</p>
<title> top element</title>
<relinfo> elements can be
nested</relinfo> measures, or are a little polluted
by voodoo ideology about good </concept>
</topic>
</concept>
</topic>
</root>


I have this template that gets the word count for each
element and its child elements, including the elements
that have class attributes that contain " topic/topic ".

    <xsl:template match="*[contains(@class, 'topic/topic ')]">
        <xsl:variable name="level"
            select="count(ancestor::*[contains(@class, 'topic/topic ')]) + 1"/>
        <xsl:variable name="ct" select="if ($level = 1) then concat(title,' ') else ' '"/>
        <xsl:variable name="h1" select="if ($level = 2) then concat(title,' ') else ' '"/>
        <xsl:variable name="h2" select="if ($level = 3) then concat(title,' ') else ' '"/>
        <xsl:variable name="h3" select="if ($level = 4) then concat(title,' ') else ' '"/>

        <xsl:variable name="wc"
            select="count(tokenize(lower-case(.),'(\s|[,.!:;]|[n][b][s][p][;])+')[string(.)])"/>

        <xsl:apply-templates/>
    </xsl:template>


I added another template that contains the count of its child elements b

    <xsl:template match="*[contains(@class, 'topic/topic ')]" mode="filterCount">
        <sum>
            <xsl:value-of
                select="count(tokenize(lower-case(.),'(\s|[,.!:;]|[n][b][s][p][;])+')[string(.)])"/>
        </sum>
    </xsl:template>


That I store in a variable and then subtract from the
total within the first template above

        <xsl:variable name="childcounts">
            <sums>
                <xsl:apply-templates mode="filterCount"/>
            </sums>
        </xsl:variable>

        <xsl:variable name="total-child" select="sum($childcounts/sums/sum)"/>
        <xsl:variable name="total-roman" select="sum($wc - $total-child)"/>


I would like to find a more elegant approach to this because there are also other attributes in this content that need to have the same technique applied to b

Would it be a better approach to copy the elements to
another document node and then perform the word count
which would be applied recursively to all child
elements to arrive at the count and what would this
template match look like?
