Re: How can I filter stoppwords

Subject: Re: How can I filter stoppwords
From: Jeni Tennison <mail@xxxxxxxxxxxxxxxx>
Date: Sat, 02 Sep 2000 09:24:29 +0100

Barbara,

>Does anybody know another way to filter stopp words? 

I'm not sure, but I think you were only after filtering stop words that
start the name of the book?  Adapting Eric's solution:

The xsl:stylesheet element declares the XSLT namespace and the additional
namespace 'sw', which is used for the internal data (the list of stop words).
To prevent this namespace being declared in your output, use
'exclude-result-prefixes':

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:sw="mailto:vdv@xxxxxxxxxxxx"
                exclude-result-prefixes="sw">
  ...
</xsl:stylesheet>

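Then the declaration of the stop words that you want to filter out, as a
top-level element in the stylesheet.  The variable picks them up through
document(''), which refers to the stylesheet itself, so that they can be
accessed easily: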

<sw:stop>
  <word>the</word>
  <word>a</word>
  <word>is</word>
</sw:stop>

<xsl:variable name="stop-words" 
              select="document('')/xsl:stylesheet/sw:stop/word" />

Declaration of two variables so that we can translate between upper and
lower case fairly easily:

<xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
<xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />
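
As a quick illustration of how these two variables get used, here is a
minimal standalone sketch.  The title 'The Hobbit' is just a made-up
example.  translate() swaps each character found in its second argument for
the character at the same position in its third, so this should output
'the hobbit'.  Only the 26 unaccented letters are covered.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
  <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />

  <xsl:template match="/">
    <!-- Each uppercase letter is replaced by its lowercase partner -->
    <xsl:value-of select="translate('The Hobbit', $uppercase, $lowercase)" />
  </xsl:template>
</xsl:stylesheet>
```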

Now the template.  I've only used one for brevity, but of course you can
split it into several by calling and applying templates.  Within this
template, I iterate through each of the titles.  For each title, I find the
stop words such that the current title starts with that stop word followed
by a space (ignoring case).  If there is a match, then substring() strips
the word and the space after it: the result starts at position
string-length($begins-with) + 2.

<xsl:template match="/">
  <result>
    <xsl:for-each select="xmlfile/book/title">
      <before><xsl:value-of select="." /></before>
      <xsl:variable name="begins-with"
        select="$stop-words[starts-with(
                  translate(current(), $uppercase, $lowercase),
                  concat(translate(., $uppercase, $lowercase), ' '))]" />
      <after>
        <xsl:choose>
          <xsl:when test="$begins-with">
            <xsl:value-of
              select="substring(., string-length($begins-with) + 2)" />
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="." />
          </xsl:otherwise>
        </xsl:choose>
      </after>
    </xsl:for-each>
  </result>
</xsl:template>

This strips leading stop words in SAXON and MSXML (July).  It works in
Xalan-C++ v.0.40.0 except for the exclude-result-prefixes thing, which is
ignored.
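
To make that concrete, here is a small sample input.  The element names
follow the paths used in the template (xmlfile/book/title), but the three
titles are made up for illustration:

```xml
<xmlfile>
  <book><title>The Hobbit</title></book>
  <book><title>A Tale of Two Cities</title></book>
  <book><title>Emma</title></book>
</xmlfile>
```

With the stop words 'the', 'a' and 'is' declared as above, the result
should look like this (ignoring the XML declaration and whitespace, which
depend on the processor and output settings):

```xml
<result>
  <before>The Hobbit</before>
  <after>Hobbit</after>
  <before>A Tale of Two Cities</before>
  <after>Tale of Two Cities</after>
  <before>Emma</before>
  <after>Emma</after>
</result>
```

Note that a title like 'Theory of Everything' would be left alone, because
the test looks for the stop word followed by a space.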

However...

>How do you XSL-create a sort criterion? 

...you can't (at the moment) use a template to create a string to use as a
sort criterion.  Sort criteria have to be XPath expressions in the select
attribute of xsl:sort.  This problem will go away when you can (a) convert
result tree fragments (RTFs) to node sets and/or (b) use something like
saxon:function to declare extension functions within XSLT.
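
To give an idea of what option (a) might look like, here is a rough sketch.
It assumes a processor that supports the EXSLT exsl:node-set() extension
function, which is an assumption on my part and not something I've tested
against the processors mentioned here.  The 'sort-key' template and the
'entry' and 'key' elements are names I've invented for the sketch.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:exsl="http://exslt.org/common"
                xmlns:sw="mailto:vdv@xxxxxxxxxxxx"
                exclude-result-prefixes="sw exsl">

  <sw:stop>
    <word>the</word>
    <word>a</word>
    <word>is</word>
  </sw:stop>

  <xsl:variable name="stop-words"
                select="document('')/xsl:stylesheet/sw:stop/word" />
  <xsl:variable name="lowercase" select="'abcdefghijklmnopqrstuvwxyz'" />
  <xsl:variable name="uppercase" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'" />

  <xsl:template match="/">
    <!-- First pass: pair each title with a key built by a template.
         The variable holds a result tree fragment. -->
    <xsl:variable name="keyed">
      <xsl:for-each select="xmlfile/book/title">
        <entry>
          <key><xsl:call-template name="sort-key" /></key>
          <title><xsl:value-of select="." /></title>
        </entry>
      </xsl:for-each>
    </xsl:variable>
    <!-- Second pass: convert the fragment to a node set (assumed
         extension) and sort on the precomputed key -->
    <result>
      <xsl:for-each select="exsl:node-set($keyed)/entry">
        <xsl:sort select="key" />
        <title><xsl:value-of select="title" /></title>
      </xsl:for-each>
    </result>
  </xsl:template>

  <!-- Hypothetical helper: the current title minus any leading stop word -->
  <xsl:template name="sort-key">
    <xsl:variable name="begins-with"
      select="$stop-words[starts-with(
                translate(current(), $uppercase, $lowercase),
                concat(translate(., $uppercase, $lowercase), ' '))]" />
    <xsl:choose>
      <xsl:when test="$begins-with">
        <xsl:value-of
          select="substring(., string-length($begins-with) + 2)" />
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="." />
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

The point of the two passes is that the sort key is then ordinary element
content, so the xsl:sort select can stay a simple path.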

In the meantime, you have to use something really horrible like:

<xsl:template match="/">
  <result>
    <xsl:for-each select="xmlfile/book/title">
      <xsl:sort select="concat(
        substring(substring-after(., ' '),
                  0 div boolean($stop-words[starts-with(
                    translate(current(), $uppercase, $lowercase),
                    concat(translate(., $uppercase, $lowercase), ' '))])),
        substring(.,
                  0 div not($stop-words[starts-with(
                    translate(current(), $uppercase, $lowercase),
                    concat(translate(., $uppercase, $lowercase), ' '))])))" />
      <title><xsl:value-of select="." /></title>
    </xsl:for-each>
  </result>
</xsl:template>

(Honestly, it doesn't look that much clearer even when it *is* indented ;)

This works in SAXON, MSXML (July) and Xalan (with the exception of the
exclude-result-prefixes thing).
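
In case the '0 div' arithmetic in that select looks like magic, here is the
trick in isolation, as a minimal sketch (the string 'Hobbit' is just a
stand-in).  A true condition counts as 1, so 0 div 1 is 0 and substring()
returns the whole string.  A false condition counts as 0, so 0 div 0 is NaN
and substring() returns the empty string.  Concatenating one substring()
guarded by the test and one guarded by its negation therefore picks exactly
one of the two strings.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <xsl:template match="/">
    <!-- Condition true: start position is 0, so the string is kept -->
    <xsl:value-of
      select="concat('[', substring('Hobbit', 0 div boolean(true())), ']')" />
    <!-- Condition false: start position is NaN, so nothing is kept -->
    <xsl:value-of
      select="concat('[', substring('Hobbit', 0 div boolean(false())), ']')" />
  </xsl:template>
</xsl:stylesheet>
```

Run against any input document, this should output [Hobbit][].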

I hope that helps,

Jeni

Jeni Tennison
http://www.jenitennison.com/



