[xsl] Efficient way to check sequence membership

Subject: [xsl] Efficient way to check sequence membership
From: ht@xxxxxxxxxxxx (Henry S. Thompson)
Date: Wed, 02 Mar 2011 21:23:53 +0000
A common requirement for me is to check if a particular string is a
member of a set of other strings.  I often need this in an inner loop,
doing it hundreds if not thousands of times.

So, what's the most efficient way to do this?

Consider the example (a real one) of checking to see if a word is what
the IR people call a 'stop' word -- a short common word of little
substance.

Here's a function which checks this in a straightforward way:

<xsl:variable
name="stopPat">it|its|itself|they|them|their|what|which|who|whom|this|that|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|s|t|can|will|just|don|should|now</xsl:variable>

 <xsl:variable name="stops" select="tokenize($stopPat,'\|')"/>

 <xsl:function name="my:stop1" as="xs:boolean">
  <xsl:param name="w" as="xs:string"/>
  <xsl:sequence select="some $s in $stops satisfies ($s eq $w)"/>
 </xsl:function>

For obvious reasons this seems unlikely to be very efficient.

I've come up with 4 other ways to do this, and the best of them is
over 6 times faster than that one, using a realistic set of test
data.  The worst is more than twice as _slow_ as that one.  That's a
factor of 12 between best and worst!

Timing was done using saxon91 on a Windows box, using -repeat:100,
subtracting a baseline function which always returns true w/o checking
anything.  I called the stop function 289 times.  On 95 occasions the
argument was a stop word (drawn from only 23 word types).  There are
104 stop words in $stopPat.

I'll follow this message with another containing the details: skip it
if you don't care, or have a think about what _you_ think is the best
answer before reading.

ht
-- 
       Henry S. Thompson, School of Informatics, University of Edinburgh
      10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440
                Fax: (44) 131 651-1426, e-mail: ht@xxxxxxxxxxxx
                       URL: http://www.ltg.ed.ac.uk/~ht/
 [mail from me _always_ has a .sig like this -- mail without it is forged spam]

Current Thread