Re: [xsl] Efficient way to check sequence membership

From: "Imsieke, Gerrit, le-tex" <gerrit.imsieke@xxxxxxxxx>
Date: Wed, 02 Mar 2011 22:52:09 +0100
On 2011-03-02 22:23, Henry S. Thompson wrote:
> A common requirement for me is to check if a particular string is a
> member of a set of other strings.  I often need this in an inner loop,
> doing it hundreds if not thousands of times.
>
> So, what's the most efficient way to do this?
>
> Consider the example (a real one) of checking to see if a word is what
> the IR people call a 'stop' word -- a short common word of little
> substance.
>
> Here's a function which checks this in a straightforward way:
>
> <xsl:variable
> name="stopPat">it|its|itself|they|them|their|what|which|who|whom|this|that|these|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|doing|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|against|between|into|through|during|before|after|above|below|to|from|up|down|in|out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|all|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|than|too|very|s|t|can|will|just|don|should|now</xsl:variable>
>
> <xsl:variable name="stops" select="tokenize($stopPat,'\|')"/>
>
>   <xsl:function name="my:stop1" as="xs:boolean">
>    <xsl:param name="w" as="xs:string"/>
>    <xsl:sequence select="some $s in $stops satisfies ($s eq $w)"/>
>   </xsl:function>

It should be fairly efficient to use stopPat as a regex:

  <xsl:sequence select="matches( $w, concat('^(', $stopPat, ')$') )"/>
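
For illustration, here is a minimal self-contained sketch of the anchored-regex test. The function name my:stop2, the variable name stopRegex, the shortened stop list, the my namespace URI, and the "main" entry template are all placeholders of mine, not part of the original stylesheet. Building the regex string once in a global variable, rather than on every call, is my own assumption about what helps; whether it matters depends on the processor.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Shortened stop list, for the example only -->
  <xsl:variable name="stopPat">it|its|the|and|of</xsl:variable>

  <!-- Anchor the alternation so that only whole-string matches count -->
  <xsl:variable name="stopRegex" as="xs:string"
                select="concat('^(', $stopPat, ')$')"/>

  <!-- Hypothetical name: regex-based counterpart of my:stop1 -->
  <xsl:function name="my:stop2" as="xs:boolean">
    <xsl:param name="w" as="xs:string"/>
    <xsl:sequence select="matches($w, $stopRegex)"/>
  </xsl:function>

  <!-- Try it with an initial template, e.g. Saxon's -it:main option -->
  <xsl:template name="main">
    <xsl:for-each select="('the', 'theory', 'of', 'it')">
      <xsl:value-of select="concat(., ': ', my:stop2(.), '&#10;')"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The ^ and $ anchors are what keep 'theory' from matching on its leading 'the'. The match is case-sensitive unless you pass the 'i' flag as a third argument to matches().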

If you include boundary markers (\b isn't available, but you can use (^|\W) and (\W|$)) and use the replace function, you may be able to avoid splitting your input into words $w in the first place. That could make the regex approach even more efficient, depending on the underlying regex implementation, of course, but we're talking about Saxon on Java here, I think.
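
A rough sketch of that replace-based idea, under a few assumptions of mine: the function name my:strip-stops, the variable name stopInText, the shortened stop list, and the my namespace URI are placeholders, and the goal is assumed to be deleting stop words from a whole string. One caveat: each match consumes the delimiter on both sides, so of two adjacent stop words a single pass can only catch the first. The sketch therefore applies replace() twice, which should be enough because every word removed in the first pass leaves both of its delimiters behind.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.org/my"
    exclude-result-prefixes="xs my">

  <xsl:output method="text"/>

  <!-- Shortened stop list, for the example only -->
  <xsl:variable name="stopPat">it|its|is|the|and|of</xsl:variable>

  <!-- (^|\W) and (\W|$) stand in for the unavailable \b -->
  <xsl:variable name="stopInText" as="xs:string"
                select="concat('(^|\W)(', $stopPat, ')(\W|$)')"/>

  <!-- Hypothetical helper: drop stop words from running text -->
  <xsl:function name="my:strip-stops" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <!-- Keep the delimiters ($1 and $3), drop the stop word ($2) -->
    <xsl:variable name="pass1"
                  select="replace($text, $stopInText, '$1$3')"/>
    <!-- Second pass for stop words that shared a delimiter with a
         neighbour removed in the first pass -->
    <xsl:variable name="pass2"
                  select="replace($pass1, $stopInText, '$1$3')"/>
    <xsl:sequence select="normalize-space($pass2)"/>
  </xsl:function>

  <!-- Try it with an initial template, e.g. Saxon's -it:main option -->
  <xsl:template name="main">
    <xsl:value-of
        select="my:strip-stops('it is the theory of everything')"/>
  </xsl:template>

</xsl:stylesheet>
```

Whether this actually beats tokenizing and testing each word is something to measure on your own data.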

-Gerrit