Re: [xsl] Efficient way to check sequence membership

Subject: Re: [xsl] Efficient way to check sequence membership
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Wed, 2 Mar 2011 13:34:45 -0800
One way is to have the strings sorted. Then use binary search -- this
is O(log(N)).

There is a standard FXSL function for this:  f:binSearch and here is the
code:

 <xsl:function name="f:binSearch" as="xs:boolean">
   <xsl:param name="pSeq" as="item()+"/>
   <xsl:param name="pVal" as="item()"/>
   <xsl:param name="pStart" as="item()"/>
   <xsl:param name="pEnd" as="item()"/>

   <xsl:sequence select=
    "if($pStart gt $pEnd) then false()
       else
        for $mid in ($pStart + $pEnd) idiv 2,
             $sVal in $pSeq[$mid]
           return
             if($sVal eq $pVal) then true()
             else
                if($sVal lt $pVal)
                   then f:binSearch($pSeq, $pVal, $mid+1, $pEnd)
                else
                  f:binSearch($pSeq, $pVal, $pStart, $mid - 1)
       "/>
 </xsl:function>

I have used the f:binSearch() functions for years -- for example in an
efficient spelling checker in which the dictionary is a sequence of
sorted strings. Even 5 years ago it was working with a speed of 3000
words per second when checking Shakespeare's Othello.


Another way is using XPath 3.0 and the Binary Search tree datatype
that I recently implemented.

This is described (and with full code) here:

http://dnovatchev.wordpress.com/2011/02/12/part-2-the-binary-search-data-stru
cture-with-xpath-3-0-deleting-a-node/



Hope this helped :)



On Wed, Mar 2, 2011 at 1:23 PM, Henry S. Thompson <ht@xxxxxxxxxxxx> wrote:
> A common requirement for me is to check if a particular string is a
> member of a set of other strings. B I often need this in an inner loop,
> doing it hundreds if not thousands of times.
>
> So, what's the most efficient way to do this?
>
> Consider the example (a real one) of checking to see if a word is what
> the IR people call a 'stop' word -- a short common word of little
> substance.
>
> Here's a function which checks this in a straightforward way:
>
> <xsl:variable
>
name="stopPat">it|its|itself|they|them|their|what|which|who|whom|this|that|th
ese|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|do
ing|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|aga
inst|between|into|through|during|before|after|above|below|to|from|up|down|in|
out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|a
ll|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|th
an|too|very|s|t|can|will|just|don|should|now</xsl:variable>
>
> B <xsl:variable name="stops" select="tokenize($stopPat,'\|')"/>
>
> B <xsl:function name="my:stop1" as="xs:boolean">
> B <xsl:param name="w" as="xs:string"/>
> B <xsl:sequence select="some $s in $stops satisfies ($s eq $w)"/>
> B </xsl:function>
>
> For obvious reasons this seems unlikely to be very efficient.
>
> I've come up with 4 other ways to do this, and the best of them is
> over 6 times faster than that one, using a realistic set of test
> data. B The worst is more than twice as _slow_ as that one. B That's a
> factor of 12 between best and worst!
>
> Timing was done using saxon91 on a Windows box, using -repeat:100,
> subtracting a baseline function which always returns true w/o checking
> anything. B I called the stop function 289 times. B On 95 occasions the
> argument was a stop word (drawn from only 23 word types). B There are
> 104 stop words in $stopPat.
>
> I'll follow this message with another containing the details: skip it
> if you don't care, or have a think about what _you_ think is the best
> answer before reading.
>
> ht
> --
> B  B  B  Henry S. Thompson, School of Informatics, University of Edinburgh
> B  B  B 10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131
650-4440
> B  B  B  B  B  B  B  B Fax: (44) 131 651-1426, e-mail: ht@xxxxxxxxxxxx
> B  B  B  B  B  B  B  B  B  B  B  URL: http://www.ltg.ed.ac.uk/~ht/
> B [mail from me _always_ has a .sig like this -- mail without it is forged
spam]
>
>



--
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
Never fight an inanimate object
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play
-------------------------------------
Facts do not cease to exist because they are ignored.

Current Thread