|
Subject: Re: [xsl] Efficient way to check sequence membership From: Dimitre Novatchev <dnovatchev@xxxxxxxxx> Date: Wed, 2 Mar 2011 13:34:45 -0800 |
One way is to have the strings sorted. Then use binary search -- this
is O(log(N)).
There is a standard FXSL function for this: f:binSearch and here is the
code:
<xsl:function name="f:binSearch" as="xs:boolean">
<xsl:param name="pSeq" as="item()+"/>
<xsl:param name="pVal" as="item()"/>
<xsl:param name="pStart" as="item()"/>
<xsl:param name="pEnd" as="item()"/>
<xsl:sequence select=
"if($pStart gt $pEnd) then false()
else
for $mid in ($pStart + $pEnd) idiv 2,
$sVal in $pSeq[$mid]
return
if($sVal eq $pVal) then true()
else
if($sVal lt $pVal)
then f:binSearch($pSeq, $pVal, $mid+1, $pEnd)
else
f:binSearch($pSeq, $pVal, $pStart, $mid - 1)
"/>
</xsl:function>
I have used the f:binSearch() functions for years -- for example in an
efficient spelling checker in which the dictionary is a sequence of
sorted strings. Even 5 years ago it was working with a speed of 3000
words per second when checking Shakespeare's Othello.
Another way is using XPath 3.0 and the Binary Search tree datatype
that I recently implemented.
This is described (and with full code) here:
http://dnovatchev.wordpress.com/2011/02/12/part-2-the-binary-search-data-stru
cture-with-xpath-3-0-deleting-a-node/
Hope this helped :)
On Wed, Mar 2, 2011 at 1:23 PM, Henry S. Thompson <ht@xxxxxxxxxxxx> wrote:
> A common requirement for me is to check if a particular string is a
> member of a set of other strings. B I often need this in an inner loop,
> doing it hundreds if not thousands of times.
>
> So, what's the most efficient way to do this?
>
> Consider the example (a real one) of checking to see if a word is what
> the IR people call a 'stop' word -- a short common word of little
> substance.
>
> Here's a function which checks this in a straightforward way:
>
> <xsl:variable
>
name="stopPat">it|its|itself|they|them|their|what|which|who|whom|this|that|th
ese|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|do
ing|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|aga
inst|between|into|through|during|before|after|above|below|to|from|up|down|in|
out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|a
ll|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|th
an|too|very|s|t|can|will|just|don|should|now</xsl:variable>
>
> B <xsl:variable name="stops" select="tokenize($stopPat,'\|')"/>
>
> B <xsl:function name="my:stop1" as="xs:boolean">
> B <xsl:param name="w" as="xs:string"/>
> B <xsl:sequence select="some $s in $stops satisfies ($s eq $w)"/>
> B </xsl:function>
>
> For obvious reasons this seems unlikely to be very efficient.
>
> I've come up with 4 other ways to do this, and the best of them is
> over 6 times faster than that one, using a realistic set of test
> data. B The worst is more than twice as _slow_ as that one. B That's a
> factor of 12 between best and worst!
>
> Timing was done using saxon91 on a Windows box, using -repeat:100,
> subtracting a baseline function which always returns true w/o checking
> anything. B I called the stop function 289 times. B On 95 occasions the
> argument was a stop word (drawn from only 23 word types). B There are
> 104 stop words in $stopPat.
>
> I'll follow this message with another containing the details: skip it
> if you don't care, or have a think about what _you_ think is the best
> answer before reading.
>
> ht
> --
> B B B Henry S. Thompson, School of Informatics, University of Edinburgh
> B B B 10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131
650-4440
> B B B B B B B B Fax: (44) 131 651-1426, e-mail: ht@xxxxxxxxxxxx
> B B B B B B B B B B B URL: http://www.ltg.ed.ac.uk/~ht/
> B [mail from me _always_ has a .sig like this -- mail without it is forged
spam]
>
>
--
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
Never fight an inanimate object
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play
-------------------------------------
Facts do not cease to exist because they are ignored.
| Current Thread |
|---|
|
| <- Previous | Index | Next -> |
|---|---|---|
| [xsl] Efficient way to check sequen, Henry S. Thompson | Thread | Re: [xsl] Efficient way to check se, Henry S. Thompson |
| [xsl] Efficient way to check sequen, Henry S. Thompson | Date | Re: [xsl] Efficient way to check se, Imsieke, Gerrit, le- |
| Month |