Subject: Re: [xsl] Efficient way to check sequence membership From: Dimitre Novatchev <dnovatchev@xxxxxxxxx> Date: Wed, 2 Mar 2011 13:34:45 -0800 |
One way is to have the strings sorted. Then use binary search -- this is O(log(N)). There is a standard FXSL function for this: f:binSearch and here is the code: <xsl:function name="f:binSearch" as="xs:boolean"> <xsl:param name="pSeq" as="item()+"/> <xsl:param name="pVal" as="item()"/> <xsl:param name="pStart" as="item()"/> <xsl:param name="pEnd" as="item()"/> <xsl:sequence select= "if($pStart gt $pEnd) then false() else for $mid in ($pStart + $pEnd) idiv 2, $sVal in $pSeq[$mid] return if($sVal eq $pVal) then true() else if($sVal lt $pVal) then f:binSearch($pSeq, $pVal, $mid+1, $pEnd) else f:binSearch($pSeq, $pVal, $pStart, $mid - 1) "/> </xsl:function> I have used the f:binSearch() functions for years -- for example in an efficient spelling checker in which the dictionary is a sequence of sorted strings. Even 5 years ago it was working with a speed of 3000 words per second when checking Shakespeare's Othello. Another way is using XPath 3.0 and the Binary Search tree datatype that I recently implemented. This is described (and with full code) here: http://dnovatchev.wordpress.com/2011/02/12/part-2-the-binary-search-data-stru cture-with-xpath-3-0-deleting-a-node/ Hope this helped :) On Wed, Mar 2, 2011 at 1:23 PM, Henry S. Thompson <ht@xxxxxxxxxxxx> wrote: > A common requirement for me is to check if a particular string is a > member of a set of other strings. B I often need this in an inner loop, > doing it hundreds if not thousands of times. > > So, what's the most efficient way to do this? > > Consider the example (a real one) of checking to see if a word is what > the IR people call a 'stop' word -- a short common word of little > substance. > > Here's a function which checks this in a straightforward way: > > <xsl:variable > name="stopPat">it|its|itself|they|them|their|what|which|who|whom|this|that|th ese|those|am|is|are|was|were|be|been|being|have|has|had|having|do|does|did|do ing|a|an|the|and|but|if|or|because|as|until|while|of|at|by|for|with|about|aga inst|between|into|through|during|before|after|above|below|to|from|up|down|in| out|on|off|over|under|again|further|then|once|here|there|when|where|why|how|a ll|any|both|each|few|more|most|other|some|such|no|nor|not|only|own|same|so|th an|too|very|s|t|can|will|just|don|should|now</xsl:variable> > > B <xsl:variable name="stops" select="tokenize($stopPat,'\|')"/> > > B <xsl:function name="my:stop1" as="xs:boolean"> > B <xsl:param name="w" as="xs:string"/> > B <xsl:sequence select="some $s in $stops satisfies ($s eq $w)"/> > B </xsl:function> > > For obvious reasons this seems unlikely to be very efficient. > > I've come up with 4 other ways to do this, and the best of them is > over 6 times faster than that one, using a realistic set of test > data. B The worst is more than twice as _slow_ as that one. B That's a > factor of 12 between best and worst! > > Timing was done using saxon91 on a Windows box, using -repeat:100, > subtracting a baseline function which always returns true w/o checking > anything. B I called the stop function 289 times. B On 95 occasions the > argument was a stop word (drawn from only 23 word types). B There are > 104 stop words in $stopPat. > > I'll follow this message with another containing the details: skip it > if you don't care, or have a think about what _you_ think is the best > answer before reading. > > ht > -- > B B B Henry S. Thompson, School of Informatics, University of Edinburgh > B B B 10 Crichton Street, Edinburgh EH8 9AB, SCOTLAND -- (44) 131 650-4440 > B B B B B B B B Fax: (44) 131 651-1426, e-mail: ht@xxxxxxxxxxxx > B B B B B B B B B B B URL: http://www.ltg.ed.ac.uk/~ht/ > B [mail from me _always_ has a .sig like this -- mail without it is forged spam] > > -- Cheers, Dimitre Novatchev --------------------------------------- Truly great madness cannot be achieved without significant intelligence. --------------------------------------- To invent, you need a good imagination and a pile of junk ------------------------------------- Never fight an inanimate object ------------------------------------- You've achieved success in your field when you don't know whether what you're doing is work or play ------------------------------------- Facts do not cease to exist because they are ignored.
Current Thread |
---|
|
<- Previous | Index | Next -> |
---|---|---|
[xsl] Efficient way to check sequen, Henry S. Thompson | Thread | Re: [xsl] Efficient way to check se, Henry S. Thompson |
[xsl] Efficient way to check sequen, Henry S. Thompson | Date | Re: [xsl] Efficient way to check se, Imsieke, Gerrit, le- |
Month |