Re: [xsl] why matches($title,'.*?(\.|,)\s*$')) can perform so much worse than matches($title,'(\.|,)\s*$'))

Subject: Re: [xsl] why matches($title,'.*?(\.|,)\s*$')) can perform so much worse than matches($title,'(\.|,)\s*$'))
From: Liam R E Quin <liam@xxxxxx>
Date: Wed, 13 Jul 2011 19:12:30 -0400
O
> Another interesting article is this one describing some of the
> optimizations performed by the regex engine in Google Chrome:
> http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html
> 
> This mentions another trick used by some regex implementations.  In
> their example "Sun|Mon", their engine recognises that a match for this
> expression always contains "n" in the third character, and so rather
> than testing for a match at each index in the string (which was the
> problem with the example given) they first scan the string to find "n"
> characters and only try to apply the regex starting two characters
> preceding one.

This is similar to the Boyer-Moore algorithm that Perl (and probably
pcre) also uses - originally it was just for constant strings. There was
a fun race to get the fastest "grep" in the early 1990s.

The difficulty with a lot of these optimizations is that capturing
groups can make them harder - and in Perl people use capturing groups
all the time of course. You can't turn (font|foot) into fo(nt|ot), and
although you can turn it into (fo(?:nt|ot)) you may then lose the
benefit.

Since the article Mike Key mentioned was written in 2007, Perl's regexps
seem to have got a lot faster. But even in the early 1990s I remember
converting sed scripts into perl and getting huge speedup, even though
Perl was using recursive back-tracking. The BM-style delta tables
probably accounted for most of that.

Liam


-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/

Current Thread