Re: [xsl] why matches($title,'.*?(\.|,)\s*$')) can perform so much worse than matches($title,'(\.|,)\s*$'))

From: Oliver Hallam <oliver@xxxxxxxxxxx>
Date: Wed, 13 Jul 2011 18:43:55 +0100
That is interesting.  I was aware that there are some very smart regex
engines out there, but wasn't aware that they had made it to any
XQuery/XSLT processors yet.

Another interesting article is this one describing some of the
optimizations performed by the regex engine in Google Chrome:
http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html

This mentions another trick used by some regex implementations.  In
their example "Sun|Mon", their engine recognises that any match for this
expression has "n" as its third character, and so rather than testing
for a match at each index in the string (which was the problem with the
example given) they first scan the string for "n" characters and only
try to apply the regex starting two characters before each one.  I would
not be at all surprised if they also recognized that a regex beginning
with .* need only be applied starting at the first character.

Oliver
XQSharp


On 13/07/2011 15:13, Michael Kay wrote:

It would be perfectly valid (and sensible) for a query processor to realise that the two expressions you gave were equivalent and so not perform n^2 tests, but I am unaware of a processor that makes these kinds of optimizations to regular expressions.
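
As a quick sanity check of the equivalence claim, the sketch below runs
both patterns from the subject line over a few sample strings.  It uses
Python's re.search as a stand-in for the unanchored behaviour of
matches(); Python's regex dialect differs from the XPath one in some
details, and the helper names and sample titles are invented for the
example.

```python
import re

# The two patterns from the subject line.
WITH_PREFIX = re.compile(r".*?(\.|,)\s*$")
WITHOUT_PREFIX = re.compile(r"(\.|,)\s*$")

# Hypothetical sample titles, chosen to cover both outcomes.
SAMPLES = [
    "A title.",
    "A title,  ",
    "No punctuation",
    "Dot. in the middle",
    "",
    ".",
]


def ends_with_punctuation(title, pattern):
    """True if the pattern is found anywhere in title (unanchored search)."""
    return pattern.search(title) is not None


def patterns_agree(samples=SAMPLES):
    """True if both patterns give the same answer for every sample."""
    return all(
        ends_with_punctuation(s, WITH_PREFIX)
        == ends_with_punctuation(s, WITHOUT_PREFIX)
        for s in samples
    )
```

The leading .*? can always match the empty string, so whenever the
shorter pattern matches at some position the longer one matches there
too, and vice versa; the prefix only changes how much work a
backtracking engine may do, not the result.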

Actually I've heard it said that there's a wide variation between different regex engines in how well they handle this kind of thing. See for example here:


http://swtch.com/~rsc/regexp/regexp1.html

The article at

http://eyalsch.wordpress.com/2009/05/21/regex/

is also useful.

Michael Kay
Saxonica
