Subject: Re: [xsl] why matches($title,'.*?(\.|,)\s*$')) can perform so much worse than matches($title,'(\.|,)\s*$'))
From: Oliver Hallam <oliver@xxxxxxxxxxx>
Date: Wed, 13 Jul 2011 18:43:55 +0100
That is interesting. I was aware that there are some very smart regex engines out there, but wasn't aware that they had made it to any XQuery/XSLT processors yet.
Another interesting article is this one describing some of the optimizations performed by the regex engine in Google Chrome: http://blog.chromium.org/2009/02/irregexp-google-chromes-new-regexp.html
This mentions another trick used by some regex implementations. In their example "Sun|Mon", their engine recognises that any match for this expression always has "n" as its third character, and so rather than testing for a match at each index in the string (which was the problem with the example given) they first scan the string for "n" characters and only try to apply the regex starting two characters before each one. I would not be at all surprised if they also recognised that a regex beginning with .* need only be applied at the first character.
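To make the "Sun|Mon" trick concrete, here is a rough sketch in Python. It is only my own illustration of the idea and not how irregexp is actually implemented; the function name search_with_prefilter is made up for the example.

```python
import re

PATTERN = re.compile(r"Sun|Mon")


def search_with_prefilter(text):
    """Find the leftmost match of Sun|Mon, trying the regex only where it could succeed.

    Every match is three characters long and ends in "n", so we scan for "n"
    and attempt the regex only at the position two characters before it.
    """
    # A match needs two characters before the "n", so start scanning at index 2.
    pos = text.find("n", 2)
    while pos != -1:
        # match() anchors the attempt at pos - 2 instead of sliding along the string.
        m = PATTERN.match(text, pos - 2)
        if m:
            return m
        pos = text.find("n", pos + 1)
    return None
```

The plain substring scan is cheap, and the regex itself only runs at the few positions that pass the filter, rather than once per index.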
It would be perfectly valid (and sensible) for a query processor to realise that the two expressions you gave were equivalent and so not perform n^2 tests, but I am unaware of a processor that makes these kinds of optimizations to regular expressions.
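As a sanity check on the equivalence, here is a small Python sketch. It uses Python's re module rather than the XPath regex dialect, and it assumes the title contains no newline characters (otherwise "." and "$" behave differently and the argument needs more care). The function names and sample titles are invented for the illustration.

```python
import re

WITH_PREFIX = re.compile(r".*?(\.|,)\s*$")
WITHOUT_PREFIX = re.compile(r"(\.|,)\s*$")


def matches_without_prefix(title):
    # Unanchored search, like matches($title, '(\.|,)\s*$').
    return WITHOUT_PREFIX.search(title) is not None


def matches_with_prefix(title):
    # Unanchored search, like matches($title, '.*?(\.|,)\s*$').
    # A naive engine retries the whole pattern from every start index.
    return WITH_PREFIX.search(title) is not None


def matches_with_prefix_anchored(title):
    # The leading .*? can absorb any prefix, so one attempt at index 0 is enough.
    return WITH_PREFIX.match(title) is not None


SAMPLES = ["A title.", "A title,  ", "No punctuation", "a.b", "", "Dots... inside", ","]

for title in SAMPLES:
    results = {
        matches_without_prefix(title),
        matches_with_prefix(title),
        matches_with_prefix_anchored(title),
    }
    # All three formulations should agree on every sample.
    print(repr(title), results)
```

On a title that does not end in "." or ",", the unanchored search with the .*? prefix fails at index 0 only after scanning to the end of the string, and a naive engine then repeats that scan from index 1, 2, and so on, which is where the n^2 behaviour comes from. The anchored variant makes a single attempt.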
Actually I've heard it said that there's a wide variation between different regex engines in how well they handle this kind of thing. See for example here:
The article at
is also useful.
Michael Kay Saxonica