Re: [xsl] why matches($title,'.*?(\.|,)\s*$')) can perform so much worse than matches($title,'(\.|,)\s*$'))

Subject: Re: [xsl] why matches($title,'.*?(\.|,)\s*$')) can perform so much worse than matches($title,'(\.|,)\s*$'))
From: Alex Muir <alex.g.muir@xxxxxxxxx>
Date: Wed, 13 Jul 2011 12:55:16 +0000
Thanks Oliver, that makes a lot of sense.

Regards

On Wed, Jul 13, 2011 at 12:33 PM, Oliver Hallam <oliver@xxxxxxxxxxx> wrote:
> The way that fn:matches usually works is that it attempts to apply the
> regular expression starting from each character in the string.
>
> So, looking at the regular expression ".*?Expr" where Expr is another
> regular expression:
>
> .*? is a reluctant match and so we try to match the shortest expression
> possible first.
>
> Starting from the first character:
>  Try match the empty string for .*? and then try to mach Expr starting from
> the first character.
>  Try to match a single character for .*? and then try to match Expr
starting
> from the second character.
>  Try to match two characters for .*? and then try to match Expr starting
> from the third character.
>  ...
>
> Thus in trying to find a match for ".*?Expr" starting at a particular
> character we try to match Expr starting at every other caharcter in the
> string, until a match is found.
>
> fn:matches attempts to find a match for the regular expression starting
from
> each character in the string.  If a match isnt found starting at the first
> character then it will try again starting at the second and so on.  So if
> there are no matches then Expr is tested n^2 times where n is the number of
> characters in the string, wheras matches applied to Expr would just test it
> n times.
>
> Now, if your query was (note the '^'):
>
> matches($title,'^.*?(\.|,)\s*$')
>
> then I would expect perfomance to be very similar to:
>
> matches($title,'(\.|,)\s*$')
>
>
> It would be perfectly valid (and sensible) for a query processor to realise
> that the two expressions you gave were equivalent and so not perform n^2
> tests, but I am unaware of a processor that makes these kinds of
> optimizations to regular expressions.
>
>
> Oliver Hallam
> XQSharp
>
>
> On 12/07/2011 17:26, Alex Muir wrote:
>>
>> Hi,
>>
>> I'm wondering why matches($title,'.*?(\.|,)\s*$')) can perform so much
>> worse than matches($title,'(\.|,)\s*$'))
>>
>> I found at least in one file out of thousands that I process the first
>> one can take a good 30 minutes to complete and the second is quick.
>>
>> No doubt the .*? causes some problem but what exactly is the problem
>> that is causes?
>>
>> Regards
>>
>> --
>> Alex Muir
>> Instructor | Program Organizer - University Technology Student Work
>> Experience Building
>> University of the Gambia
>> http://sites.utg.edu.gm/alex/
>>
>> Low budget software development benefiting development in the Gambia,
>> West Africa
>> Experience of a lifetime, come to Gambia and Join UTSWEB -
>> http://sites.utg.edu.gm/utsweb/
>
>



--
Alex Muir
Instructor | Program Organizer - University Technology Student Work
Experience Building
University of the Gambia
http://sites.utg.edu.gm/alex/

Low budget software development benefiting development in the Gambia,
West Africa
Experience of a lifetime, come to Gambia and Join UTSWEB -
http://sites.utg.edu.gm/utsweb/

Current Thread