Re: [xsl] Why is the variable and regex slow in saxon and fast in regex Buddy?

Subject: Re: [xsl] Why is the variable and regex slow in saxon and fast in regex Buddy?
From: Alex Muir <alex.g.muir@xxxxxxxxx>
Date: Wed, 29 Sep 2010 13:29:08 +0000
Thanks Wolfgang,

I tried that regex in RegexBuddy; it works, and it takes
fewer steps in both matching and non-matching cases.

^[A-Z][A-Za-z]+(\s+[A-Z][A-Za-z]*)*$

Thanks
Alex

On Wed, Sep 29, 2010 at 10:10 AM, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
> On 28 September 2010 22:43, Alex Muir <alex.g.muir@xxxxxxxxx> wrote:
>>
>> Well, it turns out the problem was a combination of factors, but the
>> main cause was the following regex, which, depending on the input
>> produced by the other 4 variables, would either fail to terminate or
>> run fast or slow. What confused me most was that it ran quickly for
>> most of the files I process, and that removing one variable or another
>> led to improvements purely by chance, given the input files.
>>
>> and matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]{0,}\s*?)+$')
>
> This is really evil, as it will backtrack exponentially. The "\s*"
> isn't providing separation between words; a "\s+" should do the trick.
> "{0,}" isn't wrong, but why not use "*"?
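The exponential behaviour can be made concrete without running the slow regex at all. The sketch below is only an illustration in Python (not the XPath or Java engine Saxon uses), and `count_splits` is a hypothetical helper name: it counts how many ways an unbroken run of letters can be divided into chunks of the form [A-Z][A-Za-z]* when nothing is required between them, which is what the optional white space in the original pattern permits. On a failing input, a backtracking engine may end up trying each of these splits before giving up.

```python
import string
from functools import lru_cache


def count_splits(run: str) -> int:
    """Count the ways `run` can be cut into consecutive chunks that each
    match [A-Z][A-Za-z]*, with no white space required in between."""

    @lru_cache(maxsize=None)
    def ways(start: int) -> int:
        if start == len(run):
            return 1  # consumed everything: one complete way to split
        if run[start] not in string.ascii_uppercase:
            return 0  # every chunk must begin with a capital letter
        total = 0
        end = start + 1
        while True:
            total += ways(end)  # close the chunk here, start the next one
            if end < len(run) and run[end] in string.ascii_letters:
                end += 1  # or let [A-Za-z]* swallow one more letter
            else:
                return total

    return ways(0)


for n in (4, 10, 20):
    print(n, count_splits("A" * n))  # 8, 512, 524288 splits respectively
```

Each additional capital letter in the run doubles the count, whereas with a mandatory "\s+" between words there is only one way to divide the string.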
>
>>
>>
>> I wrote this instead
>>
>>   and (matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]+\s+)+?([A-Z][A-Za-z]+?)$')
>>            or matches($titleStopWordsRemoved,'^[A-Z][A-Za-z]+\s*$'))">
>>
>
> I don't see the point of using two expressions, or "+?".
>
> To match a string consisting entirely of capitalized words separated by
> white space:
>
>    ^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*$
>
> You may add \s* at the end to handle optional trailing white space.
>
> -W
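A quick way to sanity-check the suggested pattern, including the optional trailing "\s*", is sketched below. It uses Python's `re` module as a stand-in for XPath's matches(), so small differences in regex dialect are possible; `is_capitalized_title` and the sample strings are made up for illustration.

```python
import re

# Wolfgang's pattern plus the optional trailing \s* he mentions.
# re.ASCII keeps \s close to the XPath definition (space, tab, CR, LF).
CAPITALIZED_WORDS = re.compile(
    r"^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*\s*$", re.ASCII
)


def is_capitalized_title(text: str) -> bool:
    """True if text consists only of capitalized words separated by white space."""
    return CAPITALIZED_WORDS.search(text) is not None


print(is_capitalized_title("Quick Brown Fox"))   # True
print(is_capitalized_title("Quick brown Fox"))   # False
print(is_capitalized_title("A" * 5000 + "1"))    # False, and it fails quickly
```

Because each word must be followed by at least one white space character before the next word starts, there is only one way to divide the input, so a non-matching string is rejected after roughly one pass.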
>
>> The first one looks for title or upper case words and the second just
>> one word.
>>
>>
>> I see now that the profiler output makes that clear, given that
>>
>> 99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")
>>
>> takes so long and the calls below it take so little time.
>>
>>
>> 99.92 % - 3 ms - 0.03 % - 1 inv. xsl:template (match="chunk")
>>   99.89 % - 0 ms - 0.0 % - 1 inv. let (name="title")
>>     99.89 % - 0 ms - 0.0 % - 1 inv. let (name="titleBraketedTextRemoved")
>>       99.89 % - 2 ms - 0.02 % - 1 inv. let (name="titleNumberRemoved")
>>         99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStripPunctuation")
>>           99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStopWordsRemoved")
>>             99.86 % - 0 ms - 0.0 % - 1 inv. xsl:choose
>>               99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")
>>                 0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="normalize-space")
>>                   0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:removeStopwords")
>>                     0.15 % - 0 ms - 0.0 % - 1 inv. xsl:function (name="mh:removeStopwords") (as="xs:string?")
>>                     0.0 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:stripPunctuation")
>>               0.02 % - 0 ms - 0.01 % - 1 inv. noMatch
>>               0.01 % - 0 ms - 0.0 % - 1 inv. function-call (name="not")
>>         0.02 % - 0 ms - 0.0 % - 1 inv. function-call (name="replace")
>> 0.03 % - 3 ms - 0.03 % - 1 inv. xsl:variable (name="stopwords")
>>   (select=" ('a', 'an', 'and', 'is', 'as', 'at', 'be', 'been', 'before',
>>   'between', 'both', 'but', 'by', 'for', 'from', 'in', 'into', 'of', 'on', 'or',
>>   'other', 'per', 'such ', 'than', 'that', 'the', 'these', 'this', 'to' , 'Q')")
>>
>> Thanks Much
>> Alex
>>
>>
>> On Tue, Sep 28, 2010 at 4:43 PM, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
>> > Two comments, which may not shed any light on the non-termination, but
>> > anyway:
>> >
>> > First, the pattern "\([^\)]*\)" is supposed to remove any
>> > parenthesized text, but there's no point in using "[^\)]", since the
>> > set of "any character except ')'" is simply denoted by "[^)]", because
>> > a parenthesis is not a meta-character within brackets.
>> >
>> > Second, to remove all characters of a kind (single character or class),
>> > it's better form to use a repetition, e.g., "\d+" rather than just "\d".
>> >
>> > -W
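Both comments can be checked with a small sketch. It uses Python's `re.sub` as a stand-in for XPath's replace(), so this is an illustration rather than the stylesheet's actual code; the helper names and the sample title are invented.

```python
import re


def remove_bracketed(text: str, pattern: str = r"\([^)]*\)") -> str:
    """Delete every '(...)' span, as replace($title, pattern, '') would."""
    return re.sub(pattern, "", text)


def remove_digits(text: str, pattern: str = r"\d+") -> str:
    """Delete digits; with the default pattern each run of digits is one match."""
    return re.sub(pattern, "", text)


title = "Annual Report (draft 2) 2010 Summary"  # made-up sample title

# Escaped and unescaped ')' inside the brackets give the same result.
assert remove_bracketed(title, r"\([^\)]*\)") == remove_bracketed(title)

no_brackets = remove_bracketed(title)
# \d and \d+ remove the same characters; \d+ just needs fewer matches.
assert remove_digits(no_brackets, r"\d") == remove_digits(no_brackets)
print(len(re.findall(r"\d", no_brackets)), len(re.findall(r"\d+", no_brackets)))  # 4 1
```

The results are identical either way, so these two points are about form and the number of replacements performed, not about the non-termination.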
>> >
>> >
>> > On 28 September 2010 14:44, Alex Muir <alex.g.muir@xxxxxxxxx> wrote:
>> >> Hi,
>> >>
>> >> I found something quite interesting which may help further understand
>> >> the issue.
>> >>
>> >> Independently, none of the following variables takes long to process:
>> >> when I no longer chain the variables together, but run the template
>> >> calling only one variable with the others commented out, the time to
>> >> run is short.
>> >>
>> >>   <xsl:variable name="title"
>> >>       select="mh:stripTextNewline(normalize-space(.))"/>
>> >>
>> >>     <xsl:variable name="titleBraketedTextRemoved"
>> >>       select="replace($title,'\([^\)]*\)','')"/>
>> >>
>> >>     <xsl:variable name="titleNumberRemoved"
>> >>       select="replace($titleBraketedTextRemoved,'\d','')"/>
>> >>
>> >>     <xsl:variable name="titleStripPunctuation"
>> >>       select="mh:stripPunctuation($titleNumberRemoved)"/>
>> >>
>> >>     <xsl:variable name="titleStopWordsRemoved"
>> >>       select="normalize-space(mh:removeStopwords($titleStripPunctuation,$stopwords))"/>
>> >>
>> >> As the variables are combined, they take more and more time to
>> >> execute, and with all of them together the transform never finishes.
>> >>
>> >> So initially I was wrong to suggest that the titleBraketedTextRemoved
>> >> variable was causing the problem. It's just that the problem is
>> >> exacerbated when I finally add that variable into the chain of
>> >> variables.
>> >>
>> >> I reduced the size of the input file so that $title contains one
>> >> small line of text, in order to get an idea from the profiling; however,
>> >> the processing does not complete.
>> >>
>> >> I'll have to talk to my client later today before posting the full
>> >> code.
>> >>
>> >> Thanks
>> >> Alex
>> >>
>> >> On Mon, Sep 27, 2010 at 7:54 PM, Michael Kay <mike@xxxxxxxxxxxx> wrote:
>> >>>  I don't know - they are both, I think, using the Java regular expression
>> >>> engine underneath. It may be a function of how you are measuring it. It
>> >>> could be that the cost is dominated not by the cost of evaluating the regex,
>> >>> but by the cost of checking that it conforms to the XPath rules. Did you run
>> >>> a Java profile to determine where the time is being spent?
>> >>>
>> >>> Michael Kay
>> >>> Saxonica
>> >>>
>> >>> On 27/09/2010 7:21 PM, Alex Muir wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I'm unable to figure out why this regex is so time-consuming that it
>> >>>> does not finish in oXygen but works quickly in RegexBuddy on the same
>> >>>> content.
>> >>>>
>> >>>>     <xsl:variable name="BraketedTextRemoved"
>> >>>>        select="replace($title,'\([^\)]*\)','')"/>
>> >>>>
>> >>>> I'm just trying to remove content with brackets ( dfd234**#*$#*$#fdfd )
>> >>>>
>> >>>> Running on vendor="SAXON 9.2.0.6 from Saxonica" version="2.0"
>> >>>>
>> >>>> Any Ideas?
>> >>>>
>> >>>> Thanks
>> >>>> Alex
