Re: [xsl] Why is the variable and regex slow in saxon and fast in regex Buddy?

Subject: Re: [xsl] Why is the variable and regex slow in saxon and fast in regex Buddy?
From: Alex Muir <alex.g.muir@xxxxxxxxx>
Date: Tue, 28 Sep 2010 20:43:30 +0000
Well, it turns out the problem was a combination of factors, but the main
culprit was the following regex which, depending on the input produced by
the other 4 variables, would fail to terminate, run fast, or run slowly.
What confused me the most was that it ran quickly for most of the files I
process, and that removing one variable or another led to improvements
purely by chance, given the input files.

and matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]{0,}\s*?)+$')


I wrote this instead

   and (matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]+\s+)+?([A-Z][A-Za-z]+?)$')
        or matches($titleStopWordsRemoved,'^[A-Z][A-Za-z]+\s*$'))">

The first pattern looks for two or more title-case or upper-case words and
the second for just one such word.
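
A minimal sketch of why the original pattern is so sensitive to its input. It uses Python's `re` module purely as a stand-in for the Java regex engine underneath Saxon (an assumption for illustration; both are backtracking engines, but timings will differ). The patterns are copied from this message; the helper names are hypothetical.

```python
import re
import time

# Patterns copied from the message. Python's `re` stands in for the Java
# regex engine used by Saxon (an assumption made for illustration only).
ORIGINAL = re.compile(r'^([A-Z][A-Za-z]{0,}\s*?)+$')
REWRITE_MANY = re.compile(r'^([A-Z][A-Za-z]+\s+)+?([A-Z][A-Za-z]+?)$')
REWRITE_ONE = re.compile(r'^[A-Z][A-Za-z]+\s*$')


def original_matches(text):
    """The single test used before the rewrite."""
    return ORIGINAL.search(text) is not None


def rewritten_matches(text):
    """Mirror of the `matches(...) or matches(...)` test after the rewrite."""
    return (REWRITE_MANY.search(text) is not None
            or REWRITE_ONE.search(text) is not None)


def seconds_taken(check, text):
    """Wall-clock time of one call; hypothetical helper for the demo."""
    start = time.perf_counter()
    check(text)
    return time.perf_counter() - start


if __name__ == "__main__":
    # A run of capitals followed by a character neither pattern accepts.
    # The original pattern can split the run between group iterations in
    # many ways, so the time to reject grows quickly with the run length.
    for n in (10, 14, 18):
        text = "A" * n + "!"
        print(n,
              f"original: {seconds_taken(original_matches, text):.4f}s",
              f"rewritten: {seconds_taken(rewritten_matches, text):.4f}s")
```

In the rewrite every repeated word must end in whitespace, so there is only one way to divide the string into words and a failing match is rejected quickly. One difference in this sketch: the rewrite requires at least two letters per word, so a title such as "Vitamin C" is accepted by the original pattern but not by the rewritten pair.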


The profiler output now makes that clear, given that

99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")

takes so long while the calls below it take so little time.


99.92 % - 3 ms - 0.03 % - 1 inv. xsl:template (match="chunk")
  99.89 % - 0 ms - 0.0 % - 1 inv. let (name="title")
    99.89 % - 0 ms - 0.0 % - 1 inv. let (name="titleBraketedTextRemoved")
      99.89 % - 2 ms - 0.02 % - 1 inv. let (name="titleNumberRemoved")
        99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStripPunctuation")
          99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStopWordsRemoved")
            99.86 % - 0 ms - 0.0 % - 1 inv. xsl:choose
              99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")
                0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="normalize-space")
                  0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:removeStopwords")
                    0.15 % - 0 ms - 0.0 % - 1 inv. xsl:function (name="mh:removeStopwords") (as="xs:string?")
                    0.0 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:stripPunctuation")
              0.02 % - 0 ms - 0.01 % - 1 inv. noMatch
              0.01 % - 0 ms - 0.0 % - 1 inv. function-call (name="not")
        0.02 % - 0 ms - 0.0 % - 1 inv. function-call (name="replace")
0.03 % - 3 ms - 0.03 % - 1 inv. xsl:variable (name="stopwords") (select="
('a', 'an', 'and', 'is', 'as', 'at', 'be', 'been', 'before', 'between',
'both', 'but', 'by', 'for', 'from', 'in', 'into', 'of', 'on', 'or', 'other',
'per', 'such ', 'than', 'that', 'the', 'these', 'this', 'to' , 'Q')")

Thanks Much
Alex


On Tue, Sep 28, 2010 at 4:43 PM, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
> Two comments, which may not shed any light on the non-termination, but
> anyway:
>
> First, the pattern "\([^\)]*\)" is supposed to remove any parenthesized
> text, but there's no point in using "[^\)]" since the set of "any
> character except ')'" is simply denoted by "[^)]", because a parenthesis
> is not a meta-character within brackets.
>
> Second, to remove all characters of a kind (single character or class)
> it's better form to use a repetition, e.g., "\d+" rather than just "\d".
>
> -W
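
Both comments can be checked with a small sketch. It again uses Python's `re` as a stand-in for the XPath `replace()` function (an assumption for illustration); the sample title and the function names are hypothetical and not taken from the thread.

```python
import re


def strip_parenthesized(text, escaped=True):
    """Remove "( ... )" groups, with or without the redundant escape."""
    pattern = r'\([^\)]*\)' if escaped else r'\([^)]*\)'
    return re.sub(pattern, '', text)


def strip_digits(text, as_run=True):
    """Remove digits one at a time or one run at a time."""
    pattern = r'\d+' if as_run else r'\d'
    return re.sub(pattern, '', text)


if __name__ == "__main__":
    title = "Heart Failure (HF) in 2010 (revised 2nd ed.)"
    # Both spellings of the character class remove the same text.
    print(repr(strip_parenthesized(title, escaped=True)))
    print(repr(strip_parenthesized(title, escaped=False)))
    # Both digit patterns give the same result; "\d+" needs one
    # replacement per run of digits instead of one per digit.
    print(repr(strip_digits(title, as_run=True)))
    print(repr(strip_digits(title, as_run=False)))
```

The results are identical in each pair, so these changes are a matter of form and do not address the non-termination.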
>
>
> On 28 September 2010 14:44, Alex Muir <alex.g.muir@xxxxxxxxx> wrote:
>> Hi,
>>
>> I found something quite interesting which may help further understand the
>> issue.
>>
>> Independently, none of the following variables takes long to process:
>> when I no longer chain the variables together but just run the template
>> calling only one variable, with the others commented out, the time to
>> run is short.
>>
>>   <xsl:variable name="title"
>>       select="mh:stripTextNewline(normalize-space(.))"/>
>>
>>     <xsl:variable name="titleBraketedTextRemoved"
>>       select="replace($title,'\([^\)]*\)','')"/>
>>
>>     <xsl:variable name="titleNumberRemoved"
>>       select="replace($titleBraketedTextRemoved,'\d','')"/>
>>
>>     <xsl:variable name="titleStripPunctuation"
>>       select="mh:stripPunctuation($titleNumberRemoved)"/>
>>
>>     <xsl:variable name="titleStopWordsRemoved"
>>       select="normalize-space(mh:removeStopwords($titleStripPunctuation,$stopwords))"/>
>>
>> As the variables are combined they take more and more time to
>> execute, and finally, with all of them together, the transformation
>> does not stop running.
>>
>> So initially I was wrong to suggest that the titleBraketedTextRemoved
>> variable was causing the problem. It's just that the problem is
>> exacerbated when I finally add that variable into the chain of
>> variables.
>>
>> I reduced the size of the input file so that $title contains one
>> small line of text, in order to get some profiling data; however,
>> the processing still does not complete.
>>
>> I'll have to talk to my client later today before posting the full code.
>>
>> Thanks
>> Alex
>>
>>
>> On Mon, Sep 27, 2010 at 7:54 PM, Michael Kay <mike@xxxxxxxxxxxx> wrote:
>>> I don't know - they are both, I think, using the Java regular expression
>>> engine underneath. It may be a function of how you are measuring it. It
>>> could be that the cost is dominated not by the cost of evaluating the
>>> regex, but by the cost of checking that it conforms to the XPath rules.
>>> Did you run a Java profile to determine where the time is being spent?
>>>
>>> Michael Kay
>>> Saxonica
>>>
>>> On 27/09/2010 7:21 PM, Alex Muir wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm unable to figure out why this regex is so time-consuming that it
>>>> does not finish in oXygen, yet works quickly in RegexBuddy on the
>>>> same content.
>>>>
>>>>     <xsl:variable name="BraketedTextRemoved"
>>>>        select="replace($title,'\([^\)]*\)','')"/>
>>>>
>>>> I'm just trying to remove content with brackets ( dfd234**#*$#*$#fdfd )
>>>>
>>>> Running on vendor="SAXON 9.2.0.6 from Saxonica" version="2.0"
>>>>
>>>> Any Ideas?
>>>>
>>>> Thanks
>>>> Alex
