Re: [xsl] Why is the variable and regex slow in saxon and fast in regex Buddy?

Subject: Re: [xsl] Why is the variable and regex slow in saxon and fast in regex Buddy?
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Tue, 28 Sep 2010 18:43:44 +0200
Two comments, which may not shed any light on the non-termination, but
anyway:

First, the pattern "\([^\)]*\)" is supposed to remove any parenthesized
text, but there's no point in using "[^\)]", since the set of "any
character except ')'" is simply denoted by "[^)]", because a parenthesis
is not a meta-character within brackets.
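
A minimal sketch to check this, assuming an XSLT 2.0 processor such as
Saxon started with a named initial template (here called "main"); the
sample title is made up for illustration and is not from the original
data:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical sample input, not taken from the original post -->
  <xsl:variable name="title" select="'Annual Report (draft 2) 2010'"/>

  <xsl:template name="main">
    <!-- Same replacement, with ')' escaped and unescaped inside the brackets -->
    <xsl:variable name="escaped" select="replace($title, '\([^\)]*\)', '')"/>
    <xsl:variable name="plain" select="replace($title, '\([^)]*\)', '')"/>
    <!-- Both variables should hold the same string, so this should print 'true' -->
    <xsl:value-of select="$escaped eq $plain"/>
  </xsl:template>
</xsl:stylesheet>
```

Both forms are legal; the unescaped one is just easier to read.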

Second, to remove all characters of a kind (single character or class),
it's better form to use a repetition, e.g., "\d+" rather than just "\d".
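
As a rough illustration (same assumptions as before: XSLT 2.0, a named
initial template "main", and an invented sample string), the two patterns
give the same result, but "\d+" consumes each run of digits in a single
match instead of one match per digit:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- Hypothetical sample input, not taken from the original post -->
  <xsl:variable name="title" select="'Chapter 12 of 345'"/>

  <xsl:template name="main">
    <!-- '\d' matches five times here, once per digit -->
    <xsl:variable name="perDigit" select="replace($title, '\d', '')"/>
    <!-- '\d+' matches twice, once per run of digits -->
    <xsl:variable name="perRun" select="replace($title, '\d+', '')"/>
    <!-- The resulting strings should be identical, so this should print 'true' -->
    <xsl:value-of select="$perDigit eq $perRun"/>
  </xsl:template>
</xsl:stylesheet>
```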

-W


On 28 September 2010 14:44, Alex Muir <alex.g.muir@xxxxxxxxx> wrote:
> Hi,
>
> I found something quite interesting which may help further understand the
> issue.
>
> Independently, none of the following variables takes long to process:
> when I no longer chain the variables together but run the template
> calling only one variable, with the others commented out, the run
> time is short.
>
>   <xsl:variable name="title"
>       select="mh:stripTextNewline(normalize-space(.))"/>
>
>     <xsl:variable name="titleBraketedTextRemoved"
>       select="replace($title,'\([^\)]*\)','')"/>
>
>     <xsl:variable name="titleNumberRemoved"
>       select="replace($titleBraketedTextRemoved,'\d','')"/>
>
>     <xsl:variable name="titleStripPunctuation"
>       select="mh:stripPunctuation($titleNumberRemoved)"/>
>
>     <xsl:variable name="titleStopWordsRemoved"
>       select="normalize-space(mh:removeStopwords($titleStripPunctuation,$stopwords))"/>
>
> As the variables are combined, they take more and more time to
> execute, and with all of them together the run never finishes.
>
> So initially I was wrong to suggest that the titleBraketedTextRemoved
> variable was causing the problem. It's just that the problem is
> exacerbated when I finally add that variable into the chain of
> variables.
>
> I reduced the size of the input file so that $title contains one
> small line of text, in order to get an idea from profiling; however,
> the processing still does not complete.
>
> I'll have to talk to my client later today before posting the full code.
>
> Thanks
> Alex
>
>
> On Mon, Sep 27, 2010 at 7:54 PM, Michael Kay <mike@xxxxxxxxxxxx> wrote:
>> I don't know - they are both, I think, using the Java regular expression
>> engine underneath. It may be a function of how you are measuring it. It
>> could be that the cost is dominated not by the cost of evaluating the
>> regex, but by the cost of checking that it conforms to the XPath rules.
>> Did you run a Java profile to determine where the time is being spent?
>>
>> Michael Kay
>> Saxonica
>>
>> On 27/09/2010 7:21 PM, Alex Muir wrote:
>>>
>>> Hi,
>>>
>>> I'm unable to figure out why this regex is so time-consuming that it
>>> never finishes in oXygen, yet works quickly in RegexBuddy on the
>>> same content.
>>>
>>>     <xsl:variable name="BraketedTextRemoved"
>>>        select="replace($title,'\([^\)]*\)','')"/>
>>>
>>> I'm just trying to remove content with brackets ( dfd234**#*$#*$#fdfd )
>>>
>>> Running on vendor="SAXON 9.2.0.6 from Saxonica" version="2.0"
>>>
>>> Any Ideas?
>>>
>>> Thanks
>>> Alex
