Subject: Re: [xsl] Why is the variable and regex slow in saxon and fast in regex Buddy?
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Wed, 29 Sep 2010 12:10:26 +0200

On 28 September 2010 22:43, Alex Muir <alex.g.muir@xxxxxxxxx> wrote:
>
> Well, it turns out the problem was a combination of factors, but it came
> down to the following regex which, depending on the input given by the
> other 4 variables, would not terminate, or would run fast or slow. I
> suppose what was confusing me the most was that for most files I process
> it ran quickly, and removing one or another variable led to improvements
> purely by chance, given the input files.
>
> and matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]{0,}\s*?)+$')

This is really evil, as it will backtrack exponentially. The "\s*?" can
match the empty string, so it isn't providing separation between words;
a "\s+" should do the trick. "{0,}" isn't wrong, but why not use "*"?
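
The blow-up can be reproduced outside Saxon. What follows is a minimal
sketch in Python, under the assumption that Python's "re" module is a fair
stand-in: it is a different engine from the one Saxon uses, but it also
backtracks. The helper name time_match and the test strings are made up
for illustration. A run of capitals followed by a character that cannot
match forces the engine to try every way of splitting the run into "words".

```python
import re
import time

# The pattern from the original post: "\s*?" may match nothing, so a run of
# capitals can be split into "words" in exponentially many ways.
SLOW = re.compile(r'^([A-Z][A-Za-z]{0,}\s*?)+$')
# Variant in which words must be separated by at least one white space character.
FAST = re.compile(r'^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*$')


def time_match(pattern, text):
    """Return (matched, seconds) for one whole-string match attempt."""
    start = time.perf_counter()
    matched = pattern.fullmatch(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    # The trailing "!" makes every attempt fail, which triggers the backtracking.
    for n in (12, 16, 20):
        text = "A" * n + "!"
        _, slow = time_match(SLOW, text)
        _, fast = time_match(FAST, text)
        print(f"n={n:2d}  slow={slow:.4f}s  fast={fast:.6f}s")
```

With this input, each extra capital letter should roughly double the time
for the first pattern, while the second stays flat. Ordinary title-case
input hides the problem, since only a capital can start a new "word".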

>
>
> I wrote this instead
>
>   and (matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]+\s+)+?([A-Z][A-Za-z]+?)$')
>            or matches($titleStopWordsRemoved,'^[A-Z][A-Za-z]+\s*$'))">
>

I don't see the point of using two expressions, or "+?".

To match a string consisting entirely of capitalized words separated by
white space:

    ^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*$

You may add \s* at the end to handle optional trailing white space.
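
A quick way to check what this pattern accepts is sketched below in Python.
This assumes Python's "re" module behaves like the XPath engine for this
simple pattern; the function name is_title and the sample strings are
hypothetical. fullmatch is used so that "$" means the end of the whole
string, as it does in XPath when no flags are given.

```python
import re

# The single pattern given above, and the variant with optional trailing
# white space.
TITLE = re.compile(r'^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*$')
TITLE_TRAILING = re.compile(r'^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*\s*$')


def is_title(text, allow_trailing=False):
    """True if text consists only of capitalized words separated by white space."""
    pattern = TITLE_TRAILING if allow_trailing else TITLE
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("Hello", "Hello World", "Hello world", "Hello World "):
        print(repr(sample), is_title(sample), is_title(sample, allow_trailing=True))
```

One expression covers both the single-word and the multi-word case, so
the "or" between two matches() calls is not needed.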

-W

> The first one looks for title or upper case words and the second just
> one word.
>
>
> I see now that the profiler output makes that clear, given that
>
> 99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")
>
> takes so long and the calls below it take so little time.
>
>
> 99.92 % - 3 ms - 0.03 % - 1 inv. xsl:template (match="chunk")
>   99.89 % - 0 ms - 0.0 % - 1 inv. let (name="title")
>     99.89 % - 0 ms - 0.0 % - 1 inv. let (name="titleBraketedTextRemoved")
>       99.89 % - 2 ms - 0.02 % - 1 inv. let (name="titleNumberRemoved")
>         99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStripPunctuation")
>           99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStopWordsRemoved")
>             99.86 % - 0 ms - 0.0 % - 1 inv. xsl:choose
>               99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")
>                 0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="normalize-space")
>                   0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:removeStopwords")
>                     0.15 % - 0 ms - 0.0 % - 1 inv. xsl:function (name="mh:removeStopwords") (as="xs:string?")
>                     0.0 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:stripPunctuation")
>               0.02 % - 0 ms - 0.01 % - 1 inv. noMatch
>               0.01 % - 0 ms - 0.0 % - 1 inv. function-call (name="not")
>         0.02 % - 0 ms - 0.0 % - 1 inv. function-call (name="replace")
> 0.03 % - 3 ms - 0.03 % - 1 inv. xsl:variable (name="stopwords")
>   (select=" ('a', 'an', 'and', 'is', 'as', 'at', 'be', 'been', 'before',
>   'between', 'both', 'but', 'by', 'for', 'from', 'in', 'into', 'of', 'on', 'or',
>   'other', 'per', 'such ', 'than', 'that', 'the', 'these', 'this', 'to' , 'Q')")
>
> Thanks Much
> Alex
>
>
> On Tue, Sep 28, 2010 at 4:43 PM, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
> > Two comments, which may not shed any light on the non-termination, but
> > anyway:
> >
> > First, the pattern "\([^\)]*\)" is supposed to remove any
> > parenthesized text, but there's no point in using "[^\)]", since the
> > set of "any character except ')'" is simply denoted by "[^)]", because
> > a parenthesis is not a meta-character within brackets.
> >
> > Second, to remove all characters of a kind (single character or class)
> > it's better form to use a repetition, e.g., "\d+" rather than just "\d".
> >
> > -W
> >
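
Both of the quoted comments can be checked with a short sketch, again in
Python on the assumption that its "re" module treats these simple patterns
the same way as the XPath engine. The function name remove_parenthesized
and the sample title are hypothetical.

```python
import re


def remove_parenthesized(text, pattern):
    """Delete every substring matched by pattern from text."""
    return re.sub(pattern, '', text)


if __name__ == "__main__":
    title = "Chapter 12 (draft 3) Overview (v2)"

    # Escaping ")" inside the brackets changes nothing: both forms denote
    # "any character except ')'".
    escaped = remove_parenthesized(title, r'\([^\)]*\)')
    plain = remove_parenthesized(title, r'\([^)]*\)')
    print(escaped == plain)

    # "\d" and "\d+" give the same result, but "\d+" performs one
    # replacement per run of digits rather than one per digit.
    print(re.subn(r'\d', '', plain))
    print(re.subn(r'\d+', '', plain))
```

The results are identical in both cases; the unescaped class is simply
easier to read, and the repetition does the same job in fewer replacements.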
> >
> > On 28 September 2010 14:44, Alex Muir <alex.g.muir@xxxxxxxxx> wrote:
> >> Hi,
> >>
> >> I found something quite interesting which may help further understand
> >> the issue.
> >>
> >> Independently, none of the following variables takes long to process:
> >> when I no longer chain the variables together but run the template
> >> calling only one variable, with the others commented out, the time to
> >> run is short.
> >>
> >>   <xsl:variable name="title"
> >>       select="mh:stripTextNewline(normalize-space(.))"/>
> >>
> >>     <xsl:variable name="titleBraketedTextRemoved"
> >>       select="replace($title,'\([^\)]*\)','')"/>
> >>
> >>     <xsl:variable name="titleNumberRemoved"
> >>       select="replace($titleBraketedTextRemoved,'\d','')"/>
> >>
> >>     <xsl:variable name="titleStripPunctuation"
> >>       select="mh:stripPunctuation($titleNumberRemoved)"/>
> >>
> >>     <xsl:variable name="titleStopWordsRemoved"
> >>       select="normalize-space(mh:removeStopwords($titleStripPunctuation,$stopwords))"/>
> >>
> >> As the variables are combined together they take more and more time to
> >> execute, and finally, with all of them together, they do not stop running.
> >>
> >> So initially I was wrong to suggest that the titleBraketedTextRemoved
> >> variable was causing the problem. It's just that the problem is
> >> exacerbated when I finally add that variable into the chain of
> >> variables.
> >>
> >> I reduced the size of the input file so that the $title contains one
> >> small line of text, in order to get an idea from the profiling; however,
> >> the processing does not complete.
> >>
> >> I'll have to talk to my client later today before posting the full code.
> >>
> >> Thanks
> >> Alex
> >>
> >>
> >> On Mon, Sep 27, 2010 at 7:54 PM, Michael Kay <mike@xxxxxxxxxxxx> wrote:
> >>>  I don't know - they are both, I think, using the Java regular
> >>> expression engine underneath. It may be a function of how you are
> >>> measuring it. It could be that the cost is dominated not by the cost of
> >>> evaluating the regex, but by the cost of checking that it conforms to
> >>> the XPath rules. Did you run a Java profile to determine where the time
> >>> is being spent?
> >>>
> >>> Michael Kay
> >>> Saxonica
> >>>
> >>> On 27/09/2010 7:21 PM, Alex Muir wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm unable to figure out why this regex is so time-consuming that it
> >>>> does not finish in oXygen, yet works quickly in RegexBuddy on the
> >>>> same content.
> >>>>
> >>>>     <xsl:variable name="BraketedTextRemoved"
> >>>>        select="replace($title,'\([^\)]*\)','')"/>
> >>>>
> >>>> I'm just trying to remove content with brackets ( dfd234**#*$#*$#fdfd )
> >>>>
> >>>> Running on vendor="SAXON 9.2.0.6 from Saxonica" version="2.0"
> >>>>
> >>>> Any Ideas?
> >>>>
> >>>> Thanks
> >>>> Alex
