Subject: Re: [xsl] Why is the variable and regex slow in saxon and fast in regex Buddy?
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Wed, 29 Sep 2010 12:10:26 +0200

On 28 September 2010 22:43, Alex Muir <alex.g.muir@xxxxxxxxx> wrote:
>
> Well, it turns out the problem was a combination of factors, but the
> main one was the following regex, which, depending on the input given
> by the other 4 variables, would not terminate, or would run fast or
> slow... I suppose what was confusing me the most was that for most
> files I process it was running quickly, and removing one or another
> variable led to improvements just by chance, given the input files.
>
> and matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]{0,}\s*?)+$')

This is really evil, as it will backtrack exponentially. The "\s*"
isn't providing separation between words; a "\s+" should do the trick.
"{0,}" isn't wrong, but why not use "*"?

>
> I wrote this instead
>
> and (matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]+\s+)+?([A-Z][A-Za-z]+?)$')
> or matches($titleStopWordsRemoved,'^[A-Z][A-Za-z]+\s*$'))">
>

I don't see the point of using two expressions, or "+?". To match a
string consisting entirely of capitalized words separated by white space:

^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*$

You may add \s* at the end to handle optional trailing white space.

-W

> The first one looks for title or upper case words and the second just one word.
>
> I see now that the profiler output makes that clear, given that
>
> 99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")
>
> takes so long and the calls below it take so little time.
>
> 99.92 % - 3 ms - 0.03 % - 1 inv. xsl:template (match="chunk")
>   99.89 % - 0 ms - 0.0 % - 1 inv. let (name="title")
>     99.89 % - 0 ms - 0.0 % - 1 inv. let (name="titleBraketedTextRemoved")
>       99.89 % - 2 ms - 0.02 % - 1 inv. let (name="titleNumberRemoved")
>         99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStripPunctuation")
>           99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStopWordsRemoved")
>             99.86 % - 0 ms - 0.0 % - 1 inv. xsl:choose
>               99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")
>                 0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="normalize-space")
>                   0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:removeStopwords")
>                     0.15 % - 0 ms - 0.0 % - 1 inv. xsl:function (name="mh:removeStopwords") (as="xs:string?")
>                     0.0 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:stripPunctuation")
>               0.02 % - 0 ms - 0.01 % - 1 inv. noMatch
>               0.01 % - 0 ms - 0.0 % - 1 inv. function-call (name="not")
>         0.02 % - 0 ms - 0.0 % - 1 inv. function-call (name="replace")
> 0.03 % - 3 ms - 0.03 % - 1 inv. xsl:variable (name="stopwords")
>   (select=" ('a', 'an', 'and', 'is', 'as', 'at', 'be', 'been', 'before',
>   'between', 'both', 'but', 'by', 'for', 'from', 'in', 'into', 'of', 'on',
>   'or', 'other', 'per', 'such ', 'than', 'that', 'the', 'these', 'this',
>   'to' , 'Q')")
>
> Thanks Much
> Alex
>
>
> On Tue, Sep 28, 2010 at 4:43 PM, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
> > Two comments, which may not shed any light on the non-termination, but anyway:
> >
> > First, the pattern "\([^\)]*\)" is supposed to remove any
> > parenthesized text, but there's no point in using "[^\)]" since the
> > set of "any character except ')'" is simply denoted by "[^)]",
> > because a parenthesis is not a meta-character within brackets.
> >
> > Second, to remove all characters of a kind (single character or class)
> > it's better form to use a repetition, e.g., "\d+" rather than just "\d".
> >
> > -W
> >
> >
> > On 28 September 2010 14:44, Alex Muir <alex.g.muir@xxxxxxxxx> wrote:
> >> Hi,
> >>
> >> I found something quite interesting which may help further understand the issue.
> >>
> >> Independently, none of the following variables takes long to process:
> >> when I no longer chain the variables together but just run the
> >> template calling only one variable, with the others commented out,
> >> the time to run is short.
> >>
> >> <xsl:variable name="title"
> >>   select="mh:stripTextNewline(normalize-space(.))"/>
> >>
> >> <xsl:variable name="titleBraketedTextRemoved"
> >>   select="replace($title,'\([^\)]*\)','')"/>
> >>
> >> <xsl:variable name="titleNumberRemoved"
> >>   select="replace($titleBraketedTextRemoved,'\d','')"/>
> >>
> >> <xsl:variable name="titleStripPunctuation"
> >>   select="mh:stripPunctuation($titleNumberRemoved)"/>
> >>
> >> <xsl:variable name="titleStopWordsRemoved"
> >>   select="normalize-space(mh:removeStopwords($titleStripPunctuation,$stopwords) )"/>
> >>
> >> As the variables are combined together they take more and more time to
> >> execute, and finally, with all of them together, they do not stop running.
> >>
> >> So initially I was wrong to suggest that the titleBraketedTextRemoved
> >> variable was causing the problem. It's just that the problem is
> >> exacerbated when I finally add that variable into the chain of
> >> variables.
> >>
> >> I reduced the size of the input file so that $title contains one
> >> small line of text, in order to get an idea from the profiling; however,
> >> the processing does not complete.
> >>
> >> I'll have to talk to my client later today before posting the full code.
> >>
> >> Thanks
> >> Alex
> >>
> >>
> >> On Mon, Sep 27, 2010 at 7:54 PM, Michael Kay <mike@xxxxxxxxxxxx> wrote:
> >>> I don't know - they are both, I think, using the Java regular expression
> >>> engine underneath. It may be a function of how you are measuring it. It
> >>> could be that the cost is dominated not by the cost of evaluating the regex,
> >>> but by the cost of checking that it conforms to the XPath rules. Did you run
> >>> a Java profile to determine where the time is being spent?
> >>>
> >>> Michael Kay
> >>> Saxonica
> >>>
> >>> On 27/09/2010 7:21 PM, Alex Muir wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> I'm unable to figure out why this regex is so very time consuming, such
> >>>> that it does not end in oxygen but works quickly in regex buddy on the
> >>>> same content.
> >>>>
> >>>> <xsl:variable name="BraketedTextRemoved"
> >>>>   select="replace($title,'\([^\)]*\)','')"/>
> >>>>
> >>>> I'm just trying to remove content with brackets ( dfd234**#*$#*$#fdfd )
> >>>>
> >>>> Running on vendor="SAXON 9.2.0.6 from Saxonica" version="2.0"
> >>>>
> >>>> Any Ideas?
> >>>>
> >>>> Thanks
> >>>> Alex
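
The exponential backtracking discussed in this thread can be reproduced outside Saxon. The following is only a sketch, assuming Python's `re` module as a stand-in for a backtracking engine (it is not the XPath regex flavor, though these two patterns use syntax common to both). The names `SLOW`, `FAST` and `timed_search` are made up for illustration. In the original pattern every group iteration starts with `[A-Z]`, but `[A-Za-z]{0,}` can also consume capitals and `\s*?` may match nothing, so a run of n capitals can be divided into groups in 2^(n-1) ways; when the overall match fails, all of them get tried.

```python
import re
import time

# Pattern from the thread: a run of n capitals can be split into groups
# in 2**(n-1) ways, and a failing match makes the engine try them all.
SLOW = re.compile(r'^([A-Z][A-Za-z]{0,}\s*?)+$')

# The rewrite suggested in the thread, with the optional trailing \s*:
# every word after the first must be preceded by \s+, so a string can
# only be parsed one way.
FAST = re.compile(r'^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*\s*$')


def timed_search(pattern, text):
    """Return (matched, seconds) for pattern.search(text)."""
    start = time.perf_counter()
    matched = pattern.search(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    for n in (10, 14, 18):
        # n capitals followed by a digit: no match is possible, so the
        # original pattern has to exhaust every way of splitting the run.
        text = "A" * n + "1"
        _, slow = timed_search(SLOW, text)
        _, fast = timed_search(FAST, text)
        print(f"n={n:2d}  original: {slow:.4f}s  rewritten: {fast:.6f}s")
```

Ordinary title-case input such as "Hello World" has few capitals, so there is little to backtrack over. That would be consistent with the report above that most files ran quickly.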