Subject: Re: [xsl] Why is the variable and regex slow in saxon and fast in regex Buddy?
From: Alex Muir <alex.g.muir@xxxxxxxxx>
Date: Wed, 29 Sep 2010 13:29:08 +0000

Thanks Wolfgang,

I tried that regex in RegexBuddy and it works effectively and takes
fewer steps in both matching and non-matching cases.

^[A-Z][A-Za-z]+(\s+[A-Z][A-Za-z]*)*$

Thanks
Alex

On Wed, Sep 29, 2010 at 10:10 AM, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
> On 28 September 2010 22:43, Alex Muir <alex.g.muir@xxxxxxxxx> wrote:
>>
>> Well, it turns out the problem was a combination of factors, but the
>> main one was the following regex, which depending on the input given
>> by the other 4 variables would not terminate, or would run fast or
>> slow... I suppose what was confusing me the most was that for most
>> files I process it was running quickly, and removing one or another
>> variable led to improvements just because of chance given the input
>> files.
>>
>> and matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]{0,}\s*?)+$')
>
> This is really evil, as it will backtrack exponentially. The "\s*"
> isn't providing separation between words; a "\s+" should do the
> trick. "{0,}" isn't wrong, but why not use "*"?
>
>>
>> I wrote this instead
>>
>> and (matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]+\s+)+?([A-Z][A-Za-z]+?)$')
>> or matches($titleStopWordsRemoved,'^[A-Z][A-Za-z]+\s*$'))">
>>
>
> I don't see the point of using two expressions, or "+?".
>
> To match a string consisting entirely of capitalized words separated by
> white space:
>
> ^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*$
>
> You may add \s* at the end to handle optional trailing white space.
>
> -W
>
>> The first one looks for title or upper case words and the second just one word.
>>
>> I see now that the profiler output makes that clear, given that
>>
>> 99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")
>>
>> takes so long and the calls below it take so little time.
>>
>> 99.92 % - 3 ms - 0.03 % - 1 inv. xsl:template (match="chunk")
>>  99.89 % - 0 ms - 0.0 % - 1 inv. let (name="title")
>>   99.89 % - 0 ms - 0.0 % - 1 inv. let (name="titleBraketedTextRemoved")
>>    99.89 % - 2 ms - 0.02 % - 1 inv. let (name="titleNumberRemoved")
>>     99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStripPunctuation")
>>      99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStopWordsRemoved")
>>       99.86 % - 0 ms - 0.0 % - 1 inv. xsl:choose
>>        99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")
>>         0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="normalize-space")
>>          0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:removeStopwords")
>>           0.15 % - 0 ms - 0.0 % - 1 inv. xsl:function (name="mh:removeStopwords") (as="xs:string?")
>>           0.0 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:stripPunctuation")
>>        0.02 % - 0 ms - 0.01 % - 1 inv. noMatch
>>        0.01 % - 0 ms - 0.0 % - 1 inv. function-call (name="not")
>>     0.02 % - 0 ms - 0.0 % - 1 inv. function-call (name="replace")
>> 0.03 % - 3 ms - 0.03 % - 1 inv. xsl:variable (name="stopwords")
>>   (select=" ('a', 'an', 'and', 'is', 'as', 'at', 'be', 'been', 'before',
>>   'between', 'both', 'but', 'by', 'for', 'from', 'in', 'into', 'of', 'on',
>>   'or', 'other', 'per', 'such ', 'than', 'that', 'the', 'these', 'this',
>>   'to' , 'Q')")
>>
>> Thanks Much
>> Alex
>>
>>
>> On Tue, Sep 28, 2010 at 4:43 PM, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
>> > Two comments, which may not shed any light on the non-termination, but anyway:
>> >
>> > First, the pattern "\([^\)]*\)" is supposed to remove any
>> > parenthesized text, but there's no point in using "[^\)]" since
>> > the set of "any character except ')'" is simply denoted by "[^)]",
>> > because a parenthesis is not a meta-character within brackets.
>> >
>> > Second, to remove all characters of a kind (single character or class)
>> > it's better form to use a repetition, e.g., "\d+" rather than just "\d".
>> >
>> > -W
>> >
>> >
>> > On 28 September 2010 14:44, Alex Muir <alex.g.muir@xxxxxxxxx> wrote:
>> >> Hi,
>> >>
>> >> I found something quite interesting which may help further understand the issue.
>> >>
>> >> Independently, none of the following variables takes long to process,
>> >> such that when I no longer chain the variables together but just run
>> >> the template calling only one variable and comment out the others the
>> >> time to run is short.
>> >>
>> >> <xsl:variable name="title"
>> >>   select="mh:stripTextNewline(normalize-space(.))"/>
>> >>
>> >> <xsl:variable name="titleBraketedTextRemoved"
>> >>   select="replace($title,'\([^\)]*\)','')"/>
>> >>
>> >> <xsl:variable name="titleNumberRemoved"
>> >>   select="replace($titleBraketedTextRemoved,'\d','')"/>
>> >>
>> >> <xsl:variable name="titleStripPunctuation"
>> >>   select="mh:stripPunctuation($titleNumberRemoved)"/>
>> >>
>> >> <xsl:variable name="titleStopWordsRemoved"
>> >>   select="normalize-space(mh:removeStopwords($titleStripPunctuation,$stopwords) )"/>
>> >>
>> >> As the variables are combined together they take more and more time to
>> >> execute and finally if all together they do not stop running.
>> >>
>> >> So initially I was wrong to suggest that the titleBraketedTextRemoved
>> >> variable was causing the problem. It's just that the problem is
>> >> exacerbated when I finally add that variable into the chain of
>> >> variables.
>> >>
>> >> I reduced the size of the input file so that the $title contains one
>> >> small line of text in order to get an idea on the profiling however
>> >> the processing does not complete.
>> >>
>> >> I'll have to talk to my client later today before posting the full code.
>> >>
>> >> Thanks
>> >> Alex
>> >>
>> >>
>> >> On Mon, Sep 27, 2010 at 7:54 PM, Michael Kay <mike@xxxxxxxxxxxx> wrote:
>> >>> I don't know - they are both, I think, using the Java regular expression
>> >>> engine underneath. It may be a function of how you are measuring it. It
>> >>> could be that the cost is dominated not by the cost of evaluating the regex,
>> >>> but by the cost of checking that it conforms to the XPath rules. Did you run
>> >>> a Java profile to determine where the time is being spent?
>> >>>
>> >>> Michael Kay
>> >>> Saxonica
>> >>>
>> >>> On 27/09/2010 7:21 PM, Alex Muir wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I'm unable to figure out why this regex is so very time consuming such
>> >>>> that it does not end in oxygen but works quickly in regex buddy on the
>> >>>> same content.
>> >>>>
>> >>>> <xsl:variable name="BraketedTextRemoved"
>> >>>>   select="replace($title,'\([^\)]*\)','')"/>
>> >>>>
>> >>>> I'm just trying to remove content with brackets ( dfd234**#*$#*$#fdfd )
>> >>>>
>> >>>> Running on vendor="SAXON 9.2.0.6 from Saxonica" version="2.0"
>> >>>>
>> >>>> Any Ideas?
>> >>>>
>> >>>> Thanks
>> >>>> Alex
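
For anyone who wants to see the backtracking difference for themselves, here is a minimal sketch in Python. It is not from the thread: Python's `re` module stands in for the Java engine that Saxon uses (both are backtracking engines, so the behaviour should be comparable, but the timings will not match Saxon's), and the test input from `failing_input` is a made-up worst case, not one of the real titles. Only the two patterns are taken from the messages above.

```python
import re
import time

# Patterns copied from the thread.
SLOW = r'^([A-Z][A-Za-z]{0,}\s*?)+$'            # original pattern, nested quantifiers
SAFE = r'^[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*$'  # pattern suggested by Wolfgang Laun


def timed_match(pattern, text):
    """Return (matched, elapsed_seconds) for a single regex attempt."""
    start = time.perf_counter()
    matched = re.search(pattern, text) is not None
    return matched, time.perf_counter() - start


def failing_input(n):
    """Hypothetical worst case: n capitals, then a character neither pattern accepts."""
    return "A" * n + "1"


if __name__ == "__main__":
    # Lengths are kept small on purpose: with the slow pattern, every capital
    # can either extend the current word or start a new one, so a backtracking
    # engine has about twice as many splits to try for each extra letter.
    for n in (10, 14, 18):
        text = failing_input(n)
        _, slow_s = timed_match(SLOW, text)
        _, safe_s = timed_match(SAFE, text)
        print(f"n={n:2d}  slow={slow_s:.4f}s  safe={safe_s:.6f}s")
```

The slow pattern only blows up when the input cannot match, because the engine has to exhaust every way of splitting the run of capitals before giving up. That would be consistent with most files running quickly and a few never finishing.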