Subject: Re: [xsl] Why is the variable and regex slow in saxon and fast in regex Buddy?
From: Alex Muir <alex.g.muir@xxxxxxxxx>
Date: Tue, 28 Sep 2010 20:43:30 +0000

Well, it turns out the problem was a combination of factors, but the main culprit was the following regex, which, depending on the input produced by the other 4 variables, would either not terminate or run fast or slow. What confused me most was that it ran quickly for most of the files I process, and that removing one variable or another led to improvements purely by chance, given the input files.

    and matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]{0,}\s*?)+$')

I wrote this instead:

    and (matches($titleStopWordsRemoved,'^([A-Z][A-Za-z]+\s+)+?([A-Z][A-Za-z]+?)$')
         or matches($titleStopWordsRemoved,'^[A-Z][A-Za-z]+\s*$'))">

The first pattern looks for a sequence of title-case or upper-case words, and the second for just one word.

I see now that the profiler output makes this clear, given that

    99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")

takes so long and the calls below it take so little time.

99.92 % - 3 ms - 0.03 % - 1 inv. xsl:template (match="chunk")
  99.89 % - 0 ms - 0.0 % - 1 inv. let (name="title")
    99.89 % - 0 ms - 0.0 % - 1 inv. let (name="titleBraketedTextRemoved")
      99.89 % - 2 ms - 0.02 % - 1 inv. let (name="titleNumberRemoved")
        99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStripPunctuation")
          99.86 % - 0 ms - 0.0 % - 1 inv. let (name="titleStopWordsRemoved")
            99.86 % - 0 ms - 0.0 % - 1 inv. xsl:choose
              99.83 % - 14026 ms - 99.67 % - 1 inv. function-call (name="matches")
                0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="normalize-space")
                  0.16 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:removeStopwords")
                    0.15 % - 0 ms - 0.0 % - 1 inv. xsl:function (name="mh:removeStopwords") (as="xs:string?")
                    0.0 % - 0 ms - 0.0 % - 1 inv. function-call (name="mh:stripPunctuation")
              0.02 % - 0 ms - 0.01 % - 1 inv. noMatch
              0.01 % - 0 ms - 0.0 % - 1 inv. function-call (name="not")
        0.02 % - 0 ms - 0.0 % - 1 inv. function-call (name="replace")
0.03 % - 3 ms - 0.03 % - 1 inv. xsl:variable (name="stopwords") (select=" ('a', 'an', 'and', 'is', 'as', 'at', 'be', 'been', 'before', 'between', 'both', 'but', 'by', 'for', 'from', 'in', 'into', 'of', 'on', 'or', 'other', 'per', 'such ', 'than', 'that', 'the', 'these', 'this', 'to' , 'Q')" )

Thanks much,
Alex

On Tue, Sep 28, 2010 at 4:43 PM, Wolfgang Laun <wolfgang.laun@xxxxxxxxx> wrote:
> Two comments, which may not shed any light on the non-termination, but anyway:
>
> First, the pattern "\([^\)]*\)" is supposed to remove any
> parenthesized text, but there's no point in using "[^\)]", since the
> set of "any character except ')'" is simply denoted by "[^)]", because
> a parenthesis is not a meta-character within brackets.
>
> Second, to remove all characters of a kind (single character or class)
> it's better form to use a repetition, e.g., "\d+" rather than just "\d".
>
> -W
>
>
> On 28 September 2010 14:44, Alex Muir <alex.g.muir@xxxxxxxxx> wrote:
>> Hi,
>>
>> I found something quite interesting which may help to understand the
>> issue further.
>>
>> Independently, none of the following variables takes long to process:
>> when I no longer chain the variables together but run the template
>> calling only one variable, with the others commented out, the
>> time to run is short.
>>
>> <xsl:variable name="title"
>>     select="mh:stripTextNewline(normalize-space(.))"/>
>>
>> <xsl:variable name="titleBraketedTextRemoved"
>>     select="replace($title,'\([^\)]*\)','')"/>
>>
>> <xsl:variable name="titleNumberRemoved"
>>     select="replace($titleBraketedTextRemoved,'\d','')"/>
>>
>> <xsl:variable name="titleStripPunctuation"
>>     select="mh:stripPunctuation($titleNumberRemoved)"/>
>>
>> <xsl:variable name="titleStopWordsRemoved"
>>     select="normalize-space(mh:removeStopwords($titleStripPunctuation,$stopwords) )"/>
>>
>> As the variables are combined they take more and more time to
>> execute, and with all of them together the transformation does not
>> stop running.
>>
>> So I was wrong initially to suggest that the titleBraketedTextRemoved
>> variable was causing the problem. It's just that the problem is
>> exacerbated when I finally add that variable to the chain of
>> variables.
>>
>> I reduced the size of the input file so that $title contains one
>> short line of text, in order to get an idea from the profiling;
>> however, the processing still does not complete.
>>
>> I'll have to talk to my client later today before posting the full code.
>>
>> Thanks
>> Alex
>>
>>
>> On Mon, Sep 27, 2010 at 7:54 PM, Michael Kay <mike@xxxxxxxxxxxx> wrote:
>>> I don't know - they are both, I think, using the Java regular expression
>>> engine underneath. It may be a function of how you are measuring it. It
>>> could be that the cost is dominated not by the cost of evaluating the regex,
>>> but by the cost of checking that it conforms to the XPath rules. Did you run
>>> a Java profile to determine where the time is being spent?
>>>
>>> Michael Kay
>>> Saxonica
>>>
>>> On 27/09/2010 7:21 PM, Alex Muir wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm unable to figure out why this regex is so time-consuming that
>>>> it does not finish in oXygen, yet works quickly in RegexBuddy on the
>>>> same content.
>>>>
>>>> <xsl:variable name="BraketedTextRemoved"
>>>>     select="replace($title,'\([^\)]*\)','')"/>
>>>>
>>>> I'm just trying to remove content within brackets ( dfd234**#*$#*$#fdfd )
>>>>
>>>> Running on vendor="SAXON 9.2.0.6 from Saxonica" version="2.0"
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks
>>>> Alex
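
For anyone who wants to compare the original matches() pattern with the rewritten pair outside Saxon, here is a minimal sketch in Python. The three patterns are copied from the message at the top; everything else (the function names, the sample titles) is made up for illustration. Python's re module is only a stand-in backtracking engine here, not the engine Saxon uses, so it can show which titles each test accepts but says nothing reliable about Saxon's timings.

```python
import re

# Patterns copied verbatim from the message. Python's re is an assumption
# used as a stand-in; it is not the regex engine Saxon runs on.
ORIGINAL = re.compile(r'^([A-Z][A-Za-z]{0,}\s*?)+$')
MULTI_WORD = re.compile(r'^([A-Z][A-Za-z]+\s+)+?([A-Z][A-Za-z]+?)$')
SINGLE_WORD = re.compile(r'^[A-Z][A-Za-z]+\s*$')


def original_matches(title: str) -> bool:
    """Mirror the original test: true if the single pattern is found."""
    return ORIGINAL.search(title) is not None


def rewritten_matches(title: str) -> bool:
    """Mirror the rewritten test: multi-word pattern OR single-word pattern."""
    return bool(MULTI_WORD.search(title) or SINGLE_WORD.search(title))


# Hypothetical sample titles, not taken from the thread. The last one is a
# short run of capitals followed by a lowercase word, which forces the
# original pattern to backtrack before it can report failure.
SAMPLES = [
    "Hello World",
    "Hello",
    "Hello world",
    "A Tale",
    "ABCDEFGHIJKLMNOP x",
]

if __name__ == "__main__":
    for title in SAMPLES:
        print(f"{title!r}: original={original_matches(title)} "
              f"rewritten={rewritten_matches(title)}")
```

In the original pattern the whitespace part may match nothing, so every capital letter can either continue the current word or start a new iteration of the group. A run of n capitals can therefore be split in 2^(n-1) ways, and a backtracking engine may try all of them before giving up, which is why the last sample is kept short. The rewrite only allows a new iteration after at least one whitespace character, so that ambiguity disappears. The two tests are not equivalent, though: the rewrite needs at least two letters per word, so a title such as "A Tale" passes the original test but not the rewritten one.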