RE: [xsl] How to make this script faster

Subject: RE: [xsl] How to make this script faster
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 15 Nov 2007 23:00:27 -0000
>  From just looking at your stylesheet, I noticed a couple of 
> things, but I don't know whether changes will make it faster. 

I noticed a few stylistic things too. I hate the verbosity of

> >       <xsl:element name="section">
> >         <xsl:attribute name="ref" select="$extract-section"/>
> >         <xsl:attribute name="name" select="normalize-space($section-name)"/>

when you could write
          <section ref="{$extract-section}" name="{normalize-space($section-name)}">

But that's not a performance issue, and nor are most of the points Abel
made; and I have to say I couldn't see anything at all here that should
cause performance problems.

Abel might be right about the regular expression - innocent-looking regexes
can sometimes catch you out - but this one looks as if it will give a
no-match on most input lines very quickly with no backtracking needed.

So, let's have some data:

* what processor/version are you using?

* how are you running it?

* what's the size of the input data?

* how long is it actually taking?

Michael Kay
http://www.saxonica.com/


> You didn't specify what processor you use. If you use 
> AltovaXML, it can at times be extremely slow (exponential 
> performance) and it is worthwhile to try your code with a 
> more optimized processor like Saxon.
> 
>  From the code I notice that you use XSLT 2.0, which can 
> usually be more easily optimized than XSLT 1.0, both in code 
> (tail recursion and using "as" attributes to specify result 
> types) and in the processor, because the language allows for 
> easier optimizations of common tasks (like regular 
> expressions instead of recursive templates).
> 
> But you still seem to use a lot of XSLT 1.0 techniques where 
> I would prefer the 2.0 version. Consider putting your 
> xsl:call-template (named templates) in an xsl:function (even
> recursively). Consider using if (value) then ... else ... instead
> of xsl:if or xsl:choose. 
> Consider using matching templates instead of xsl:when etc, 
> which may perform faster.
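
To make the xsl:function suggestion concrete, here is a minimal sketch of
the section-title test as a typed function, used from a match pattern. The
my: namespace, the function name my:is-section-title and the <title> output
element are invented for illustration; only the regex comes from the
original stylesheet. Whether this is any faster depends on the processor.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:my="urn:example:my-functions"
                exclude-result-prefixes="xs my"
                version="2.0">

  <!-- Hypothetical helper: true if the element is a para whose text
       starts with a section number followed by a space -->
  <xsl:function name="my:is-section-title" as="xs:boolean">
    <xsl:param name="node" as="element()"/>
    <xsl:sequence select="
      exists($node/self::para)
      and matches(normalize-space($node),
                  '^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ ')"/>
  </xsl:function>

  <!-- A matching template replaces the xsl:if inside the named template -->
  <xsl:template match="para[my:is-section-title(.)]">
    <title><xsl:value-of select="normalize-space(.)"/></title>
  </xsl:template>

</xsl:stylesheet>
```

The "as" attributes let the processor check types statically, and the
predicate keeps the title test in one place instead of repeating the regex.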
> 
> But the main performance cost probably lies in passing 
> the following-sibling axis along and walking it one item 
> at a time. You can do the same with matching templates 
> alone, and you are probably better off using keys to optimize 
> performance, or introducing a for-each or a for-each-group. 
> Anything is better than the recursive named template.
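
As a rough illustration of the for-each-group idea, the sketch below emits
every section in a single pass instead of one run per extract-section value.
It assumes that each section starts at a para matching the original regex
and ends at the next such para. The <sections> wrapper and the title-regex
variable are invented here, and unlike the original it does not stop at
elements other than para and informaltable.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <!-- Same pattern as in the original stylesheet: section number, then a space -->
  <xsl:variable name="title-regex" as="xs:string"
                select="'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '"/>

  <xsl:template match="/article">
    <sections>
      <!-- Each group starts at a section title and runs to the next one -->
      <xsl:for-each-group select="*"
          group-starting-with="para[matches(normalize-space(.), $title-regex)]">
        <xsl:variable name="title" as="xs:string" select="normalize-space(.)"/>
        <!-- Skip a leading group that has no section title -->
        <xsl:if test="self::para and matches($title, $title-regex)">
          <section ref="{substring-before($title, ' ')}"
                   name="{substring-after($title, ' ')}">
            <xsl:apply-templates select="current-group()[position() gt 1]"/>
          </section>
        </xsl:if>
      </xsl:for-each-group>
    </sections>
  </xsl:template>

  <xsl:template match="para">
    <text><xsl:value-of select="concat(., '&#10;')"/></text>
  </xsl:template>

  <xsl:template match="informaltable">
    <table><xsl:value-of select="concat(., '&#10;')"/></table>
  </xsl:template>

</xsl:stylesheet>
```

Each child of article is visited once by the grouping, so there is no
recursion over the following-sibling axis and no need to edit the
stylesheet between runs.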
> 
> If that does not improve things, you should have a look at 
> some of the backtracking problems your regular expression 
> will cause. The regular expression parser used by Saxon is 
> the same as the one from Java and it has quite a bad 
> performance when it comes to quadratic backtracking (of the 
> form: (x+)+). I haven't looked into it enough, but if you can 
> rewrite it for less backtracking, or optimize the regex to 
> match the most common situation, or even pass it on in a 
> doubly nested (awkward, I know, but hey you are optimizing 
> for speed) xsl:analyze-string then you may profit a lot for speed.
> 
> It is hard to predict the behavior of a regular expression. I 
> once wrote a very simple regular expression for matching CSV 
> records that took exponential time when the overall 
> match for the CSV line failed (i.e., a non-matching quote 
> pair). This regex took about 1.5 hours for a string of 60 
> characters (and the time doubled for each extra 3 characters; 
> this regex is somewhere on the Saxon list)! Rewriting it for less 
> backtracking improved the performance to linear.
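
For anyone who wants to see the effect in isolation, here is a small test
stylesheet, offered as a sketch only. It runs a nested-quantifier regex and
an equivalent flat one against a string that neither can match. The length
parameter and the result element names are invented for illustration. Both
tests return false; how long the nested form takes depends entirely on the
regex engine your processor uses, so raise length in small steps.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- Hypothetical parameter: number of 'x' characters in the test string -->
  <xsl:param name="length" as="xs:integer" select="20"/>

  <!-- Any source document will do; its content is ignored -->
  <xsl:template match="/">
    <!-- A run of x's followed by '!' so that neither regex can match -->
    <xsl:variable name="input" as="xs:string"
        select="concat(string-join(for $i in 1 to $length return 'x', ''), '!')"/>
    <result length="{$length}">
      <!-- Nested quantifier: a backtracking engine may try many ways
           of splitting the x's before it gives up -->
      <nested><xsl:value-of select="matches($input, '^(x+)+$')"/></nested>
      <!-- Same language with a single quantifier: one scan is enough -->
      <flat><xsl:value-of select="matches($input, '^x+$')"/></flat>
    </result>
  </xsl:template>

</xsl:stylesheet>
```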
> 
> If the regex is indeed the problem (test it with something
> straightforward) then I suggest you read the regex optimizing 
> chapter in Jeffrey Friedl's now famous book on regular expressions.
> 
> HTH,
> Cheers,
> -- Abel Braaksma
> 
> PS: not all hints above will necessarily or predictably 
> improve performance
> PPS: you do not need the namespace for the XPath functions, 
> after all, for some functions you do use the fn: prefix, for 
> others you don't... 
> You can just leave it out.
> 
> 
> Mathieu Malaterre wrote:
> > Hi there,
> >
> >   I have a working version of an XSLT script:
> > http://gdcm.svn.sourceforge.net/viewvc/gdcm/Sandbox/xslt/2/
> >
> >   See (*) and (**). What I would like to do is :
> >
> > 1. Be able to run the xslt in one pass. For now I have to run it with
> > <xsl:param name="extract-section" select="'C.1'"/>
> > then edit test.xsl file, comment the line and uncomment:
> > <xsl:param name="extract-section" select="'C.2'"/>
> > and so on and so forth...
> >
> > 2. This script is seriously *slow*. I guess running it in one pass
> > should solve most of the issue, but if there was something obvious I
> > was missing... thanks !
> >
> > -Mathieu
> >
> > (*)
> > $ cat test.xml
> > <?xml version="1.0"?>
> > <article>
> >   <para>C.1 Title 1</para>
> >   <para>info for section C.1</para>
> >   <informaltable>table1</informaltable>
> >   <para>C.2 Title 2</para>
> >   <informaltable>table2</informaltable>
> >   <para>info for section C.2</para>
> >   <para>C.2.1 Title 2.1</para>
> >   <para>text for section C.2.1</para>
> >   <para>text for section C.2.1 again</para>
> >   <para>C.2.2 Tile 2.2</para>
> >   <informaltable>table for 2.2</informaltable>
> >   <para>text for section C.2.2</para>
> > </article>
> >
> > (**)
> > $ cat test.xsl
> > <?xml version="1.0" encoding="UTF-8"?>
> > <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> > xmlns:fn="http://www.w3.org/2005/xpath-functions" version="2.0">
> >
> > <!-- GENERAL -->
> >
> > <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
> >
> > <!-- number of the sample section to be extracted -->
> > <!--xsl:param name="extract-section" select="'C.1'"/-->
> > <!--xsl:param name="extract-section" select="'C.2'"/-->
> > <!--xsl:param name="extract-section" select="'C.2.1'"/-->
> > <xsl:param name="extract-section" select="'C.2.2'"/>
> >
> >
> > <xsl:template match="para">
> > <text>
> > <xsl:value-of select="concat(.,'&#10;')"/>
> > </text>
> > </xsl:template>
> >
> > <xsl:template match="informaltable">
> > <table>
> > <xsl:value-of select="concat(.,'&#10;')"/>
> > </table>
> > </xsl:template>
> >
> > <!-- MAIN -->
> >
> > <xsl:template match="/article">
> >   <xsl:variable name="section-number"
> > select="concat($extract-section,' ')"/>
> >   <xsl:variable name="section-anchor"
> > select="para[starts-with(normalize-space(.),$section-number)]"/>
> >   <xsl:variable name="section-name"
> > select="substring-after(para[starts-with(normalize-space(.),$section-number)],$extract-section)"/>
> >   <xsl:choose>
> >     <xsl:when test="count($section-anchor)=1">
> >       <xsl:message>Info: section <xsl:value-of
> > select="$extract-section"/> found</xsl:message>
> >       <xsl:element name="section">
> >         <xsl:attribute name="ref" select="$extract-section"/>
> >         <xsl:attribute name="name"
> > select="normalize-space($section-name)"/>
> >         <xsl:call-template name="copy-section-paragraphs">
> >           <xsl:with-param name="section-paragraphs"
> > select="$section-anchor/following-sibling::*"/>
> >         </xsl:call-template>
> >       </xsl:element>
> >       <xsl:message>Info: all paragraphs extracted</xsl:message>
> >     </xsl:when>
> >     <xsl:when test="count($section-anchor)>1">
> >       <xsl:message>Error: section <xsl:value-of
> > select="$extract-section"/> found multiple times!</xsl:message>
> >     </xsl:when>
> >     <xsl:otherwise>
> >       <xsl:message>Error: section <xsl:value-of
> > select="$extract-section"/> not found!</xsl:message>
> >     </xsl:otherwise>
> >   </xsl:choose>
> > </xsl:template>
> >
> > <!-- TEMPLATES -->
> >
> > <xsl:template name="copy-section-paragraphs">
> >   <xsl:param name="section-paragraphs"/>
> >   <xsl:variable name="current-paragraph"
> > select="$section-paragraphs[1]"/>
> >   <!-- search for next section title -->
> >   <xsl:if test="($current-paragraph[name()='para' or
> > name()='informaltable']) and
> > not(fn:matches(normalize-space($current-paragraph),'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '))">
> >     <!-- output current paragraph (close with a newline) -->
> >     <xsl:apply-templates select="$current-paragraph"/>
> >     <xsl:call-template name="copy-section-paragraphs">
> >       <xsl:with-param name="section-paragraphs"
> > select="$section-paragraphs[position()>1]"/>
> >     </xsl:call-template>
> >   </xsl:if>
> > </xsl:template>
> >
> > </xsl:stylesheet>
