Re: [xsl] How to make this script faster

Subject: Re: [xsl] How to make this script faster
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Thu, 15 Nov 2007 23:06:16 +0100
Hi Mathieu,

From just looking at your stylesheet, I noticed a couple of things, but I don't know whether changing them will make it faster. You didn't specify which processor you use. If you use AltovaXML, it can at times be extremely slow (exponential running time), and it is worthwhile to try your code with a more optimized processor like Saxon.

From the code I notice that you use XSLT 2.0, which can usually be more easily optimized than XSLT 1.0, both in code (tail recursion and using "as" attributes to specify result types) and in the processor, because the language allows for easier optimizations of common tasks (like regular expressions instead of recursive templates).

But you still seem to use a lot of XSLT 1.0 techniques where I would prefer the 2.0 version. Consider putting your xsl:call-template (named templates) in an xsl:function (even recursively). Consider using if(value) then .... else ... instead of xsl:if or xsl:choose. Consider using matching templates instead of xsl:when etc, which may perform faster.
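To make the xsl:function and if/then/else suggestions concrete, here is a minimal sketch, not a drop-in replacement. The my: namespace and the function names my:is-heading and my:section-body are made up for illustration; the regular expression and the extract-section parameter are taken from your stylesheet. It still walks the siblings one at a time, so it only shows the 2.0 syntax, and unlike your template it does not check the element names before continuing.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="urn:example:my-functions"
    exclude-result-prefixes="xs my">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:param name="extract-section" select="'C.2.2'"/>

  <!-- hypothetical helper: true if the text starts with a section number such as "C.2.1 " -->
  <xsl:function name="my:is-heading" as="xs:boolean">
    <xsl:param name="node" as="element()"/>
    <xsl:sequence select="matches(normalize-space($node),
        '^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ ')"/>
  </xsl:function>

  <!-- hypothetical helper: the leading elements of $nodes, up to the next heading -->
  <xsl:function name="my:section-body" as="element()*">
    <xsl:param name="nodes" as="element()*"/>
    <xsl:variable name="first" select="$nodes[1]"/>
    <xsl:sequence select="
        if (empty($first)) then ()
        else if (my:is-heading($first)) then ()
        else ($first, my:section-body(subsequence($nodes, 2)))"/>
  </xsl:function>

  <xsl:template match="/article">
    <xsl:variable name="anchor"
        select="para[starts-with(normalize-space(.), concat($extract-section, ' '))]"/>
    <section ref="{$extract-section}">
      <xsl:for-each select="my:section-body($anchor[1]/following-sibling::*)">
        <!-- XPath conditional instead of xsl:choose to pick the output element name -->
        <xsl:element name="{if (self::informaltable) then 'table' else 'text'}">
          <xsl:value-of select="."/>
        </xsl:element>
      </xsl:for-each>
    </section>
  </xsl:template>

</xsl:stylesheet>
```

The "as" attributes give the processor type information it can use, and the nested if/then/else guarantees that my:is-heading is never called on an empty sequence.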

But your main performance penalty lies in passing the following-sibling axis along and walking it one node at a time. You can do the same trick with matching templates alone, and you are probably better off using keys to optimize performance, or introducing an xsl:for-each or an xsl:for-each-group. Anything is better than the recursive named template.
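As a rough sketch of the xsl:for-each-group route (untested against your real data, so treat it as an assumption-laden starting point): group-starting-with opens a new group at every heading paragraph, so each element is visited once and all sections come out in a single pass. The wrapper element name "sections" is my own invention, the output is flat rather than nested, and the sketch assumes the article starts with a heading, as your test.xml does.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <xsl:template match="/article">
    <!-- "sections" is a hypothetical wrapper element -->
    <sections>
      <!-- every heading para starts a new group (same regex as the original) -->
      <xsl:for-each-group select="*"
          group-starting-with="para[matches(normalize-space(.), '^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ ')]">
        <!-- the context item is the first element of the group, i.e. the heading -->
        <xsl:variable name="title" select="normalize-space(.)"/>
        <section ref="{substring-before($title, ' ')}"
                 name="{substring-after($title, ' ')}">
          <!-- everything after the heading belongs to this section -->
          <xsl:apply-templates select="subsequence(current-group(), 2)"/>
        </section>
      </xsl:for-each-group>
    </sections>
  </xsl:template>

  <xsl:template match="para">
    <text><xsl:value-of select="."/></text>
  </xsl:template>

  <xsl:template match="informaltable">
    <table><xsl:value-of select="."/></table>
  </xsl:template>

</xsl:stylesheet>
```

If you still want a single section, you could keep the extract-section parameter and wrap the section element in a test on the computed ref value.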

If that does not improve things, you should have a look at the backtracking your regular expression may cause. The regular expression engine used by Saxon is the same as the one from Java, and it performs quite badly on catastrophic backtracking (nested quantifiers of the form (x+)+). I haven't looked into it enough, but if you can rewrite the regex for less backtracking, optimize it to match the most common situation first, or even split it over a doubly nested xsl:analyze-string (awkward, I know, but hey, you are optimizing for speed), then you may gain a lot of speed.
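To illustrate what "less backtracking" could mean for your pattern: in [1-9]+[0-9]? and [1-9]?[0-9]+ the same digits can be divided between the two parts in more than one way, so a failing match gets retried with each division. The sketch below rewrites both parts so that every digit can only be consumed in one way; as far as I can tell it accepts exactly the same strings, but whether it is measurably faster is an assumption you would have to test. The variable names are mine. Run against your test.xml, it prints both results per element so you can check that they agree.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- the heading pattern from the original stylesheet -->
  <xsl:variable name="original"
      select="'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '"/>

  <!-- sketch of a rewrite: [1-9]+[0-9]? becomes [1-9]+0? and
       [1-9]?[0-9]+ becomes [0-9]+, removing the overlapping choices -->
  <xsl:variable name="rewritten"
      select="'^([A-F]|[1-9]+0?)(\.[0-9]+)+ '"/>

  <xsl:template match="/article">
    <xsl:for-each select="*">
      <xsl:variable name="s" select="normalize-space(.)"/>
      <!-- one line per element: the text and the result of each pattern -->
      <xsl:value-of select="concat($s,
          ': original=', matches($s, $original),
          ' rewritten=', matches($s, $rewritten), '&#10;')"/>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```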

It is hard to predict the behavior of a regular expression. I once made a very simple regular expression for matching CSV records that took exponential time when the overall match for the CSV line failed (i.e., on an unpaired quote). This regex took about 1.5 hours for a string of 60 characters, and the time doubled for every 3 extra characters (the regex is somewhere on the Saxon list)! Rewriting it for less backtracking brought the running time down to linear.

If the regex is indeed the problem (test it with something straightforward), then I suggest you read the chapter on regex optimization in Jeffrey Friedl's now famous book on regular expressions.

HTH,
Cheers,
-- Abel Braaksma

PS: not all hints above will necessarily or predictably improve performance
PPS: you do not need the namespace for the XPath functions, after all, for some functions you do use the fn: prefix, for others you don't... You can just leave it out.



Mathieu Malaterre wrote:
Hi there,

  I have a working version of an XSLT script:
http://gdcm.svn.sourceforge.net/viewvc/gdcm/Sandbox/xslt/2/

See (*) and (**). What I would like to do is :

1. Be able to run the xslt in one pass. For now I have to run it with
<xsl:param name="extract-section" select="'C.1'"/>
then edit test.xsl file, comment the line and uncomment:
<xsl:param name="extract-section" select="'C.2'"/>
and so on and so forth...

2. This script is seriously *slow*. I guess running it in one pass
should solve most of the issue, but if there was something obvious I
was missing... thanks !

-Mathieu

(*)
$ cat test.xml
<?xml version="1.0"?>
<article>
  <para>C.1 Title 1</para>
  <para>info for section C.1</para>
  <informaltable>table1</informaltable>
  <para>C.2 Title 2</para>
  <informaltable>table2</informaltable>
  <para>info for section C.2</para>
  <para>C.2.1 Title 2.1</para>
  <para>text for section C.2.1</para>
  <para>text for section C.2.1 again</para>
  <para>C.2.2 Tile 2.2</para>
  <informaltable>table for 2.2</informaltable>
  <para>text for section C.2.2</para>
</article>

(**)
$ cat test.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:fn="http://www.w3.org/2005/xpath-functions" version="2.0">

<!-- GENERAL -->

<xsl:output method="xml" indent="yes" encoding="UTF-8"/>

<!-- number of the sample section to be extracted -->
<!--xsl:param name="extract-section" select="'C.1'"/-->
<!--xsl:param name="extract-section" select="'C.2'"/-->
<!--xsl:param name="extract-section" select="'C.2.1'"/-->
<xsl:param name="extract-section" select="'C.2.2'"/>


<xsl:template match="para"> <text> <xsl:value-of select="concat(.,'&#10;')"/> </text> </xsl:template>

<xsl:template match="informaltable">
<table>
<xsl:value-of select="concat(.,'&#10;')"/>
</table>
</xsl:template>

<!-- MAIN -->

<xsl:template match="/article">
  <xsl:variable name="section-number" select="concat($extract-section,' ')"/>
  <xsl:variable name="section-anchor"
select="para[starts-with(normalize-space(.),$section-number)]"/>
  <xsl:variable name="section-name"
select="substring-after(para[starts-with(normalize-space(.),$section-number)],$extract-section)"/>
  <xsl:choose>
    <xsl:when test="count($section-anchor)=1">
      <xsl:message>Info: section <xsl:value-of
select="$extract-section"/> found</xsl:message>
      <xsl:element name="section">
        <xsl:attribute name="ref" select="$extract-section"/>
        <xsl:attribute name="name" select="normalize-space($section-name)"/>
        <xsl:call-template name="copy-section-paragraphs">
          <xsl:with-param name="section-paragraphs"
select="$section-anchor/following-sibling::*"/>
        </xsl:call-template>
      </xsl:element>
      <xsl:message>Info: all paragraphs extracted</xsl:message>
    </xsl:when>
    <xsl:when test="count($section-anchor)>1">
      <xsl:message>Error: section <xsl:value-of
select="$extract-section"/> found multiple times!</xsl:message>
    </xsl:when>
    <xsl:otherwise>
      <xsl:message>Error: section <xsl:value-of
select="$extract-section"/> not found!</xsl:message>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

<!-- TEMPLATES -->

<xsl:template name="copy-section-paragraphs">
  <xsl:param name="section-paragraphs"/>
  <xsl:variable name="current-paragraph" select="$section-paragraphs[1]"/>
  <!-- search for next section title -->
  <xsl:if test="($current-paragraph[name()='para' or
name()='informaltable']) and
not(fn:matches(normalize-space($current-paragraph),'^([A-F]|[1-9]+[0-9]?)(\.[1-9]?[0-9]+)+ '))">
    <!-- output current paragraph (close with a newline) -->
    <xsl:apply-templates select="$current-paragraph"/>
    <xsl:call-template name="copy-section-paragraphs">
      <xsl:with-param name="section-paragraphs"
select="$section-paragraphs[position()>1]"/>
    </xsl:call-template>
  </xsl:if>
</xsl:template>

</xsl:stylesheet>
