[xsl] Backtracking and eternal loops caused by regular expressions matching: what to expect from implementations?

Subject: [xsl] Backtracking and eternal loops caused by regular expressions matching: what to expect from implementations?
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Tue, 23 Jan 2007 15:05:32 +0100
Hi Xslt'ers,

I happen to run into an XSLT 2 behavior that I am uncertain of if it is desired or expected behavior of implementations, if it is undesired, but still expected, or if it should be prohibited.

Let me explain. Backtracking is a way that most regular expression engines work with, where the engine tries to do a partial match on a largest possible string and then "tracks back" to make the whole string matching. But let me skip the details of this process...

Suppose you have a regular expression that does not match the target string. With backtracking, unmatching strings take the longest processing time, which is normal and desirable for several reasons. What happens is that the engine tracks back some positions on the string and tries to apply the regex again. The number of positions to step back is roughly equivalent to the smallest possible match.

Now, suppose this smallest possible match is zero length. In this situation, with classic regex parsers, the engine will try forever on a non-matching string (or at least very long). In XSLT, this behavior seems to happen even when the "smallest possible match" is of non-zero length.

In XSLT this is shown as follows, and I use the example of matching a CSV quoted string which may escape the quote by doubling it:
Good examples:
"well quoted csv string"
"well ""quoted"" csv string


Bad example:
not well "quoted csv string

The regular expression to match a quoted CSV string is
"([^"]+|"")*"

This gives trouble in the following XSLT (showing a non-matching string), which runs for a very long time (exponential performance chart)

<xsl:variable name="ltr">
   before " after and some more text after
</xsl:variable>
<xsl:variable name="regex">"([^"]+|"")*"</xsl:variable>

<xsl:analyze-string select="$ltr" regex="{$regex}">
   <xsl:matching-substring>
      <found><xsl:value-of select="." /></found>
   </xsl:matching-substring>
   <xsl:non-matching-substring>
       <not-found><xsl:value-of select="." /></not-found>
   </xsl:non-matching-substring>
</xsl:analyze-string>


The above example ran for > 10 minutes before I cancelled it (it did not give any problem with matching strings, but did give problems with strings that have a large non-matching part with a quote in it). A simpler regex (but less useful) that shows the same behavior: "([^"]+)*"


I am under the impression that such behavior is not desirable, but I am unsure if there is anything in the specs that says something about how implementations should/must deal with this. As a comparison, I tried the example with Perl, which gave no noticeable performance troubles.

I would like to add as a side note the danger of repeating empty matching regular expressions. If I were to write the above regular expression as follows:

([^"]*|("")*)*

then the subexpression [^"]*|("")* can match an empty string. Repeating that indefinitely (* quantifier), causes regex engines to lock up (but with Perl: no problem). This is a known problem with regexes and with careful regex crafting it need not happen.

However, in the example above, I use a subexpression that never matches an empty string and as such should not happen to fall into the eternal-loop problem.

I use Saxon 8.8 most of the time, with Java 1.5. I tried with Altova 2006, but it does not allow xsl:analyze-string...

Note that it may or may not be possible to optimize the regex in such a way that it fails earlier on the given string. But my problem is with the (imo) unpredictability of the performance with any less-then-very-trivial regex.



Cheers,
-- Abel Braaksma
  http://www.nuntia.nl

Current Thread