Hi Xslt'ers,
I happen to run into an XSLT 2 behavior that I am uncertain of if it is 
desired or expected behavior of implementations, if it is undesired, but 
still expected, or if it should be prohibited.
Let me explain. Backtracking is a way that most regular expression 
engines work with, where the engine tries to do a partial match on a 
largest possible string and then "tracks back" to make the whole string 
matching. But let me skip the details of this process...
Suppose you have a regular expression that does not match the target 
string. With backtracking, unmatching strings take the longest 
processing time, which is normal and desirable for several reasons. What 
happens is that the engine tracks back some positions on the string and 
tries to apply the regex again. The number of positions to step back is 
roughly equivalent to the smallest possible match.
Now, suppose this smallest possible match is zero length. In this 
situation, with classic regex parsers, the engine will try forever on a 
non-matching string (or at least very long). In XSLT, this behavior 
seems to happen even when the "smallest possible match" is of non-zero 
length.
In XSLT this is shown as follows, and I use the example of matching a 
CSV quoted string which may escape the quote by doubling it:
Good examples:
"well quoted csv string"
"well ""quoted"" csv string
Bad example:
not well "quoted csv string
The regular expression to match a quoted CSV string is
"([^"]+|"")*"
This gives trouble in the following XSLT (showing a non-matching 
string), which runs for a very long time (exponential performance chart)
<xsl:variable name="ltr">
   before " after and some more text after
</xsl:variable>
<xsl:variable name="regex">"([^"]+|"")*"</xsl:variable>
<xsl:analyze-string select="$ltr" regex="{$regex}">
   <xsl:matching-substring>
      <found><xsl:value-of select="." /></found>
   </xsl:matching-substring>
   <xsl:non-matching-substring>
       <not-found><xsl:value-of select="." /></not-found>
   </xsl:non-matching-substring>
</xsl:analyze-string>
The above example ran for > 10 minutes before I cancelled it (it did not 
give any problem with matching strings, but did give problems with 
strings that have a large non-matching part with a quote in it). A 
simpler regex (but less useful) that shows the same behavior: "([^"]+)*"
I am under the impression that such behavior is not desirable, but I am 
unsure if there is anything in the specs that says something about how 
implementations should/must deal with this. As a comparison, I tried the 
example with Perl, which gave no noticeable performance troubles.
I would like to add as a side note the danger of repeating empty 
matching regular expressions. If I were to write the above regular 
expression as follows:
([^"]*|("")*)*
then the subexpression [^"]*|("")* can match an empty string. Repeating 
that indefinitely (* quantifier), causes regex engines to lock up (but 
with Perl: no problem). This is a known problem with regexes and with 
careful regex crafting it need not happen.
However, in the example above, I use a subexpression that never matches 
an empty string and as such should not happen to fall into the 
eternal-loop problem.
I use Saxon 8.8 most of the time, with Java 1.5. I tried with Altova 
2006, but it does not allow xsl:analyze-string...
Note that it may or may not be possible to optimize the regex in such a 
way that it fails earlier on the given string. But my problem is with 
the (imo) unpredictability of the performance with any 
less-then-very-trivial regex.
Cheers,
-- Abel Braaksma
  http://www.nuntia.nl