Re: [xsl] {} quantifiers in regex

Subject: Re: [xsl] {} quantifiers in regex
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Sun, 13 Jan 2008 02:24:47 +0100
Geert Bormans wrote:

If I change it to this (removing \d{2} in favour of \d\d)

[...]
it works

Am I overlooking something?

The regex attribute of xsl:analyze-string is an AVT (attribute value template). Accolades (curly braces) have a special meaning in both an AVT and a regular expression, so to use an accolade in an AVT without it being interpreted as the start or end of an XPath expression, you have to double it. Because accolades are used often in regexes and because their content is usually a number, which is itself a valid XPath expression, the result is not an illegal AVT:


\d{2}

is interpreted as the regular expression:

\d2

which will quite likely match now and then, but not where you want it to. The resulting behavior looks just like a buggy regular expression parser, while in fact it is the expression itself that is buggy... ;)
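For completeness, here is a minimal sketch of the straightforward fix, doubling the accolades inside the attribute. The template name "main" and the input string are my own choices for illustration; they are not from Geert's stylesheet.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical entry point: run with "main" as the initial template -->
  <xsl:template name="main">
    <!-- {{ and }} are AVT escapes, so the regex actually used is \d{2} -->
    <xsl:analyze-string select="'2008-01-13'" regex="\d{{2}}">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <xsl:text>&#10;</xsl:text>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With this input, the matching substrings should be 20, 08, 01 and 13. If you write regex="\d{2}" instead, the regex becomes \d2 and nothing in this string matches.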

Because I used to make this mistake often (and because escaped quotes and doubled accolades look ugly), I started to put the regular expression into a variable in all but the most trivial cases. The added benefit of this is that you can now use comments in a regular expression:

<xsl:variable name="regex" as="xs:string">
    \d      <!-- a digit -->
    {2}     <!-- must occur twice and only twice -->
</xsl:variable>
<xsl:analyze-string regex="{$regex}" flags="x">
  ...
</xsl:analyze-string>

Note the use of the 'x' modifier, which is necessary here: without it, the whitespace and line breaks inside the variable would be part of the regex. Regular expressions tend to be the most unreadable of existing mini-languages, so comments and whitespace are often very welcome. The as="xs:string" is there because we need a string, not the document node (temporary tree) that the variable would otherwise hold.
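As a slightly larger sketch of the same technique, the stylesheet below puts a commented date pattern in a variable and uses regex-group() on the result. The variable name, the pattern and the input string are assumptions of mine, chosen only to show the idea end to end. Note that accolades are not doubled inside the variable, because its content is not an AVT.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="text"/>

  <!-- Comments are stripped from the stylesheet, so only the text remains -->
  <xsl:variable name="date-regex" as="xs:string">
      (\d{4})   <!-- group 1: year -->
      -
      (\d{2})   <!-- group 2: month -->
      -
      (\d{2})   <!-- group 3: day -->
  </xsl:variable>

  <!-- Hypothetical entry point: run with "main" as the initial template -->
  <xsl:template name="main">
    <!-- flags="x" removes the whitespace left over in the variable -->
    <xsl:analyze-string select="'2008-01-13'" regex="{$date-regex}" flags="x">
      <xsl:matching-substring>
        <xsl:value-of
            select="regex-group(3), regex-group(2), regex-group(1)"
            separator="/"/>
      </xsl:matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

For this input the output should be 13/01/2008.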

For the fun of it and to complete this little story, note that in the world of obfuscation a lot is possible, if you set your mind to it. If you want it and you like fun code, you *can* put comments inside a regular expression (but only inside an AVT) using the following, imo rather silly construction:

<xsl:analyze-string flags="x" regex="
      \d         {()(: a digit :)}
      {{2}}   {()(: must occur twice and only twice :)}">

The () is there because an XPath expression cannot be empty. The (: and :) are, of course, the comment delimiters for an XPath 2.0 expression. I don't know about others' opinions on this, but from my point of view, this doesn't add much to readability, so I still prefer the "best practice" of putting the regex in a variable (what helps that decision is that some XSLT 2.0 processors do not allow the smiley comments).

Cheers,
-- Abel Braaksma
