Re: [xsl] Comments in XPath / XSLT regular expressions?

Subject: Re: [xsl] Comments in XPath / XSLT regular expressions?
From: Abel Braaksma Online <abel.online@xxxxxxxxx>
Date: Wed, 26 Jul 2006 11:44:15 +0200

Hi Michael,

Thanks for your advice on this. I now understand the reasons for not allowing newline-terminated comments. After reading Colin Adams' comments, and Frans Englich's, I came up with the following as a best practice for ourselves:

<xsl:variable name="re-extract-filename" >
   ^.*?              <!-- non-greedy: grab everything -->
   ([^/\\]+)         <!-- filename in $1  -->
   \.                <!-- extension-dot  -->
   [^\.]*$           <!-- extension (not-a-dot*)  -->
</xsl:variable>
<xsl:value-of select="replace(., $re-extract-filename, '$1.xml', 'x')" />

I see several benefits to this approach:
1. It allows for dissecting the regex into smaller parts
2. The comments are understood by most XML/XSLT editors (smileys are not)
3. You have complete freedom when it comes to whitespace
4. Using a good variable name works like a function: it tells what the regex does
5. One may create a lib of time-tested regexes.
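
As a rough sketch of point 5, such a library could be an included stylesheet module that holds nothing but named regex variables. The module name regex-lib.xsl is made up for illustration; only the variable itself comes from the example above.

```xml
<!-- regex-lib.xsl (hypothetical file name): shared, commented regexes -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Must be used with the 'x' flag, which strips the layout whitespace -->
  <xsl:variable name="re-extract-filename">
    ^.*?              <!-- non-greedy: grab everything -->
    ([^/\\]+)         <!-- filename in $1 -->
    \.                <!-- extension-dot -->
    [^\.]*$           <!-- extension (not-a-dot*) -->
  </xsl:variable>

</xsl:stylesheet>
```

A stylesheet would then pull the module in with an xsl:include whose href points at regex-lib.xsl, and call replace(., $re-extract-filename, '$1.xml', 'x') exactly as before.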


Michael, on your comment about readability, I'd like to add that I agree that regexes are hard to read. Even harder to learn and perhaps hardest to really master. I always recommend Jeffrey Friedl's excellent book to my programmers.

Unfortunately, it is often impossible to simply cut a regular expression into several steps. And I don't agree that adding more steps necessarily adds to readability (sometimes it does, when the steps are clear, I guess, but what if there are no logical steps?). Finally, I think that even the simplest regular expression can be hard to read when not commented, and the simplest are often quite long already.

The above (non-foolproof) "simple" regex could be dissected into steps like: removeprotocol, removesite, removeport, removepath, removeextension. But that would add a lot of verbosity to the XSLT / XPath. My men are not very experienced when it comes to regexes. They find it very hard to understand the flaws of the above regex. Using comments, they grasp the idea a lot better.

In terms of performance, I think (but am not sure) that a single well-crafted regex is often a lot quicker and less resource-intensive than several separate steps. Yet, a good understanding of how regex parsers work is a necessity.

Cheers,
Abel





Michael Kay wrote:

As I see it, XPath 2.0 has that flag too. See XQuery 1.0 and XPath 2.0 Functions and Operators, section 7.6.1.1 Flags:


Yes, but in XPath the "x" flag does not enable comments. This is because the Perl comment syntax uses newline to mark the end of a comment. In XSLT, regular expressions will often appear inside XML attributes, where newlines get normalized to spaces by an XML parser, so we've always adopted the view that the grammar should never treat newlines differently from spaces.
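
A small sketch of what this means in practice, assuming an XSLT 2.0 processor: the "x" flag only strips whitespace from the pattern, so a Perl-style # comment is left in as literal characters. The test strings are invented for illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Whitespace in the pattern is removed before matching: true -->
    <xsl:value-of select="matches('helloworld', 'hello world', 'x')"/>
    <xsl:text> </xsl:text>
    <!-- '#' stays literal, so the pattern is 'a#commentb': false -->
    <xsl:value-of select="matches('ab', 'a # comment b', 'x')"/>
  </xsl:template>

</xsl:stylesheet>
```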

My own advice is to avoid using regular expressions that are so complex that
they need comments to explain them. If you need to explain them, then it's
also going to be very hard to debug them, and if you hit performance
problems it will be very difficult to analyze the problems. If you can,
break up the task into separate stages, each defined by simpler regular
expressions.
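
To make the staged approach concrete, here is a minimal sketch for the filename example discussed in this thread. The variable names and the matched element are invented for illustration, and the stages are not claimed to behave identically to the single regex on every input (a path with no dot at all, for example, still gets ".xml" appended here).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- "link" is a hypothetical element holding a path or URL -->
  <xsl:template match="link">
    <!-- Stage 1: drop everything up to the last slash or backslash -->
    <xsl:variable name="name" select="replace(., '^.*[/\\]', '')"/>
    <!-- Stage 2: drop the last dot and whatever follows it -->
    <xsl:variable name="stem" select="replace($name, '\.[^.]*$', '')"/>
    <!-- Stage 3: append the new extension -->
    <xsl:value-of select="concat($stem, '.xml')"/>
  </xsl:template>

</xsl:stylesheet>
```

Each stage can be tested on its own, at the cost of more markup than the one-line replace().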

Another approach to commenting, however, is like this:

<xsl:variable name="x" select="replace(., '^.*?([^/\\]+)\.[^\.]*$', '$1.xml')"/>
<!-- ^non-greedy: grab everything
^the last part of the path: does not contain (back)slashes. Grab it to $1
^the dot separating the extension from the filename
^not-a-dot until end of string, this is the extension
-->


or if you prefer:

<xsl:variable name="x" select="replace(., '^.*?([^/\\]+)\.[^\.]*$', '$1.xml')"/>
<!-- .*? non-greedy: grab everything
([^/\\]+) the last part of the path: does not contain (back)slashes. Grab it to $1
\. the dot separating the extension from the filename
[^\.]* not-a-dot until end of string, this is the extension
-->



(Sorry if the mailer mangles this!)


Michael Kay
http://www.saxonica.com/
