Re: [xsl] check the type of the $pattern argument to a regular expression?

Subject: Re: [xsl] check the type of the $pattern argument to a regular expression?
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Tue, 17 Apr 2007 01:36:08 +0200
Dimitre Novatchev wrote:
That it can't be
done using a regular expression is one thing, that it can't be done
with XSL-T 2.0 is incorrect, given that all one would need to check
the validity of regular expressions is a turing complete language.


In fact,  have a general purpose LR(1) parser written cmpletely in
XSLT 2.0, which will parse any input and return the sequence of
poduction used in the parse.

Provide any LR-1 grammar written in BNF and you'll get the checker.


I am unfamiliar with the term LR-1, and likely this is not what it should be, it is just a poor man's attempt to solve the OP's original inquiry: checking the validity of a regular expression. Just by glimpsing over my own solution, I see plenty of parts for improvement. But I didn't want to make it a week-long assignment, so consider it a start for testing the most common mistakes in your (Bryan's) regular expressions.


As a guide, I used the W3 XML Schema specification and the XSLT 2.0 / XPath extensions. The solution the is laid out below supports the following:

* Correct use of escapes
* Correct use of parentheses and escaped parenteses
* Escaped meta characters and correct pairs of unescaped ones (i.e., {..} and [...])
* Fully supports character classes, except for the unicode categories
* Supports backreferences, but does not check for validity of them
* Supports 'matches', but not 'replace/tokenize' (i.e., empty-matching is not considered wrong)
* Supports all sorts of greedy/reluctant quantifiers and errors if specified incorrectly


There's more to it, but I leave that as an exercise to the reader. In a nutshell, if you want to apply it, be aware that it checks the validity of the _syntax_, not the overall validity of the regular expression. That means that ' [z-a] ' validates correct, and ' (.)\2 ' will validate correct as well, but neither will compile.

Run the stylesheet against anything, or with IT "main", and it will give you the following example output:

This valid regex ' (a(\(b(c+?.d)))+ ' is asserted as:  valid
This valid regex ' ((((((.).).)+?))+) ' is asserted as:  valid
This valid regex ' a{1,12}b\{c?? ' is asserted as:  valid
This valid regex ' ([^a-z123\n]+?)\1 ' is asserted as:  valid
This invalid regex ' (a(b(c)) ' is asserted as:  invalid
This invalid regex ' [abe\l] ' is asserted as:  invalid
This invalid regex ' (a|b|c|d)\\\ ' is asserted as:  invalid
This invalid regex ' .++ ' is asserted as:  invalid


All in all, the build-up of the regex to check the regex is quite straight-forward. The only trouble is in the recursive part (which could do with some better writing). Call it as
re:is-regex(....).


Have fun with it. If you plan to add the category escapes to the list of supported features, let me know, please ;)

Cheers,
-- Abel Braaksma

PS: it _can_ be done in regular expressions alone, but then you would loose some of the lexical analysis (paired grouping parentheses, mainly).


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet
xmlns:xs = "http://www.w3.org/2001/XMLSchema";
xmlns:xsl="http://www.w3.org/1999/XSL/Transform";
xmlns:re = "http://regex";
version="2.0">
<xsl:output method="text" />
<xsl:variable name="test-re">
<re assert="valid">(a(\(b(c+?.d)))+</re>
<re assert="valid">((((((.).).)+?))+)</re>
<re assert="valid">a{1,12}b\{c??</re>
<re assert="valid">([^a-z123\n]+?)\1</re>
<re assert="invalid">(a(b(c))</re>
<re assert="invalid">[abe\l]</re>
<re assert="invalid">(a|b|c|d)\\\</re>
<re assert="invalid">.++</re>
</xsl:variable>
<xsl:template match="/" name="main">
<xsl:apply-templates select="$test-re/*" />
</xsl:template>
<xsl:template match="re">
<xsl:value-of select="
'This', @assert, 'regex ''',
., ''' is asserted as: ',
('invalid', 'valid')[1 + xs:integer(re:is-regex(current()))],
'&#xA;'" />
</xsl:template>
<!--
zero-or-one, zero-or-more, one-or-more etc
(global because we need it in two functions)
-->
<xsl:variable name="quantifier" as="xs:string">
(
[?*+] <!-- q. mark, asterisk, plus sign -->
| <!-- or -->
\{[0-9]+,?[0-9]*\} <!-- {.. , ..} quantifier -->
)\?? <!-- optional reluctant quantifier -->
</xsl:variable>
<!--
Test a regex. It is first 'normalized' by removing any
whitespace. This may inadvertently make a valid regex
invalid if an escaped whitespace is part of the regex.
-->
<xsl:function name="re:test-partial-regex" as="xs:boolean">
<xsl:param name="re" as="xs:string" />
<!-- a single char, not a meta char -->
<xsl:variable name="char">
[^.\\?*+{}()|^$\[\]]
</xsl:variable>
<xsl:variable name="single-char-escape">
\\[nrt\\|.?*+(){}^\[\]$\-]
</xsl:variable>
<xsl:variable name="multi-char-escape">
(\\[sSiIcCdDwW] | \. | \\[0-9]+)
</xsl:variable>
<xsl:variable name="char-class-escape">
<xsl:text>(</xsl:text>
<xsl:sequence select="$single-char-escape" />
<xsl:text>|</xsl:text>
<xsl:sequence select="$multi-char-escape" />
<xsl:text>)</xsl:text>
</xsl:variable>
<!-- a single meta char -->
<xsl:variable name="meta-char">
[nrt\\|.?*+(){}^\[\]$\sSiIcCdDwW-]
</xsl:variable>
<!-- a char that is allowed inside a character class expression -->
<xsl:variable name="xml-char">
[^\\\[\]\-]
</xsl:variable>
<!-- base 'char' of a character class expression -->
<xsl:variable name="char-or-esc">
<xsl:text>(</xsl:text>
<xsl:sequence select="$xml-char" />
<xsl:text>|</xsl:text>
<xsl:sequence select="$single-char-escape" />
<xsl:text>)</xsl:text>
</xsl:variable>
<!-- range in a character class expression -->
<xsl:variable name="char-range">
<xsl:text></xsl:text>
<xsl:sequence select="$char-or-esc" />
<xsl:text>-</xsl:text>
<xsl:sequence select="$char-or-esc" />
</xsl:variable>
<!-- a character class expression -->
<!-- all that can be between [ and ] -->
<xsl:variable name="char-class-expr">
\[
(\^|-)?
<xsl:text>(</xsl:text>
<xsl:sequence select="$char-or-esc" />
<xsl:text>|</xsl:text>
<xsl:sequence select="$char-range" /> <xsl:text>)+</xsl:text>
\]
</xsl:variable>
<!-- a character class is an escaped meta char, a char class expr or a wildcard expr -->
<xsl:variable name="char-class">
<xsl:text>(</xsl:text>
<xsl:sequence select="$char-class-escape" />
<xsl:text>|</xsl:text>
<xsl:sequence select="$char-class-expr" />
<xsl:text>)</xsl:text>
</xsl:variable>
<!-- a char or an escaped meta char -->
<xsl:variable name="atom" as="xs:string+">
<xsl:text>(</xsl:text>
<xsl:sequence select="$char" />
<xsl:text>|</xsl:text>
<xsl:sequence select="$char-class" />
<xsl:text>)</xsl:text>
</xsl:variable>
<!-- a piece has an atom and optional quantifier -->
<xsl:variable name="piece">
<xsl:text>(</xsl:text>
<xsl:sequence select="$atom" />
<xsl:text>)</xsl:text>
<xsl:text>(</xsl:text>
<xsl:sequence select="$quantifier" />
<xsl:text>)?</xsl:text>
</xsl:variable>
<!-- a branch has zero or more pieces
(here: 1 or more, for optimizing the matching) -->
<xsl:variable name="branch">
<xsl:text>(</xsl:text>
<xsl:sequence select="$piece" />
<xsl:text>)+</xsl:text> </xsl:variable>


<!-- a regexp has one or more branches -->
<xsl:variable name="regexp" >
<xsl:text>^$|^</xsl:text> <xsl:sequence select="$branch" />
<xsl:text>(\|</xsl:text>
<xsl:sequence select="$branch" />
<xsl:text>)*</xsl:text>
<xsl:text>$</xsl:text> </xsl:variable>
<xsl:sequence select="matches($re, $regexp, 'x')" />
</xsl:function>
<xsl:function name="re:is-regex" as="xs:boolean">
<xsl:param name="re" as="xs:string" />
<!--
variable with xs:boolean and xs:string results, the first
contains results of inner regexes (inside parentheses) the
second the remaining parenthesized string, which will
recursively be re-applied
-->
<xsl:variable name="result-seq" as="xs:anyAtomicType*">
<!-- remove the escaped parentheses before applying regex -->
<xsl:analyze-string select="replace($re, '\\\(|\\\)', '')" regex="\(([^()]*)\)({$quantifier})?" flags="x">
<xsl:matching-substring>
<!-- test this deepest level -->
<xsl:sequence select="re:test-partial-regex(regex-group(1))" />
</xsl:matching-substring>
<xsl:non-matching-substring>
<xsl:sequence select="xs:string(.)" />
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:variable>
<xsl:sequence select="
(: if a non-regex partial match is found, exit early :)
if(some $b in $result-seq satisfies $b instance of xs:boolean and xs:boolean($b) = false())
then false()
else
(: last string, no parentheses anymore :)
if(count($result-seq) = 1 and $result-seq[1] instance of xs:string)
then re:test-partial-regex($result-seq[1])
else
(: more sub regexes available still, recursively apply :)
if($result-seq[. instance of xs:string][1])
then re:is-regex(string-join($result-seq[not(. instance of xs:boolean)], '')) else every $b in $result-seq satisfies xs:boolean($b) = true()" />
</xsl:function>
</xsl:stylesheet>


Current Thread