Re: [xsl] A better xsl:analyze-string

Subject: Re: [xsl] A better xsl:analyze-string
From: Pavel Minaev <int19h@xxxxxxxxx>
Date: Thu, 20 Aug 2009 14:59:11 -0700
On Thu, Aug 20, 2009 at 2:39 PM, Michael Kay<mike@xxxxxxxxxxxx> wrote:
> It's true that using regex-group() is a pretty messy mechanism, and it would
> be nice to do better.
>
> Things get a bit more complicated if there are groups that can match more
> than once, of the form (...)*. It's not clear how that would work with your
> proposed syntax.
>
> One of the constraints is that we want to ensure that the facilities can be
> implemented on top of popular regex libraries such as those used by Java,
> C#, or Perl. These are all very heavily based on the concept of numbered
> captured groups, with all their quirks.

I'm not sure I understand. How would having (...)* affect the syntax
or semantics in any way?

The straightforward implementation of this that I imagine is a simple
rewrite. If we have matching-substring instructions for regexes rx1,
rx2, ..., rxN, the implementation rewrites them as a single regex:

  (rx1)|(rx2)|...|(rxN)

and then counts the parentheses to determine the group number of each
of the original regexes. From there it's a trivial rewrite to
choose/when form. Counting parentheses is sufficient per the spec for
the regex-group() function:

"The Nth captured substring (where N > 0) is the string matched by the
subexpression contained by the Nth left parenthesis in the regex."

So any quantifiers on groups shouldn't affect this. It would, of
course, also have to correct the group number for any user call to
regex-group() from within matching-substring, but that is similarly
trivial.
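
To make the rewrite concrete, here is a minimal sketch in Python rather
than XSLT, using Python's re module as a stand-in for whatever regex
library a processor builds on. The names combine() and analyze() are
made up for illustration, and Python regex syntax is not identical to
XPath regex syntax. Because Python also has non-capturing groups, the
sketch asks the compiled pattern for its group count instead of
literally counting left parentheses. Back-references inside a branch
would need the same renumbering and are not handled here.

```python
import re

def combine(regexes):
    """Build (rx1)|(rx2)|...|(rxN) and note where each branch's groups start."""
    parts, bases, counts = [], [], []
    next_group = 1
    for rx in regexes:
        inner = re.compile(rx).groups   # capturing groups inside this branch
        bases.append(next_group)        # number of the wrapper group
        counts.append(inner)
        parts.append("(" + rx + ")")
        next_group += 1 + inner
    return re.compile("|".join(parts)), bases, counts

def analyze(text, regexes):
    """Yield (branch index, local groups) for each match, like choose/when."""
    combined, bases, counts = combine(regexes)
    for m in combined.finditer(text):
        for branch, base in enumerate(bases):
            if m.group(base) is not None:
                # local regex-group(n) is combined group base + n
                yield branch, [m.group(base + n) for n in range(counts[branch] + 1)]
                break

rules = [r"(\d+)\.(\d+)", r"[A-Za-z_]+", r"(\+|-)"]
results = list(analyze("ab 12.5+x", rules))
```

Here the wrapper groups are numbered 1, 4 and 5, so a user's
regex-group(2) inside the first branch maps to combined group 3.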

By the way, as a side question: what is regex-group() supposed to
return in XSLT 2.0 at present when the corresponding subexpression
matches more than once, as it may in the (...)* case?
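
For comparison only, this is what one of the underlying libraries does
with a repeated group. The snippet shows the behaviour of Python's re
module, and I'm not claiming that XSLT 2.0 specifies the same thing:

```python
import re

m = re.fullmatch(r"(a|b)*c", "abc")
repeated = m.group(1)   # the group matched "a", then "b"; only the last is kept

skipped = re.fullmatch(r"(a|b)*c", "c").group(1)   # the group never took part
```

In Python, repeated is "b" (the last iteration wins) and skipped is
None rather than an empty string.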

> I suggest you post this to the W3C bugzilla database as a comment on the
> spec, which means it will go on the WG agenda for consideration. The status
> section of the spec gives you a pointer.

I wanted to discuss it here first to see if there are any obvious
design flaws that I've missed, or other relevant scenarios that others
have encountered. The idea is to submit this as a spec comment in the
end, yes.

> I think that in many real-life cases one can solve this problem by doing two
> levels of matching. For example, you can often do it by first tokenizing
> with space as a delimiter, then matching each token against specific regex
> patterns. This avoids the reliance on captured subgroups.

In my specific case, I was trying to use the facility to parse XPath
1.0 expressions, so tokenizing on space isn't an option there. Of
course, one can first tokenize using analyze-string, and then use
matches() on each token separately, but this is still rather
inconvenient, as well as a needless performance hit because of double
matching.
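
To show what I mean by double matching, here is a rough Python sketch
of that two-level approach. The token patterns are a toy subset I made
up for illustration, nowhere near a full XPath 1.0 lexer, and
classify() is a hypothetical helper:

```python
import re

# Toy token patterns (assumed for illustration only).
KINDS = [
    ("number", r"\d+(?:\.\d+)?"),
    ("name", r"[A-Za-z_][\w.-]*"),
    ("operator", r"//|[/()\[\]@*+=-]"),
]
TOKEN = re.compile("|".join(rx for _, rx in KINDS))

def classify(text):
    out = []
    for tok in TOKEN.findall(text):        # first pass: split into tokens
        for kind, rx in KINDS:             # second pass: re-match each token
            if re.fullmatch(rx, tok):
                out.append((kind, tok))
                break
    return out

tokens = classify("//a[@b=1]")
```

Every token is matched once to find it and again to find out which
pattern produced it, which is the work a single-pass construct would
avoid.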
