RE: [xsl] A better xsl:analyze-string

Subject: RE: [xsl] A better xsl:analyze-string
From: "Michael Sokolov" <sokolov@xxxxxxxxxxxx>
Date: Thu, 20 Aug 2009 20:29:30 -0400
I like the avoidance of the clumsy numbered capture groups. (My non-starter
of a proposal would be to add Perl-style capture variables, a la $1, $2, $3,
$&, $', $`, etc.)

But how would you retrieve the value of the matching subgroup (the decimal
portion) in:
<xsl:matching-substring regex="\d+(\.\d*)?">

There's something asymmetric about your proposal that bothers me.  There are
other cases of combining multiple capture groups that wouldn't get the same
special treatment (like the nesting in the example).  Why assume that
capture groups are always combined as (...)|(...)?  It's a very special
case: is it so common as to warrant special syntax?

-Mike

> -----Original Message-----
> From: Pavel Minaev [mailto:int19h@xxxxxxxxx] 
> Sent: Thursday, August 20, 2009 5:59 PM
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [xsl] A better xsl:analyze-string
> 
> On Thu, Aug 20, 2009 at 2:39 PM, Michael Kay <mike@xxxxxxxxxxxx> wrote:
> > It's true that using regex-group() is a pretty messy mechanism, and
> > it would be nice to do better.
> >
> > Things get a bit more complicated if there are groups that can match
> > more than once, of the form (...)*. It's not clear how that would
> > work with your proposed syntax.
> >
> > One of the constraints is that we want to ensure that the facilities
> > can be implemented on top of popular regex libraries such as those
> > used by Java, C#, or Perl. These are all very heavily based on the
> > concept of numbered captured groups, with all their quirks.
> 
> I'm not sure I understand. How would having (...)* affect the
> syntax or semantics in any way?
> 
> The straightforward implementation of this that I imagine is
> a simple rewrite. If we have matching-substring instructions
> for regexes rx1, rx2, ..., rxN, the implementation rewrites
> them as a single regex:
> 
>   (rx1)|(rx2)|...|(rxN)
> 
> and then counts the parentheses to determine the group number
> of each of the original tokens. From there it's a trivial
> rewrite to a choose/when form. Counting parentheses is
> sufficient per the spec for the regex-group() function:
> 
> "The Nth captured substring (where N > 0) is the string
> matched by the subexpression contained by the Nth left
> parenthesis in the regex."
> 
> So any quantifiers on groups shouldn't affect this. It would,
> of course, also have to correct the group number for any user
> call to regex-group() from within matching-substring, but that
> is similarly trivial.
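
The rewrite described above can be sketched in a few lines. This is only
an illustration under stated assumptions: it uses Python's re module
rather than an XSLT processor, the helper names combine and analyze are
made up for this sketch, and it relies on the library's own group count
instead of literally counting parentheses (which would also have to skip
escaped parentheses and character classes). Backreferences inside a
pattern would need renumbering too, which the sketch does not attempt.

```python
import re

def combine(patterns):
    """Join patterns as (rx1)|(rx2)|... and record where each one's groups start."""
    parts, offsets, counts = [], [], []
    next_group = 1
    for p in patterns:
        offsets.append(next_group)              # number of the wrapper group
        counts.append(re.compile(p).groups)     # groups inside the pattern itself
        next_group += 1 + counts[-1]
        parts.append("(" + p + ")")
    return re.compile("|".join(parts)), offsets, counts

def analyze(text, patterns):
    """Yield (pattern index, matched text, that pattern's own groups) per match."""
    combined, offsets, counts = combine(patterns)
    for m in combined.finditer(text):
        for i, g in enumerate(offsets):
            if m.group(g) is not None:
                # The pattern's own group k is combined group g + k.
                own = [m.group(g + k) for k in range(1, counts[i] + 1)]
                yield i, m.group(g), own
                break

# Hypothetical sample: a number pattern with a nested group, and a name pattern.
patterns = [r"\d+(\.\d*)?", r"[A-Za-z_]\w*"]
result = list(analyze("x1 + 3.14", patterns))
```

With this sample input, the decimal portion of \d+(\.\d*)? comes back as
that pattern's own group 1, even though it is group 2 of the combined
regex.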
> 
> By the way, as a side question: what is regex-group()
> supposed to return in XSLT 2.0 at present when the
> corresponding subexpression matches more than once, as it
> may do in the (...)* case?
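
For comparison, here is how one numbered-group library treats a repeated
group. The sketch uses Python's re module as a stand-in and shows that
library's behavior only; it is not a statement of what the XSLT 2.0 spec
requires.

```python
import re

# A group under * matches once per iteration; only the last one is kept.
m = re.fullmatch(r"(\d)*", "123")
last_only = m.group(1)

# With zero iterations the group never takes part in the match.
m_empty = re.fullmatch(r"(\d)*", "")
unmatched = m_empty.group(1)
```

Here last_only holds the final iteration of the group, and unmatched is
None because the group never participated in the match.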
> 
> > I suggest you post this to the W3C bugzilla database as a comment on
> > the spec, which means it will go on the WG agenda for consideration.
> > The status section of the spec gives you a pointer.
> 
> I wanted to discuss it here first to see if there are any 
> obvious design flaws that I've missed, or other relevant 
> scenarios that others have encountered. The idea is to submit 
> this as a spec comment in the end, yes.
> 
> > I think that in many real-life cases one can solve this problem by
> > doing two levels of matching. For example, you can often do it by
> > first tokenizing with space as a delimiter, then matching each token
> > against specific regex patterns. This avoids the reliance on
> > captured subgroups.
> 
> In my specific case, I was trying to use the facility to 
> parse XPath 1.0 expressions, so tokenizing on space isn't an 
> option there. Of course, one can first tokenize using 
> analyze-string, and then use
> matches() on each token separately, but this is still rather 
> inconvenient, as well as a needless performance hit because 
> of double matching.
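
The two-pass workaround discussed above might look roughly like this. It
is a sketch in Python, not XSLT; the TOKEN and KINDS tables cover only a
tiny, hypothetical subset of XPath 1.0 tokens, and classify is a made-up
helper name. Note that every token is matched twice, once to find it and
once to classify it, which is the double matching objected to above.

```python
import re

# Pass 1: one regex with no capture groups that just splits out tokens.
TOKEN = re.compile(r"\d+(?:\.\d*)?|[A-Za-z_][\w-]*|//|[/()\[\]@*=]")

# Pass 2: per-kind patterns tried against each token in turn.
KINDS = [("number", r"\d+(\.\d*)?"), ("name", r"[A-Za-z_][\w-]*")]

def classify(expr):
    """Return (kind, token) pairs; anything not in KINDS is an operator."""
    out = []
    for tok in TOKEN.findall(expr):
        kind = next((k for k, rx in KINDS if re.fullmatch(rx, tok)), "operator")
        out.append((kind, tok))
    return out

tokens = classify("//item[price=12.5]")
```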
