Re: [xsl] lookaheads in XSLT2 regexes

Subject: Re: [xsl] lookaheads in XSLT2 regexes
From: Michael Ludwig <milu71@xxxxxx>
Date: Tue, 2 Mar 2010 01:50:17 +0100
Michael Kay wrote on 01.03.2010 at 17:52:15 (-0000):
> > 
> > I didn't realise we were missing \b -- we should add it, if 
> > that's the case.
> > 
> 
> I think it was omitted deliberately, on the grounds that it's
> locale-sensitive. It's defined in Perl as matching "a spot between two
> characters that has a \w on one side of it and a \W on the other side
> of it (in either order)", where \w matches a "word" character (defined
> as "alphanumeric" plus "_"), in which "the list of alphabetic
> characters generated by \w is taken from the current locale".
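
The quoted definition can be checked mechanically. Below is a minimal sketch, in Python rather than Perl, on the assumption that Python's `re` module defines `\b` the same way relative to its own `\w`. The helper names `boundaries_by_definition` and `boundaries_by_regex` are made up for illustration.

```python
import re

def boundaries_by_definition(text):
    """Positions with a word character on exactly one side.

    The start and end of the string count as non-word neighbours.
    """
    def is_word(i):
        return 0 <= i < len(text) and re.fullmatch(r"\w", text[i]) is not None
    return [i for i in range(len(text) + 1) if is_word(i - 1) != is_word(i)]

def boundaries_by_regex(text):
    """Positions where the engine's own zero-width \\b assertion matches."""
    return [m.start() for m in re.finditer(r"\b", text)]

sample = "foo bar_1, baz"
print(boundaries_by_definition(sample))  # [0, 3, 4, 9, 11, 14]
print(boundaries_by_regex(sample))       # same list
```

Both functions agree because `\b` is nothing more than "the `\w` status flips here"; everything locale- or Unicode-related hides inside the definition of `\w`.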

Defining words according to a locale setting is often pretty useful.
Note that you have to explicitly turn this behaviour on in Perl by using
the locale pragma:

michael@wladimir:~ :-) export LANG=de_DE
michael@wladimir:~ :-) perl -lwe '$_="Käse"; m/\w+/; print $&'
K
michael@wladimir:~ :-) perl -Mlocale -lwe '$_="Käse"; m/\w+/; print $&'
Käse

But since Perl started to support Unicode, this has lost importance.
Marking a string as Unicode makes all operations applied to it
Unicode-aware, including regular expressions:

$ perl -lwe '$_="Käse"; print; m/\w+/; print $&'
Käse
K

$ perl -Mutf8 -lwe '$_="Käse"; print; m/\w+/; print $&'
Käse
Käse

$ perl -MEncode -lwe '$_=decode "ISO-8859-1","Käse"; m/\w+/; print $&'
Käse

On such a string, \w also matches Japanese ideographs, regardless of locale
or LANG. The definition of "word character" here comes from the Unicode
property database.
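
For comparison, the same contrast between ASCII-only and Unicode-aware \w can be sketched in Python, where string patterns use the Unicode database by default and the `re.ASCII` flag restores the narrow behaviour. This is an analogy under that assumption, not a statement about Perl internals; the variable names are illustrative.

```python
import re

text = "Käse"

# ASCII-only rules: \w stops at the first non-ASCII letter.
ascii_word = re.match(r"\w+", text, re.ASCII).group()    # "K"

# Default rules for str patterns in Python 3: \w follows Unicode.
unicode_word = re.match(r"\w+", text).group()            # "Käse"

# Ideographs count as word characters too, whatever the locale.
cjk_word = re.match(r"\w+", "日本語 text").group()        # "日本語"
```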

It is a bit complex in Perl because of the byte=character legacy. At
least that's a legacy XML does not have to support :-)

> That's not an acceptable definition for our purposes, so it's arguably
> better to have no definition at all.

I don't know about acceptability, but the Perl way of handling \w, though
confusing unless you understand the rationale (the history), looks
useful to me. After all, real words occur rather frequently in real
text, and it's not a totally unrealistic requirement to do something
with them.

If you don't want the \w magic, you can still define explicit character
classes, of course. Or use the Unicode properties.

> We could perhaps define \w to match "alphanumeric" as the term is used
> in xsl:number (categories Nd, Nl, No, Lu, Ll, Lt, Lm or Lo) and then
> it's a well-defined concept, though not necessarily one that matches
> user expectations.
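
To make the proposed definition concrete, here is a minimal sketch that classifies characters by the eight general categories listed in the quote, using Python's `unicodedata` module. The names `ALNUM_CATEGORIES`, `is_alphanumeric` and `words` are hypothetical helpers, not part of any XSLT or Perl API.

```python
import unicodedata
from itertools import groupby

# Categories named in the quoted proposal (the xsl:number notion of
# "alphanumeric").
ALNUM_CATEGORIES = {"Nd", "Nl", "No", "Lu", "Ll", "Lt", "Lm", "Lo"}

def is_alphanumeric(ch):
    """True if the character's general category is in the proposed set."""
    return unicodedata.category(ch) in ALNUM_CATEGORIES

def words(text):
    """Split text into maximal runs of alphanumeric characters."""
    return ["".join(run)
            for keep, run in groupby(text, key=is_alphanumeric) if keep]

print(words("foo_bar x2, ok"))  # ['foo', 'bar', 'x2', 'ok']
```

Under this definition the underscore (category Pc) and combining marks (category Mn) are not word characters, which is one place where it would differ from what a Perl user expects of \w.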
> 
> The fact that Perl overloads \b to mean backspace when within a
> character class doesn't help.

The word boundary is not a character, so why use it inside a character
class? Escapes aside, the only characters that have special meaning
inside a character class are ^ (for negation, not beginning of line) and
- (for ranges).
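
The overloading is easy to observe. The sketch below uses Python's `re` module, which documents the same convention as Perl (`\b` is a word boundary outside a character class and a backspace inside one); the sample string is made up.

```python
import re

text = "ab\x08 cd"   # contains one backspace character (U+0008)

# Outside a class, \b is the zero-width word boundary.
boundary_positions = [m.start() for m in re.finditer(r"\b", text)]

# Inside a class, \b is the backspace character itself.
backspaces = re.findall(r"[\b]", text)

print(boundary_positions)  # [0, 2, 4, 6]
print(backspaces)          # ['\x08']
```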

> And one feels that if it's useful to have a metacharacter that matches
> the spot between a character in one character class and a character in
> its complement, then one ought to generalize the concept so it works
> with any character class, not just the rather arbitrary class
> containing Nd, Nl, No, Lu, Ll, Lt, Lm and Lo.
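
In an engine that has lookaround assertions (which XPath 2.0 regexes do not), the generalization asked for here can be written by hand. The following is a sketch in Python under that assumption; `class_boundary` is a hypothetical helper, and its argument is taken to be the body of a bracket expression.

```python
import re

def class_boundary(char_class):
    """Zero-width pattern for the spot between a character in char_class
    and one outside it (the string edges count as outside).

    char_class is the body of a bracket expression, e.g. "0-9".
    """
    inside = f"[{char_class}]"
    return re.compile(
        f"(?<={inside})(?!{inside})"     # leaving the class
        f"|(?<!{inside})(?={inside})"    # entering the class
    )

digit_edge = class_boundary("0-9")
print([m.start() for m in digit_edge.finditer("ab12cd345")])  # [2, 4, 6, 9]
```

Passing the class that the engine uses for \w should reproduce \b itself, which is the sense in which \b is just one instance of the general construct.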

I'd say it's historically established. Even good old grep has it (as \<
and \>). People know about it and use it a lot, so why not support it
as an easily memorizable shorthand notation? After all, this is about
words, and words are a frequent use case.

Best,
-- 
Michael Ludwig
