Subject: Re: [xsl] lookaheads in XSLT2 regexes
From: Michael Ludwig <milu71@xxxxxx>
Date: Tue, 2 Mar 2010 01:50:17 +0100

Michael Kay wrote on 01.03.2010 at 17:52:15 (-0000):
> > I didn't realise we were missing \b -- we should add it, if
> > that's the case.
>
> I think it was omitted deliberately, on the grounds that it's
> locale-sensitive. It's defined in Perl as matching "a spot between two
> characters that has a \w on one side of it and a \W on the other side
> of it (in either order)", where \w matches a "word" character (defined
> as "alphanumeric" plus "_"), in which "the list of alphabetic
> characters generated by \w is taken from the current locale".

Defining words according to a locale setting is often pretty useful.
Note that you have to explicitly turn this behaviour on in Perl by
using the locale pragma:

  michael@wladimir:~ :-) export LANG=de_DE
  michael@wladimir:~ :-) perl -lwe '$_="Käse"; m/\w+/; print $&'
  K
  michael@wladimir:~ :-) perl -Mlocale -lwe '$_="Käse"; m/\w+/; print $&'
  Käse

But since Perl has started to support Unicode, this has lost
importance. Marking a string as Unicode makes all operations applied to
it Unicode-aware, including regular expressions:

  $ perl -lwe '$_="Käse"; print; m/\w+/; print $&'
  Käse
  K
  $ perl -Mutf8 -lwe '$_="Käse"; print; m/\w+/; print $&'
  Käse
  Käse
  $ perl -MEncode -lwe '$_=decode "ISO-8859-1","Käse"; m/\w+/; print $&'
  Käse

It also matches Japanese ideographs, regardless of locale or LANG. The
definition of "word character" here is the Unicode property database.

It is a bit complex in Perl because of the byte=character legacy. At
least that's a legacy XML does not have to support :-)

> That's not an acceptable definition for our purposes, so it's arguably
> better to have no definition at all.

I don't know about acceptability, but the Perl way of handling \w,
though confusing unless you understand the rationale (the history),
looks useful to me. After all, real words occur rather frequently in
real text, and it's not a totally unrealistic requirement to do
something with them.
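
The difference between a byte-oriented \w and a Unicode-aware \w can be sketched outside Perl as well. The following is a minimal illustration in Python, not Perl or XSLT, and it assumes Python 3's `re` module, where `re.ASCII` restricts \w to `[a-zA-Z0-9_]` and the default for `str` patterns is Unicode-aware. The sample text is made up for the example.

```python
import re

# Sample text (an assumption for illustration): a German word,
# Japanese ideographs, and an identifier containing an underscore.
text = "Käse 日本語 foo_bar"

# ASCII-only \w: comparable to matching without locale or Unicode
# support, so the word is cut at the non-ASCII letter.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Unicode-aware \w: the default for str patterns in Python 3.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)    # ['K', 'se', 'foo_bar']
print(unicode_words)  # ['Käse', '日本語', 'foo_bar']
```

The ASCII run stops at "K", just like the first Perl one-liner, while the Unicode-aware run keeps the whole word and also picks up the ideographs.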

If you don't want the \w magic, you can still define explicit character
classes, of course. Or use the Unicode properties.

> We could perhaps define \w to match "alphanumeric" as the term is used
> in xsl:number (categories Nd, Nl, No, Lu, Ll, Lt, Lm or Lo) and then
> it's a well-defined concept, though not necessarily one that matches
> user expectations.
>
> The fact that Perl overloads \b to mean backspace when within a
> character class doesn't help.

The word boundary is not a character, so why use it inside a character
class? The only characters that may have special meaning inside a
character class are ^ for negation (not beginning of line) and - (for
ranges).

> And one feels that if it's useful to have a metacharacter that matches
> the spot between a character in one character class and a character in
> its complement, then one ought to generalize the concept so it works
> with any character class, not just the rather arbitrary class
> containing Nd, Nl, No, Lu, Ll, Lt, Lm and Lo.

I'd say it's historically established. Even good old grep has it (as \<
and \>). People know about it and use it a lot, so why not support it
as an easily memorizable shorthand notation? After all, this is about
words, and words are a frequent use case.

Best,
--
Michael Ludwig
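
The two ideas quoted above, an "alphanumeric" class built from the categories Nd, Nl, No, Lu, Ll, Lt, Lm and Lo, and a boundary generalized to any character class, can be sketched as follows. This is an illustration in Python only, under stated assumptions: the helper names `is_alnum`, `boundaries` and `boundary_pattern` are hypothetical, and the second variant relies on lookarounds, which Python's `re` has but which are the very feature this thread notes as missing from XSLT 2.0 regexes.

```python
import re
import unicodedata

# Categories quoted in the thread for "alphanumeric" in the xsl:number sense.
ALNUM_CATEGORIES = {"Nd", "Nl", "No", "Lu", "Ll", "Lt", "Lm", "Lo"}

def is_alnum(ch):
    """Hypothetical helper: True if ch falls in one of the quoted categories."""
    return unicodedata.category(ch) in ALNUM_CATEGORIES

def boundaries(text, in_class=is_alnum):
    """Offsets where exactly one neighbouring character is in the class."""
    result = []
    for i in range(len(text) + 1):
        before = i > 0 and in_class(text[i - 1])
        after = i < len(text) and in_class(text[i])
        if before != after:
            result.append(i)
    return result

def boundary_pattern(char_class):
    """Hypothetical helper: a \\b-like assertion for any bracket expression.

    Matches the empty string between a character in char_class and a
    character (or string edge) outside it, in either order.
    """
    return (rf"(?:(?<={char_class})(?!{char_class})"
            rf"|(?<!{char_class})(?={char_class}))")

word_edges = boundaries("Käse, bitte")
digit_edges = [m.start()
               for m in re.finditer(boundary_pattern("[0-9]"), "ab12cd")]

print(word_edges)   # [0, 4, 6, 11]
print(digit_edges)  # [2, 4]
```

Note that "_" is in category Pc, so unlike Perl's \w this class does not treat the underscore as a word character.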