Subject: Re: [xsl] lookaheads in XSLT2 regexes
From: "Imsieke, Gerrit, le-tex" <gerrit.imsieke@xxxxxxxxx>
Date: Thu, 04 Mar 2010 22:30:05 +0100
I feel that \b is very much tied to a specific set of characters which might not be exactly the set you want. I'd be more comfortable providing general-purpose zero-width look-ahead and look-behind:
regex="(?&lt;=\P{{L}})\p{{Lu}}{{2,}}(?=\P{{L}})"
A compromise will be (as suggested above):
- allow concise \b and \w syntax in the regexes,
- per-stylesheet means to redefine the default word constituent expression
which seems far more powerful.
Regards,
Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay
-----Original Message-----
From: Imsieke, Gerrit, le-tex [mailto:gerrit.imsieke@xxxxxxxxx]
Sent: 04 March 2010 17:12
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] lookaheads in XSLT2 regexes
Dear Liam,
Thanks for promoting the \b case. As an illustration for \b's usefulness, let me show how I tag acronyms for a recent project:
<xsl:template match="text()" mode="majuscules">
  <!-- Groups 1 and 3 capture the delimiters around the acronym (group 2). -->
  <xsl:analyze-string select="."
      regex="(^|[\p{{P}}\p{{Z}}\p{{C}}])(\p{{Lu}}{{2,}})([\p{{P}}\p{{Z}}\p{{C}}]|$)">
    <xsl:matching-substring>
      <xsl:value-of select="regex-group(1)"/>
      <span class="majusc">
        <xsl:value-of select="regex-group(2)"/>
      </span>
      <xsl:value-of select="regex-group(3)"/>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
With (a reasonably defined) \b, this could be simplified to
<xsl:template match="text()" mode="majuscules">
  <xsl:analyze-string select="." regex="\b\p{{Lu}}{{2,}}\b">
    <xsl:matching-substring>
      <span class="majusc">
        <xsl:value-of select="."/>
      </span>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="."/>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:template>
Please note that \b should not only match the \w/\W boundary, but also the beginning or end of the string (or line, when the 'm' flag is in force). Speaking of the 'm' flag, and in Michael's direction: I regard \b as much more useful than the 'm' flag when processing XML.
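The string-boundary point also matters for look-around emulations of \b. As a rough illustration outside XSLT (XSLT 2.0 regexes have no look-arounds), here is a small Python sketch. It is only an approximation: Python's built-in re module has no \p{L} or \p{Lu}, so ASCII classes stand in for them, and the names POSITIVE, NEGATIVE and find_acronyms are made up for this example. A positive look-around such as (?<=\P{L}) needs an actual character to look at, so it fails at the start or end of the string, while the negated form does not.

```python
import re

# ASCII-only stand-ins (an assumption of this sketch):
# [A-Za-z] approximates \p{L}, [A-Z] approximates \p{Lu}.

# Positive look-arounds: require a non-letter character on both sides.
POSITIVE = re.compile(r"(?<=[^A-Za-z])[A-Z]{2,}(?=[^A-Za-z])")

# Negative look-arounds: only require that no letter is adjacent,
# which also holds at the very start or end of the string.
NEGATIVE = re.compile(r"(?<![A-Za-z])[A-Z]{2,}(?![A-Za-z])")


def find_acronyms(pattern, text):
    """Return all runs of two or more capitals matched by the given pattern."""
    return pattern.findall(text)


if __name__ == "__main__":
    sample = "NATO and EU"
    print(find_acronyms(POSITIVE, sample))
    print(find_acronyms(NEGATIVE, sample))
```

With the sample string, the positive form misses both acronyms because each touches a string boundary, whereas the negative form finds them. A \b defined as suggested above would behave like the negative form.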
Gerrit
On 04.03.2010 06:59, Liam R E Quin wrote:
On Wed, 2010-03-03 at 21:27 +0000, Michael Kay wrote:
On the subject of \b I'll note we do have \W and \w
So we do, I overlooked that. And we define it a little differently from Perl:
[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]
So for example "+" is regarded as part of a word, while "-" isn't. Which strikes me as totally useless, to be honest.
I agree.
We could fix that for XPath 2.1 I think. I'm not sure what the most useful fix would be, I admit.
The Perl definition of "alphanumeric" plus "_" would probably work for \w, if one took alphanumeric to mean Letters|Numbers, \p{L}|\p{N}, and is coincidentally closer to what you get in Perl if you do use locale; and your locale is (say) en_UK.UTF8, as it's then the same as the POSIX fragment [[:alpha:][:digit:]_]
There are lots of things that could be added to regular expressions; but \b is hard to emulate, useful, and also we seem to have a rather odd \w. If \w is there, I think \b was omitted by mistake.
Or that \w was included by mistake!
Liam
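The observation quoted above, that "+" counts as a word character while "-" does not, follows directly from the Unicode general categories. A minimal Python sketch can check it, assuming \w is exactly "any character not in P, Z or C" as in the quoted definition. The helper name is_xsd_word_char is hypothetical and is not part of any XSLT or XPath API.

```python
import unicodedata


def is_xsd_word_char(ch: str) -> bool:
    """Emulate the quoted \\w definition: any character whose Unicode
    general category is not Punctuation (P), Separator (Z) or Other (C)."""
    major_class = unicodedata.category(ch)[0]
    return major_class not in ("P", "Z", "C")


if __name__ == "__main__":
    for ch in "+-_a1 $":
        print(repr(ch), unicodedata.category(ch), is_xsd_word_char(ch))
```

"+" is in category Sm (math symbol), so it passes, while "-" is Pd (dash punctuation) and "_" is Pc (connector punctuation), so both are excluded. That is the opposite of Perl's treatment of "_".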
--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de
Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930
Geschäftsführer: Gerrit Imsieke, Svea Jelonek, Thomas Schmidt, Dr. Reinhard Vöckler