Subject: RE: [xsl] lookaheads in XSLT2 regexes
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 4 Mar 2010 17:39:04 -0000

I feel that \b is very much tied to a specific set of characters which might
not be exactly the set you want. I'd be more comfortable providing
general-purpose zero-width look-ahead and look-behind:

regex="(?<=\P{{L}})\p{{Lu}}{{2,}}(?=\P{{L}})"

which seems far more powerful.

Regards,

Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay

> -----Original Message-----
> From: Imsieke, Gerrit, le-tex [mailto:gerrit.imsieke@xxxxxxxxx]
> Sent: 04 March 2010 17:12
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [xsl] lookaheads in XSLT2 regexes
>
> Dear Liam,
>
> Thanks for promoting the \b case. As an illustration of \b's
> usefulness, let me show how I tag acronyms for a recent project:
>
> <xsl:template match="text()" mode="majuscules">
>   <xsl:analyze-string select="."
>     regex="(^|[\p{{P}}\p{{Z}}\p{{C}}])(\p{{Lu}}{{2,}})([\p{{P}}\p{{Z}}\p{{C}}]|$)">
>     <xsl:matching-substring>
>       <xsl:value-of select="regex-group(1)"/>
>       <span class="majusc">
>         <xsl:value-of select="regex-group(2)"/>
>       </span>
>       <xsl:value-of select="regex-group(3)"/>
>     </xsl:matching-substring>
>     <xsl:non-matching-substring>
>       <xsl:value-of select="."/>
>     </xsl:non-matching-substring>
>   </xsl:analyze-string>
> </xsl:template>
>
> With (a reasonably defined) \b, this could be simplified to
>
> <xsl:template match="text()" mode="majuscules">
>   <xsl:analyze-string select="." regex="\b\p{{Lu}}{{2,}}\b">
>     <xsl:matching-substring>
>       <span class="majusc">
>         <xsl:value-of select="."/>
>       </span>
>     </xsl:matching-substring>
>     <xsl:non-matching-substring>
>       <xsl:value-of select="."/>
>     </xsl:non-matching-substring>
>   </xsl:analyze-string>
> </xsl:template>
>
> Please note that \b should not only match the \w/\W boundary,
> but also the beginning or end of the string (or line, when
> the 'm' flag is in force). Speaking of the 'm' flag, and in
> Michael's direction: I regard \b as much more useful than the
> 'm' flag when processing XML.
>
> Gerrit
>
>
> On 04.03.2010 06:59, Liam R E Quin wrote:
> > On Wed, 2010-03-03 at 21:27 +0000, Michael Kay wrote:
> >>> On the subject of \b I'll note we do have \W and \w
> >>
> >> So we do, I overlooked that. And we define it a little differently
> >> from Perl:
> >>
> >> [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]
> >>
> >> So for example "+" is regarded as part of a word, while "-" isn't.
> >> Which strikes me as totally useless, to be honest.
> >
> > I agree.
> >
> > We could fix that for XPath 2.1 I think. I'm not sure what the most
> > useful fix would be, I admit.
> >
> > The Perl definition of "alphanumeric" plus "_" would probably work for
> > \w, if one took alphanumeric to mean Letters|Numbers, \p{L}|\p{N}, and
> > is coincidentally closer to what you get in Perl if you do
> >     use locale;
> > and your locale is (say) en_UK.UTF8, as it's then the same as the
> > POSIX fragment [[:alpha:][:digit:]_]
> >
> > There are lots of things that could be added to regular expressions;
> > but \b is hard to emulate, useful, and also we seem to have a rather
> > odd \w. If \w is there, I think \b was omitted by mistake. Or that
> > \w was included by mistake!
> >
> > Liam
> >
>
> --
> Gerrit Imsieke
> Geschäftsführer / Managing Director
> le-tex publishing services GmbH
> Weissenfelser Str. 84, 04229 Leipzig, Germany
> Phone +49 341 355356 110, Fax +49 341 355356 510
> gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de
>
> Registergericht / Commercial Register: Amtsgericht Leipzig
> Registernummer / Registration Number: HRB 24930
>
> Geschäftsführer: Gerrit Imsieke, Svea Jelonek, Thomas Schmidt,
> Dr. Reinhard Vöckler
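
The boundary behaviours discussed in this message can be compared side by side. The sketch below uses Java's java.util.regex purely as a stand-in, since XSLT 2.0 regexes offer neither lookarounds nor \b; Java's \b is only one possible definition and not necessarily the one a future XPath would adopt. The class name, method name and sample string are hypothetical, and the negative-lookaround variant is an assumption added for comparison, not something proposed in the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; none of these names come from the thread.
final class AcronymBoundaries {

    // Look-behind/look-ahead pattern from the reply above. The doubled
    // braces are dropped because this is a Java string, not an AVT.
    static final String POSITIVE = "(?<=\\P{L})\\p{Lu}{2,}(?=\\P{L})";

    // Assumption (not from the thread): negative lookarounds, which also
    // succeed at the very start and end of the input.
    static final String NEGATIVE = "(?<!\\p{L})\\p{Lu}{2,}(?!\\p{L})";

    // The \b form from the quoted message, under Java's definition of \b.
    static final String WORD_BOUNDARY = "\\b\\p{Lu}{2,}\\b";

    // Made-up sample text with acronyms at the start, middle and end.
    static final String SAMPLE = "NASA met the EU, then ISO_9001 and UNESCO";

    // Collects every non-overlapping match of regex in input, left to right.
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findAll(POSITIVE, SAMPLE));      // [EU, ISO]
        System.out.println(findAll(NEGATIVE, SAMPLE));      // [NASA, EU, ISO, UNESCO]
        System.out.println(findAll(WORD_BOUNDARY, SAMPLE)); // [NASA, EU, UNESCO]
    }
}
```

In this sample the positive lookarounds skip NASA and UNESCO, because they need an actual non-letter character on each side, which is the string start/end case the quoted message asks \b to cover. The \b form skips ISO, because Java counts "_" as a word character, which illustrates how \b depends on a fixed character set.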