RE: [xsl] lookaheads in XSLT2 regexes

Subject: RE: [xsl] lookaheads in XSLT2 regexes
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 4 Mar 2010 17:39:04 -0000
I feel that \b is very much tied to a specific set of characters which might
not be exactly the set you want. I'd be more comfortable providing
general-purpose zero-width look-ahead and look-behind:

regex="(?<=\P{{L}})\p{{Lu}}{{2,}}(?=\P{{L}})"

which seems far more powerful.
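
To make the behaviour concrete, here is a minimal sketch in Java rather than XSLT, since XSLT 2.0 regexes do not accept lookaround syntax while java.util.regex does. The class name, the helper findAll and the sample string are invented for illustration. The first pattern is the expression above with the attribute-value-template brace doubling removed; the second, using negative lookarounds, is an assumed variant and not something proposed in this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Saxon or any XSLT processor.
final class AcronymLookaround {
    // Positive lookarounds: a non-letter must exist on both sides of the run.
    static final Pattern POSITIVE =
        Pattern.compile("(?<=\\P{L})\\p{Lu}{2,}(?=\\P{L})");

    // Assumed variant: negative lookarounds only forbid a letter on either
    // side, so they also succeed at the start and end of the string.
    static final Pattern NEGATIVE =
        Pattern.compile("(?<!\\p{L})\\p{Lu}{2,}(?!\\p{L})");

    // Collect every non-overlapping match of p in text.
    static List<String> findAll(Pattern p, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = p.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "XML and (XSLT) beat SGMLish HTML";
        System.out.println(findAll(POSITIVE, text)); // [XSLT]
        System.out.println(findAll(NEGATIVE, text)); // [XML, XSLT, HTML]
    }
}
```

The positive form misses acronyms at the very start or end of the text node, because a lookbehind or lookahead for \P{L} needs an actual character to test. That is the case the (^|...) and (...|$) alternatives handle in the quoted template below.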

Regards,

Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay

> -----Original Message-----
> From: Imsieke, Gerrit, le-tex [mailto:gerrit.imsieke@xxxxxxxxx]
> Sent: 04 March 2010 17:12
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [xsl] lookaheads in XSLT2 regexes
>
> Dear Liam,
>
> Thanks for promoting the \b case. As an illustration for \b's
> usefulness, let me show how I tag acronyms for a recent project:
>
>    <xsl:template match="text()" mode="majuscules">
>      <xsl:analyze-string select="."
> regex="(^|[\p{{P}}\p{{Z}}\p{{C}}])(\p{{Lu}}{{2,}})([\p{{P}}\p{{Z}}\p{{C}}]|$)">
>        <xsl:matching-substring>
>          <xsl:value-of select="regex-group(1)"/>
>          <span class="majusc">
>            <xsl:value-of select="regex-group(2)"/>
>          </span>
>          <xsl:value-of select="regex-group(3)"/>
>        </xsl:matching-substring>
>        <xsl:non-matching-substring>
>          <xsl:value-of select="."/>
>        </xsl:non-matching-substring>
>      </xsl:analyze-string>
>    </xsl:template>
>
> With (a reasonably defined) \b, this could be simplified to
>
>    <xsl:template match="text()" mode="majuscules">
>      <xsl:analyze-string select="." regex="\b\p{{Lu}}{{2,}}\b">
>        <xsl:matching-substring>
>          <span class="majusc">
>            <xsl:value-of select="."/>
>          </span>
>        </xsl:matching-substring>
>        <xsl:non-matching-substring>
>          <xsl:value-of select="."/>
>        </xsl:non-matching-substring>
>      </xsl:analyze-string>
>    </xsl:template>
>
> Please note that \b should not only match the \w/\W boundary,
> but also the beginning or end of the string (or line, when
> the 'm' flag is in force). Speaking of the 'm' flag, and in
> Michael's direction: I regard \b as much more useful than the
> 'm' flag when processing XML.
>
> Gerrit
>
>
>
> On 04.03.2010 06:59, Liam R E Quin wrote:
> > On Wed, 2010-03-03 at 21:27 +0000, Michael Kay wrote:
> >>> On the subject of \b I'll note we do have \W and \w
> >>
> >> So we do, I overlooked that. And we define it a little differently
> >> from Perl:
> >>
> >> [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]
> >>
> >> So for example "+" is regarded as part of a word, while "-" isn't.
> >> Which strikes me as totally useless, to be honest.
> >
> > I agree.
> >
> > We could fix that for XPath 2.1 I think.  I'm not sure what the most
> > useful fix would be, I admit.
> >
> > The Perl definition of "alphanumeric" plus "_" would probably work for
> > \w, if one took alphanumeric to mean Letters|Numbers, \p{L}|\p{N}, and
> > is coincidentally closer to what you get in Perl if you do
> >      use locale;
> > and your locale is (say) en_UK.UTF8, as it's then the same as the
> > POSIX fragment [[:alpha:][:digit:]_]
> >
> > There are lots of things that could be added to regular expressions;
> > but \b is hard to emulate, useful, and also we seem to have a rather
> > odd \w.  If \w is there, I think \b was omitted by mistake.  Or that
> > \w was included by mistake!
> >
> > Liam
> >
>
> --
> Gerrit Imsieke
> Geschäftsführer / Managing Director
> le-tex publishing services GmbH
> Weissenfelser Str. 84, 04229 Leipzig, Germany
> Phone +49 341 355356 110, Fax +49 341 355356 510
> gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de
>
> Registergericht / Commercial Register: Amtsgericht Leipzig
> Registernummer / Registration Number: HRB 24930
>
> Geschäftsführer: Gerrit Imsieke, Svea Jelonek, Thomas Schmidt,
> Dr. Reinhard Vöckler
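
As a footnote to the \w discussion quoted in the thread, the following sketch checks which characters count as word characters under the quoted definition, [#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}], and under Liam's suggested letters, numbers and underscore. It is written in Java and only approximates the schema definition with a negated character class; the class and method names are invented for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing two candidate definitions of \w.
final class XsdWordChar {
    // Approximation of the quoted definition: anything that is not
    // punctuation (P), a separator (Z) or an "other" character (C).
    static final Pattern XSD_WORD = Pattern.compile("[^\\p{P}\\p{Z}\\p{C}]");

    // The alternative floated in the thread: letters, numbers, underscore.
    static final Pattern PROPOSED_WORD = Pattern.compile("[\\p{L}\\p{N}_]");

    // True if the whole of ch (expected to be one character) matches p.
    static boolean isWord(Pattern p, String ch) {
        return p.matcher(ch).matches();
    }

    public static void main(String[] args) {
        for (String ch : new String[] {"+", "-", "_", "A", "7"}) {
            System.out.println(ch + "  quoted=" + isWord(XSD_WORD, ch)
                + "  proposed=" + isWord(PROPOSED_WORD, ch));
        }
    }
}
```

Under the quoted definition "+" (category Sm) is a word character while "-" (Pd) and "_" (Pc) are not; under the proposed one only "_", "A" and "7" of these five qualify.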
