Re: [xsl] analyze-string gotcha/reminder

Subject: Re: [xsl] analyze-string gotcha/reminder
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Mon, 19 Nov 2012 09:12:33 +0000
I feel your pain. Many of us have lost a few hairs over this one. The good news is that you probably won't make the same mistake again, or if you do, you will spot it far more quickly.

It's a case where even in retrospect, it's hard to see how we could have avoided this problem in the language design. Perhaps two separate attributes, regex and regex-avt. But that feels very heavy-handed. Most languages have a few quirks like this where people just have to learn the hard way.

Michael Kay
Saxonica

On 18/11/2012 18:18, Ihe Onwuka wrote:
Below is a multiple match meant to extract 4-digit numbers from text:

    <xsl:analyze-string select="$line" regex="(\D|^)(\d{4})(\D|$)">
      <xsl:matching-substring>
        <year><xsl:value-of select="regex-group(2)"/></year>
      </xsl:matching-substring>
    </xsl:analyze-string>

It doesn't work. I tried exactly the same regex in XQuery using replace:

xquery version "1.0";
replace('Accounting Items                                Dec.31,2005
  Dec.31,2006    Dec.31,2007
Dec.31,2008','(\D|^)\d{4}(\D|$)','xxxx')

it worked and I got

Accounting Items                                Dec.31xxxx
Dec.31xxxx   Dec.31xxxx   Dec.31xxxx

I thought maybe there was special syntax for the multiple match case - but no.
Eventually I turned to the specification and found this.

Note:
Because the regex attribute is an attribute value template, curly
brackets within the regular expression must be doubled. For example,
to match a sequence of one to five characters, write regex=".{{1,5}}".
For regular expressions containing many curly brackets it may be more
convenient to use a notation such as
regex="{'[0-9]{1,5}[a-z]{3}[0-9]{1,2}'}", or to use a variable.

So I had to double up my curly braces.
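
For anyone who hits the same thing, here is a sketch of what the fix looks like, assuming the same $line variable as in my original snippet. With single braces, the {4} is evaluated as an XPath expression inside the attribute value template, so the regex actually applied is (\D|^)(\d4)(\D|$), which is why nothing useful matched. Doubling the braces passes a literal {4} through to the regex engine:

    <!-- {{ and }} are AVT escapes for literal { and } -->
    <xsl:analyze-string select="$line" regex="(\D|^)(\d{{4}})(\D|$)">
      <xsl:matching-substring>
        <year><xsl:value-of select="regex-group(2)"/></year>
      </xsl:matching-substring>
    </xsl:analyze-string>

The variable route mentioned in the note should work equally well. The name year-regex is just my own choice for illustration. Because select is an ordinary XPath expression rather than an attribute value template, the braces inside the string literal need no doubling:

    <!-- year-regex is a hypothetical name; braces are literal inside select -->
    <xsl:variable name="year-regex" select="'(\D|^)(\d{4})(\D|$)'"/>

    <xsl:analyze-string select="$line" regex="{$year-regex}">
      <xsl:matching-substring>
        <year><xsl:value-of select="regex-group(2)"/></year>
      </xsl:matching-substring>
    </xsl:analyze-string>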

There's an hour of my life that I won't get back.
