RE: Re: [xsl] XPath 2.0 Regex misunderstanding

Subject: RE: Re: [xsl] XPath 2.0 Regex misunderstanding
From: cknell@xxxxxxxxxx
Date: Fri, 19 Jan 2007 16:55:22 -0500
Thanks. I eventually figured out what I needed. I made the match alternatives more explicit and refreshed my memory on the grouping and alternation symbols, and I got something that worked for me. 

After looking at your solution, I shortened it to this:
not(matches(DATE,'^(0[1-9]|1[0-2])/([0-2]\d|3[01])/(2005|2006|2007)'))

That has over 70% fewer characters than my original expression, a welcome result.

I used to write a lot of text document parsing programs in Perl, and I was very sharp with regular expressions. I have gotten a bit rusty since then.

My requirement is that the date part of the field I am parsing be in the format "MM/DD/YYYY", with no skipped characters (that is to say, "01/05/2006" is perfect, while "1/5/06" should fail on three counts). I re-route the records with bad formats back to the data entry people to correct.
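
To make the requirement concrete, here is a minimal sketch that runs the shortened expression against three sample values. It assumes an XSLT 2.0 processor started at a named initial template; the template name "main", the variable names, and the "strict-day" variant (which narrows the day group so that day "00" is also rejected) are illustrative assumptions, not part of the original expression.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- "main" is an arbitrary name chosen for the initial template -->
  <xsl:template name="main">
    <!-- the shortened pattern quoted in this message -->
    <xsl:variable name="short"
        select="'^(0[1-9]|1[0-2])/([0-2]\d|3[01])/(2005|2006|2007)'"/>
    <!-- assumed variant: day group narrowed to 01-31 -->
    <xsl:variable name="strict-day"
        select="'^(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/(2005|2006|2007)'"/>

    <!-- report whether each sample would be flagged as a bad date -->
    <xsl:for-each select="('01/05/2006', '1/5/06', '01/00/2006')">
      <xsl:value-of select="concat(., ': bad by short = ',
          not(matches(., $short)), ', bad by strict-day = ',
          not(matches(., $strict-day)), '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

With these inputs, "01/05/2006" is accepted by both patterns and "1/5/06" is flagged by both, while "01/00/2006" is flagged only by the strict-day variant, because [0-2]\d also accepts "00". Neither pattern ends in "$", so anything following the date part of the field is ignored.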

Whenever I think of writing regexes, my mind immediately goes to the "Obfuscated Perl Contest" that the Perl Journal used to sponsor.

Thanks for your help.
-- 
Charles Knell
cknell@xxxxxxxxxx - email



-----Original Message-----
From:     Abel Braaksma <abel.online@xxxxxxxxx>
Sent:     Fri, 19 Jan 2007 20:58:56 +0100
To:       xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject:  Re: [xsl] XPath 2.0 Regex misunderstanding

cknell@xxxxxxxxxx wrote:
> I have a date element:
>
> example 
>
> <DATE>11/01/2006</DATE>
>
> I'm trying to write an XPath 2.0 Regex to winnow some of the more obvious date format errors. I have tried for about a half-hour, and I admit to being stumped.

I have some trouble understanding what your "passing" and "failing" are 
about. However, if you are trying to remove the "more obvious date 
format errors", I believe your "matches(...)" needs to become a 
"not(matches(...))", since your regular expression is about inclusion, 
not exclusion.

That said, you can try the following (assuming American dates: 
MM/DD/YYYY) for matching any date, disallowing years > 2006 and allowing 
the format 1/2/2006:

<xsl:variable name="dates">
    <DATE>07/18/2006</DATE>
    <DATE>07/12/2006</DATE>
    <DATE>09/25/2006</DATE>
    <DATE>10/24/2006</DATE>
    <DATE>10/18/2006</DATE>
    <DATE>10/10/2006</DATE>
    <DATE>1/2/2006</DATE>
   
    <!-- false dates -->
    <DATE>22/12/2006</DATE>           
    <DATE>00/10/2000</DATE> 
    <DATE>01/32/2006</DATE>           
    <DATE>10/10/2007</DATE>  
    <DATE>12/12/20006</DATE>   
</xsl:variable>

<xsl:variable name="date-regex">^(
    0?[1-9]|     <!-- 01-09 and 1-9 -->
    1[0-2]       <!-- 10, 11, 12 -->
    )/(
    0?[1-9]|     <!-- 01-09 and 1-9-->
    [1-2]\d|     <!-- 10-29 -->
    3[01]        <!-- 30, 31 -->
    )/(
    1\d{3}|      <!-- 1000-1999 -->
    200[0-6]     <!-- 2000-2006 -->
    )$
</xsl:variable>

<xsl:for-each select="$dates/DATE">
    <xsl:value-of select="concat(., ': ')" />
   
    <!-- add normalize-space, because of a bug
        in saxon prior to 8.0.0.4 with leading space -->
    <xsl:value-of select="matches(.,
        normalize-space($date-regex), 'x')" />
       
    <xsl:text>
</xsl:text>
</xsl:for-each>

This outputs:
07/18/2006: true
07/12/2006: true
09/25/2006: true
10/24/2006: true
10/18/2006: true
10/10/2006: true
1/2/2006: true
22/12/2006: false
00/10/2000: false
01/32/2006: false
10/10/2007: false
12/12/20006: false

>
> Here is the relevant part of the template:
>
> <xsl:when test="matches(DATE,'[0-1][0-2]/[0-3][0-9]/2006')"><bad-date /></xsl:when>

What your statement implies is: output a "bad-date" node when:
1) the month is one of 00, 01, 02, 10, 11, 12
2) the day is anywhere in the range 00, 01, ... 09, 10, 11, ... 19, 20, 
21, ... 29, 30, 31, ... 39
3) the year is 2006.
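
The three conditions above can be checked directly. The following is a minimal sketch, assuming an XSLT 2.0 processor started at a named initial template; the template name "main", the variable names, and the sample strings are illustrative assumptions. It also shows that without "^" and "$" the pattern matches anywhere inside a longer string.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- "main" is an arbitrary name chosen for the initial template -->
  <xsl:template name="main">
    <!-- the pattern from the quoted xsl:when test -->
    <xsl:variable name="loose" select="'[0-1][0-2]/[0-3][0-9]/2006'"/>
    <!-- the same pattern, anchored at both ends -->
    <xsl:variable name="anchored" select="concat('^', $loose, '$')"/>

    <xsl:for-each select="('00/39/2006', '07/18/2006', 'x11/01/2006y')">
      <xsl:value-of select="concat(., ': loose = ', matches(., $loose),
          ', anchored = ', matches(., $anchored), '&#10;')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Here "00/39/2006" matches both patterns, the perfectly valid "07/18/2006" matches neither (month 07 is outside [0-1][0-2]), and "x11/01/2006y" matches only the unanchored pattern.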

Well, I don't know much of your calendar system, but I can hardly 
believe you consider a date such as "00/39/2006" to be correct, so here's 
a part of your problem. I know from my own experience that regexing 
numeric values is a tricky business (the trick is: think strings, not 
numbers).

For an article I wanted to write for a long time, but still haven't, I 
created a template that helps in regexing numeric values. It will simply 
output the right regexes for you, if you give it a number:

my:regex-from-number('376', 0)
will give:
[0-2]\d{2}|
3[0-6]\d|
37[0-5]|
376|\d{2}

It requires some getting used to, but I recall that Jeffrey Friedl named 
this: enrolling the number, or something similar. For small numbers you 
can easily do it by hand, but it is still hard for many mere mortals. It 
is optimized for repeated digits (like 2006). The output regex works 
perfectly. A few notes (if you plan to use it):

|\d{2}
Leave out this part if you require a fixed number of digits, i.e. 034 
and 009. By default, shorter numbers such as 34 are allowed as well 
(a single digit such as 9 would need \d{1,2} here instead).

376
The input number. Repeating the number is not necessary for making a 
bulletproof regular expression, but it made me feel good. The larger 
the maximum number you need to match, the easier it gets putting it 
there: you see instantly what number is being matched.
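
One way to convince yourself that such a generated expression is right is to test it against every candidate value. The following is a minimal sketch, assuming an XSLT 2.0 processor started at a named initial template; the template name "main" and the variable names are illustrative assumptions. It uses the fixed-width form of the expression for 376 (without the trailing |\d{2} alternative), anchored at both ends.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <!-- "main" is an arbitrary name chosen for the initial template -->
  <xsl:template name="main">
    <!-- fixed-width alternatives for 000-376, anchored -->
    <xsl:variable name="re"
        select="'^([0-2]\d{2}|3[0-6]\d|37[0-5]|376)$'"/>

    <!-- collect every n in 0-999 where the regex disagrees with n le 376 -->
    <xsl:variable name="wrong" select="
        for $n in 0 to 999
        return if (matches(format-number($n, '000'), $re) ne ($n le 376))
               then $n else ()"/>

    <xsl:value-of select="concat('mismatches: ', count($wrong), '&#10;')"/>
  </xsl:template>
</xsl:stylesheet>
```

This should report zero mismatches: the four alternatives cover 000-299, 300-369, 370-375 and 376 respectively, and nothing above that.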

The rest speaks for itself, I believe. But call in anytime if you want 
some additional help. The expressions in the opening are taken from this 
template to ensure I did the right thing; however, I made them a bit 
more readable.

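<xsl:function name="my:regex-from-number">
    <!-- $number: the maximum value to match, as a string of digits.
         $pos: the current digit position; call with 0 to start. -->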
    <xsl:param name="number" />
    <xsl:param name="pos" />
    <xsl:variable name="digit1" select="substring($number, $pos, 1)" />
    <xsl:variable name="digit2" select="substring($number, $pos + 1, 1)" />
    <xsl:variable name="len" select="string-length($number)" />
   
    <xsl:value-of select="
        if($len = $pos)
            then concat
                (
                    $digit1,
                    '|\d',
                   
                    if($pos - 1 le 1) then ''
                    else concat('{', $pos - 1, '}')
                )
           
        else
            if ($digit2 = '0')
            then concat
                (
                    $digit1,
                    my:regex-from-number($number, $pos + 1)
                )
       
            else concat
                (
                    $digit1,
                   
                    if(xs:integer($digit2) - 1 = 0) then '0'
                    else concat('[0-', xs:integer($digit2) - 1, ']'),
           
                    if($pos + 1 = $len) then '|'
                    else
                        if($len - $pos - 1 = 1) then '\d|'
                        else concat('\d{', $len - $pos - 1, '}|'),
       
                    '&#10;', substring($number, 1, $pos),
                       
                    my:regex-from-number($number, $pos + 1)
                )" />
</xsl:function>
