RE: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions

Subject: RE: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions
From: "Michael Kay" <mhk@xxxxxxxxx>
Date: Thu, 12 Aug 2004 12:12:08 +0100
The notation \u1234 is not supported in XPath 2.0 regular expressions. Use
&#x1234; instead.
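
A minimal sketch of that fix applied to the stylesheet quoted below, assuming nothing else needs to change. The XML parser expands the character references before the XPath engine sees the regular expression, so the character class really does contain the range U+0600 to U+06FF:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/text">
    <words>
      <xsl:for-each select="tokenize(., '\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <xsl:when test="matches(., '[a-z]+')">latin</xsl:when>
              <!-- XML character references replace the unsupported \u escapes -->
              <xsl:when test="matches(., '[&#x0600;-&#x06FF;]+')">arabic</xsl:when>
              <xsl:otherwise>whatever</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
```

As far as I can tell, the doubled form '[\\u0600-\\u06FF]' is accepted because \\ is the escape for a literal backslash. The class then reads as the literal characters backslash, u, 0, 6 and F plus the range from "0" to backslash (U+0030 to U+005C), which contains ":" (codepoint 58) but none of the Arabic letters.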

Michael Kay
 

> -----Original Message-----
> From: Pierrick Brihaye [mailto:pierrick.brihaye@xxxxxxxxxx] 
> Sent: 12 August 2004 10:38
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions
> 
> Hi,
> 
> I don't know if my XSLT syntax is wrong or if it is a Saxon-related 
> problem. Let's blame the XSLT writer rather than the XSLT processor 
> first ;-)
> 
> Given the following XML :
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <text>livre : كتاب</text>
> 
> And the following XSLT :
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet version="2.0" 
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
>    <xsl:template match="/text">
>      <xsl:comment><xsl:value-of 
> select="system-property('xsl:vendor')" 
> /></xsl:comment>
>      <words>
>        <xsl:for-each select="tokenize(.,'\s+')">
>          <word>
>            <xsl:attribute name="language">
>              <xsl:choose>
>                <xsl:when test="matches(.,'[a-z]+')">latin</xsl:when>
>                <xsl:when 
> test="matches(.,'[\\u0600-\\u06FF]+')">arabic</xsl:when>
>                <xsl:otherwise>whatever</xsl:otherwise>
>              </xsl:choose>
>            </xsl:attribute>
>            <xsl:attribute name="codepoints"><xsl:value-of 
> select="string-to-codepoints(.)"/></xsl:attribute>
>            <xsl:value-of select="."/>
>          </word>
>        </xsl:for-each>
>      </words>
>    </xsl:template>
> </xsl:stylesheet>
> 
> I get :
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <!--SAXON 8.0 from Saxonica-->
> <words>
>    <word language="latin" codepoints="108 105 118 114 
> 101">livre</word>
>    <word language="arabic" codepoints="58">:</word>
>    <word language="whatever" codepoints="1603 1578 1575 
> 1576">كتاب</word>
> </words>
> 
> Why this curious match for codepoint 58 ? And why no match for the 
> arabic characters ?
> 
> BTW, I first tried : matches(.,'[\u0600-\u06FF]+') as stated by 
> http://www.unicode.org/reports/tr18/#Hex_notation
> 
> But Saxon returned the following error :
> 
> Error at xsl:when on line 11 of file:/C:/...:
>    net.sf.saxon.type.RegexTranslator$RegexSyntaxException: Error at 
> character 2 in regular expression: bad escape sequence
> 
> That's why I doubled the "\" character. Is this doubling 
> spec-compliant ?
> 
> Cheers,
> 
> p.b.
