[xsl] XSLT 2.0 : Unicode hex notation in regular expressions

Subject: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions
From: Pierrick Brihaye <pierrick.brihaye@xxxxxxxxxx>
Date: Thu, 12 Aug 2004 11:38:08 +0200
Hi,

I don't know if my XSLT syntax is wrong or if it is a Saxon-related problem. Let's blame the XSLT writer rather than the XSLT processor first ;-)

Given the following XML :

<?xml version="1.0" encoding="UTF-8"?>
<text>livre : كتاب</text>

And the following XSLT :

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/text">
    <xsl:comment><xsl:value-of select="system-property('xsl:vendor')" /></xsl:comment>
    <words>
      <xsl:for-each select="tokenize(.,'\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <xsl:when test="matches(.,'[a-z]+')">latin</xsl:when>
              <xsl:when test="matches(.,'[\\u0600-\\u06FF]+')">arabic</xsl:when>
              <xsl:otherwise>whatever</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <xsl:attribute name="codepoints"><xsl:value-of select="string-to-codepoints(.)"/></xsl:attribute>
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>


I get :

<?xml version="1.0" encoding="UTF-8"?>
<!--SAXON 8.0 from Saxonica-->
<words>
  <word language="latin" codepoints="108 105 118 114 101">livre</word>
  <word language="arabic" codepoints="58">:</word>
  <word language="whatever" codepoints="1603 1578 1575 1576">كتاب</word>
</words>

Why this curious match for codepoint 58 (the colon)? And why no match for the Arabic characters?

BTW, I first tried : matches(.,'[\u0600-\u06FF]+') as stated by http://www.unicode.org/reports/tr18/#Hex_notation

But Saxon returned the following error :

Error at xsl:when on line 11 of file:/C:/...:
net.sf.saxon.type.RegexTranslator$RegexSyntaxException: Error at character 2 in regular expression: bad escape sequence


That's why I doubled the "\" character. Is this doubling spec-compliant ?
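
A possible explanation, offered as a sketch rather than a definitive answer: it assumes that XPath 2.0 regular expressions follow the XML Schema regex syntax, which has no \u escape. Under that assumption, '\\' in '[\\u0600-\\u06FF]' is an escaped literal backslash, so the class holds the literal characters \, u, 0, 6 and F plus the range '0-\' (U+0030 to U+005C). That range contains ':' (U+003A) but no Arabic letter, which would account for both results above. One way to express the intended range is to let the XML parser expand character references inside the stylesheet itself, as in this variant of the stylesheet above (the \p{IsArabic} alternative in the comment assumes the processor supports Unicode block escapes):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/text">
    <words>
      <xsl:for-each select="tokenize(., '\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <xsl:when test="matches(., '[a-z]+')">latin</xsl:when>
              <!-- The XML parser expands the character references before the
                   regex engine sees the pattern, so the class is the literal
                   range U+0600 to U+06FF. -->
              <xsl:when test="matches(., '[&#x0600;-&#x06FF;]+')">arabic</xsl:when>
              <!-- Possible alternative (assumption: block escapes supported):
                   matches(., '\p{IsArabic}+') -->
              <xsl:otherwise>whatever</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
```

With this class, the ':' token should fall through to "whatever" and the Arabic word should be reported as "arabic".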

Cheers,

p.b.
