Hi,
I don't know if my XSLT syntax is wrong or if it is a Saxon-related
problem. Let's blame the XSLT writer rather than the XSLT processor
first ;-)
Given the following XML:
<?xml version="1.0" encoding="UTF-8"?>
<text>livre : كتاب</text>
And the following XSLT:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/text">
    <xsl:comment><xsl:value-of select="system-property('xsl:vendor')"/></xsl:comment>
    <words>
      <xsl:for-each select="tokenize(.,'\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <xsl:when test="matches(.,'[a-z]+')">latin</xsl:when>
              <xsl:when test="matches(.,'[\\u0600-\\u06FF]+')">arabic</xsl:when>
              <xsl:otherwise>whatever</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <xsl:attribute name="codepoints"><xsl:value-of select="string-to-codepoints(.)"/></xsl:attribute>
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
I get:
<?xml version="1.0" encoding="UTF-8"?>
<!--SAXON 8.0 from Saxonica-->
<words>
<word language="latin" codepoints="108 105 118 114 101">livre</word>
<word language="arabic" codepoints="58">:</word>
<word language="whatever" codepoints="1603 1578 1575 1576">كتاب</word>
</words>
Why does codepoint 58 (the colon) match the Arabic test? And why is
there no match for the Arabic characters?
BTW, I first tried matches(.,'[\u0600-\u06FF]+'), following the hex
notation described at
http://www.unicode.org/reports/tr18/#Hex_notation
but Saxon returned the following error:
Error at xsl:when on line 11 of file:/C:/...:
net.sf.saxon.type.RegexTranslator$RegexSyntaxException: Error at
character 2 in regular expression: bad escape sequence
That's why I doubled the "\" character. Is this doubling spec-compliant?
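
One possible explanation, offered as a sketch rather than a confirmed
diagnosis: XPath 2.0 regular expressions follow the XML Schema regex
syntax, which has no \u escape. With the backslash doubled, '\\' is read
as a literal backslash, so the class would contain the literal characters
\, u, 0, 6 and F plus the range from '0' to '\' (codepoints 48 to 92).
That range includes ':' (58) and none of the Arabic letters, which would
account for both results. Assuming that reading is right, the range can
be written with XML character references instead, which the XML parser
expands before the regex is compiled:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/text">
    <words>
      <xsl:for-each select="tokenize(., '\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <xsl:when test="matches(., '[a-z]+')">latin</xsl:when>
              <!-- &#x0600; and &#x06FF; are resolved by the XML parser,
                   so the regex engine sees the two real characters
                   as the ends of the range (assumed fix, untested here) -->
              <xsl:when test="matches(., '[&#x0600;-&#x06FF;]+')">arabic</xsl:when>
              <xsl:otherwise>whatever</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
```

With the input above I would expect ':' to fall through to "whatever"
and the last token to be tagged "arabic". The block escape \p{IsArabic}
from the XML Schema regex syntax may be another way to express the same
range.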
Cheers,
p.b.