RE: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions

Subject: RE: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Mon, 12 Jun 2006 21:28:30 +0100
The CJKCompatibility block covers the codepoint range x3300-x33FF only. I
would imagine that to match Japanese language characters you are looking for
a much larger range than this.

If the range of codepoints you want to match doesn't correspond to one of
the named blocks you can always write, for example [&_#x3000;-&_#xFE4F;]
(without the underscores).
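
As a minimal sketch of how that might look in the stylesheet quoted below (not a tested or definitive fix), the first test can use the explicit codepoint range instead of the named block. The <words> wrapper element, the normalize-space() call and the xsl:value-of are assumptions added for illustration and are not part of the original stylesheet. Note that a range this wide also takes in characters used by other languages (Hangul syllables, for instance), so "Japanese" is only an approximation here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>
  <xsl:template match="/text">
    <!-- <words> is a hypothetical wrapper so the result has a single root -->
    <words>
      <!-- normalize-space() (an addition) avoids empty leading/trailing tokens -->
      <xsl:for-each select="tokenize(normalize-space(.), '\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <!-- explicit codepoint range x3000-xFE4F, written as character references -->
              <xsl:when test="matches(., '[&#x3000;-&#xFE4F;]')">Japanese</xsl:when>
              <xsl:when test="matches(., '\p{IsBasicLatin}')">Latin</xsl:when>
              <xsl:otherwise>Unknown</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <!-- echo the token itself (an addition, for easier checking) -->
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
```

Because matches() succeeds when any substring matches the pattern, the trailing + is not needed. If named blocks are preferred, a narrower alternative would be a character class such as [\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}].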

Michael Kay
http://www.saxonica.com/ 

> -----Original Message-----
> From: jbesch@xxxxxxx [mailto:jbesch@xxxxxxx] 
> Sent: 12 June 2006 20:26
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Cc: jbesch@xxxxxxx
> Subject: Re: [xsl] XSLT 2.0 : Unicode hex notation in regular 
> expressions
> 
> > How, for example, to use a useful syntax like
> >   matches(.,'\p{Script:Arabic}+') ?
> >
> >schema-2 says: http://www.w3.org/TR/xmlschema-2/#regexs
> >
> >[Definition:] [Unicode Database] groups code points into a number of 
> >blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul 
> >Jamo, CJK Compatibility, etc. The set containing all characters that 
> >have block name X (with all white space stripped out), can be 
> >identified with a block escape \p{IsX}. The complement of this set is 
> >specified with the block escape \P{IsX}. ([\P{IsX}] = [^\p{IsX}]).
> >...
> >For example,
> >the "block escape" for identifying the ASCII characters is \p{IsBasicLatin}.
> >
> >so that would be \p{IsArabic}
> >
> >David
> 
> 
> 
> I want to use the above construct to detect Japanese 
> characters, and so I am using the following xsl:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet version="2.0" 
> xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
>      <xsl:output method="xml" indent="yes" encoding="UTF-8" />
>      <xsl:template match="/text">
>         <xsl:for-each select="tokenize(.,'\s+')">
>           <word>
>             <xsl:attribute name="language">
>               <xsl:choose>
>                  <xsl:when 
> test="matches(.,'\p{IsCJKCompatibility}+')">Japanese</xsl:when>
>                  <xsl:when 
> test="matches(.,'\p{IsBasicLatin}+')">Latin</xsl:when>
>                  <xsl:otherwise>Unknown</xsl:otherwise>
>               </xsl:choose>
>             </xsl:attribute>
>           </word>
>         </xsl:for-each>
>      </xsl:template>
> </xsl:stylesheet>
> 
> However, the Japanese characters in my input, which are 
> encoded in UTF-8, come out flagged as Latin or Unknown.  What 
> am I doing wrong?  How do I get this to recognize the 
> Japanese characters?
> 
> Thanks for any help you can offer.
> 
> John Besch
