Subject: Re: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions
From: John Besch <jbesch@xxxxxxx>
Date: Mon, 12 Jun 2006 15:25:34 -0400
> How, for example, to use a useful syntax like
> matches(.,'\p{Script:Arabic}+') ?
>
>schema-2 says: http://www.w3.org/TR/xmlschema-2/#regexs
>
>[Definition:] [Unicode Database] groups code points into a number of
>blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul
>Jamo, CJK Compatibility, etc. The set containing all characters that
>have block name X (with all white space stripped out), can be identified
>with a block escape \p{IsX}. The complement of this set is specified
>with the block escape \P{IsX}. ([\P{IsX}] = [^\p{IsX}]).
>...
>For example,
>the "block escape" for identifying the ASCII characters is \p{IsBasicLatin}.
>
>so that would be \p{IsArabic}
>
>David

I want to use the above construct to detect Japanese characters, so I am using the following XSL:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8" />
  <xsl:template match="/text">
    <xsl:for-each select="tokenize(.,'\s+')">
      <word>
        <xsl:attribute name="language">
          <xsl:choose>
            <xsl:when test="matches(.,'\p{IsCJKCompatibility}+')">Japanese</xsl:when>
            <xsl:when test="matches(.,'\p{IsBasicLatin}+')">Latin</xsl:when>
            <xsl:otherwise>Unknown</xsl:otherwise>
          </xsl:choose>
        </xsl:attribute>
      </word>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

However, the Japanese characters in my input, which is encoded in UTF-8, come out flagged as Latin or Unknown. What am I doing wrong? How do I get this to recognize the Japanese characters?

Thanks for any help you can offer.

John Besch
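
One possible explanation, offered as a sketch rather than a confirmed answer: in the Unicode block list, CJK Compatibility covers only U+3300 to U+33FF (squared abbreviations and unit symbols), so ordinary Japanese text would not fall in it. Kana and kanji live in the Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF) and CJK Unified Ideographs (U+4E00 to U+9FFF) blocks. The variant below assumes those three blocks are enough for the input at hand; that choice of blocks, and the added xsl:value-of that copies the token into the word element, are assumptions and not part of the original stylesheet. Note also that kanji are shared with Chinese, so a block test alone cannot prove a word is Japanese.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8" />
  <xsl:template match="/text">
    <!-- [^\s] tokens only: skip the empty strings tokenize() yields
         when the text starts or ends with white space -->
    <xsl:for-each select="tokenize(.,'\s+')[. ne '']">
      <word>
        <xsl:attribute name="language">
          <xsl:choose>
            <!-- Assumed block set: any kana or unified ideograph in the token -->
            <xsl:when test="matches(.,'[\p{IsHiragana}\p{IsKatakana}\p{IsCJKUnifiedIdeographs}]')">Japanese</xsl:when>
            <!-- Anchored, so the whole token must be ASCII -->
            <xsl:when test="matches(.,'^\p{IsBasicLatin}+$')">Latin</xsl:when>
            <xsl:otherwise>Unknown</xsl:otherwise>
          </xsl:choose>
        </xsl:attribute>
        <!-- Added for illustration: keep the token text in the output -->
        <xsl:value-of select="."/>
      </word>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

The Latin test is anchored with ^ and $ because matches() succeeds when any substring matches; without the anchors, a token containing a single ASCII character would be reported as Latin.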