Subject: RE: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Mon, 12 Jun 2006 21:28:30 +0100

The CJKCompatibility block covers the codepoint range x3300-x33FF only. I
would imagine that to match Japanese language characters you are looking for
a much larger range than this.

If the range of codepoints you want to match doesn't correspond to one of the
named blocks, you can always write, for example,

[&_#x3000;-&_#xFE4F;]

(without the underscores).

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: jbesch@xxxxxxx [mailto:jbesch@xxxxxxx]
> Sent: 12 June 2006 20:26
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Cc: jbesch@xxxxxxx
> Subject: Re: [xsl] XSLT 2.0 : Unicode hex notation in regular expressions
>
> > > How, for example, to use a useful syntax like
> > > matches(.,'\p{Script:Arabic}+') ?
> >
> >schema-2 says: http://www.w3.org/TR/xmlschema-2/#regexs
> >
> >[Definition:] [Unicode Database] groups code points into a number of
> >blocks such as Basic Latin (i.e., ASCII), Latin-1 Supplement, Hangul
> >Jamo, CJK Compatibility, etc. The set containing all characters that
> >have block name X (with all white space stripped out), can be
> >identified with a block escape \p{IsX}. The complement of this set is
> >specified with the block escape \P{IsX}. ([\P{IsX}] = [^\p{IsX}]).
> >...
> >For example,
> >the "block escape" for identifying the ASCII characters is
> >\p{IsBasicLatin}.
> >
> >so that would be \p{IsArabic}
> >
> >David
> >
>
> I want to use the above construct to detect Japanese characters, and so
> I am using the following xsl:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet version="2.0"
>     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
>   <xsl:output method="xml" indent="yes" encoding="UTF-8" />
>   <xsl:template match="/text">
>     <xsl:for-each select="tokenize(.,'\s+')">
>       <word>
>         <xsl:attribute name="language">
>           <xsl:choose>
>             <xsl:when test="matches(.,'\p{IsCJKCompatibility}+')">Japanese</xsl:when>
>             <xsl:when test="matches(.,'\p{IsBasicLatin}+')">Latin</xsl:when>
>             <xsl:otherwise>Unknown</xsl:otherwise>
>           </xsl:choose>
>         </xsl:attribute>
>       </word>
>     </xsl:for-each>
>   </xsl:template>
> </xsl:stylesheet>
>
> However, the Japanese characters in my input, which are encoded in
> UTF-8, come out flagged as Latin or Unknown. What am I doing wrong?
> How do I get this to recognize the Japanese characters?
>
> Thanks for any help you can offer.
>
> John Besch
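
For illustration, the explicit-range suggestion can be dropped into the quoted stylesheet roughly as follows. This is only a sketch, not a tested fix: the <words> wrapper element, the normalize-space() call, and the xsl:value-of that copies each token into the output are additions that were not in the original stylesheet. The range x3000-xFE4F is the example range from the reply; it also spans Chinese ideographs and Hangul syllables, so the label "Japanese" is a heuristic at best.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes" encoding="UTF-8"/>

  <xsl:template match="/text">
    <!-- <words> is a hypothetical wrapper so the result has a single root -->
    <words>
      <!-- normalize-space() (an addition) avoids empty leading/trailing tokens -->
      <xsl:for-each select="tokenize(normalize-space(.), '\s+')">
        <word>
          <xsl:attribute name="language">
            <xsl:choose>
              <!-- Explicit codepoint range in place of \p{IsCJKCompatibility}.
                   The XML parser resolves the character references before
                   the regex engine sees the pattern. -->
              <xsl:when test="matches(., '[&#x3000;-&#xFE4F;]')">Japanese</xsl:when>
              <xsl:when test="matches(., '\p{IsBasicLatin}')">Latin</xsl:when>
              <xsl:otherwise>Unknown</xsl:otherwise>
            </xsl:choose>
          </xsl:attribute>
          <!-- Copy the token itself into the output (an addition) -->
          <xsl:value-of select="."/>
        </word>
      </xsl:for-each>
    </words>
  </xsl:template>
</xsl:stylesheet>
```

Because matches() succeeds when any substring matches, a token that mixes scripts is labelled by the first xsl:when that hits. If a narrower test is wanted, named blocks such as \p{IsHiragana}, \p{IsKatakana} and \p{IsCJKUnifiedIdeographs} could be combined in one character class instead of the raw range.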