Re: [xsl] [XSLT2.0] xsl:analyze-string@regex syntax too limited

Subject: Re: [xsl] [XSLT2.0] xsl:analyze-string@regex syntax too limited
From: Gunther Schadow <gunther@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 16 Dec 2004 18:14:56 -0500
Thanks, good find. The only problem now is that this issue needs to be
addressed in java.util.regex.
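
Until that happens, the RL1.4 rules quoted below can be approximated by hand with Unicode property classes. The following is only a rough sketch, not a fix for java.util.regex itself. It assumes a Java version whose Pattern supports \p{IsAlphabetic} (Java 7 or later), and the class name WordBoundarySketch and the helper words() are made up for illustration. It treats a word as an Alphabetic character followed by any run of Alphabetic characters and non-spacing marks, so marks stay attached to their base characters. For comparison it also runs the ASCII-only class [A-Za-z_0-9] from the quoted documentation over the same Hindi text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of java.util.regex.
final class WordBoundarySketch {

    // The ASCII-only word class quoted from the java.util.regex documentation.
    static final Pattern ASCII_WORD = Pattern.compile("[A-Za-z_0-9]+");

    // Approximation of RL1.4: a word starts with an Alphabetic character and
    // continues over Alphabetic characters and non-spacing marks (Mn), so a
    // mark is never split from its base character.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\p{IsAlphabetic}[\\p{IsAlphabetic}\\p{Mn}]*");

    // Collects every substring of text that the given pattern matches.
    static List<String> words(Pattern pattern, String text) {
        List<String> found = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Two Devanagari words separated by a space, written with escapes
        // so the source file encoding does not matter.
        String hindi = "\u0939\u093F\u0902\u0926\u0940 \u092D\u093E\u0937\u093E";

        System.out.println("ASCII word class:   " + words(ASCII_WORD, hindi).size() + " words");
        System.out.println("Unicode word class: " + words(UNICODE_WORD, hindi).size() + " words");
    }
}
```

This only finds whole words rather than providing a drop-in replacement for the boundary matcher, and it ignores the Level 2 refinements mentioned below.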

Colin Paul Adams wrote:

>>>>>>"Gunther" == Gunther Schadow <gunther@xxxxxxxxxxxxxxxxxxxxxx> writes:
> 
> 
>     Gunther> The boundary matcher matches a zero-width substring
>     Gunther> between a character matching the character class
>     Gunther> [A-Za-z_0-9] and a character matching the character class
>     Gunther> [^A-Za-z_0-9] or vice versa.  </quote>
> 
>     Gunther> This is pretty clear. It may not make the
>     Gunther> internationalization people very happy because I can't do
>     Gunther> word-boundary matches on Hindi text. That's a true
>     Gunther> concern.
> 
> So address it. Unicode report TR18 says (for Level 1 support):
> 
> RL1.4  Simple Word Boundaries
>   To meet this requirement, an implementation shall extend the word
>   boundary mechanism so that:
>
>    1. The class of <word_character> includes all the Alphabetic values
>       from the Unicode character database, from UnicodeData.txt [UData].
>       See also Annex C: Compatibility Properties.
>    2. Non-spacing marks are never divided from their base characters,
>       and otherwise ignored in locating boundaries.
> 
> Level 2 provides more general support for word boundaries between
> arbitrary Unicode characters which may override this behavior.
> 
> Level 1 support should certainly be met.

-- 
Gunther Schadow, M.D., Ph.D.                  gschadow@xxxxxxxxxxxxxxx
Associate Professor           Indiana University School of Informatics
Regenstrief Institute, Inc.      Indiana University School of Medicine
tel:1(317)630-7960                       http://aurora.regenstrief.org
