Subject: RE: [xsl] [XSLT2.0] xsl:analyze-string@regex syntax too limited
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 16 Dec 2004 09:19:20 -0000

I think this post illustrates why these questions always involve more WG debate than one initially expects, and why the WG is now taking a fairly strict line on "no new functionality". (The Perl definition of \b, incidentally, is quite different from that quoted.)

The other suggestion Gunther made was to relax the rules on vendor extensions to the regex syntax. There is in fact a proposal on the table from one of the XQuery vendors to do that. Traditionally the XSL WG has taken a pretty tough line on vendor extensions, the principle being that it must be possible for a processor to detect that extensions are in use, and it must be possible for a user to write fallback code that keeps the stylesheet portable. This policy can be traced back to the original expectation that XSLT would usually run in the browser, and the stylesheet author had no control over which browser it would run in. But I think the policy has served the community well.

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: Colin Paul Adams [mailto:colin@xxxxxxxxxxxxxxxxxx]
> Sent: 16 December 2004 07:25
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [xsl] [XSLT2.0] xsl:analyze-string@regex syntax too limited
>
> >>>>> "Gunther" == Gunther Schadow <gunther@xxxxxxxxxxxxxxxxxxxxxx> writes:
>
>     Gunther> The boundary matcher matches a zero-width substring
>     Gunther> between a character matching the character class
>     Gunther> [A-Za-z_0-9] and a character matching the character class
>     Gunther> [^A-Za-z_0-9] or vice versa. </quote>
>
>     Gunther> This is pretty clear. It may not make the
>     Gunther> internationalization people very happy because I can't do
>     Gunther> word-boundary matches on Hindi text. That's a true
>     Gunther> concern.
>
> So address it. Unicode report TR18 says (for Level 1 support):
>
> RL1.4 Simple Word Boundaries
> To meet this requirement, an implementation shall
> extend the word boundary mechanism so that:
>
>    1. The class of <word_character> includes all the
>       Alphabetic values from the Unicode character database, from
>       UnicodeData.txt [UData]. See also Annex C: Compatibility Properties.
>    2. Non-spacing marks are never divided from their base
>       characters, and otherwise ignored in locating boundaries.
>
> Level 2 provides more general support for word boundaries between
> arbitrary Unicode characters which may override this behavior.
>
> Level 1 support should certainly be met.
> --
> Colin Paul Adams
> Preston Lancashire
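
To make the difference between the two quoted definitions concrete, here is a rough Python sketch. It is not how any XSLT processor implements \b; the function names are made up for illustration, and `is_unicode_word` is only an approximation of the TR18 <word_character> class (it uses general categories instead of the Alphabetic property). Combining marks of every kind are attached to their base character, which is slightly broader than the "non-spacing marks" wording in RL1.4.

```python
import unicodedata

# The character class from the quoted definition: [A-Za-z_0-9]
ASCII_WORD_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "_0123456789"
)


def is_ascii_word(ch):
    """True if ch is in the quoted class [A-Za-z_0-9]."""
    return ch in ASCII_WORD_CHARS


def is_unicode_word(ch):
    """Rough stand-in for the TR18 <word_character> class (an assumption)."""
    category = unicodedata.category(ch)
    return category.startswith("L") or category in ("Nl", "Nd") or ch == "_"


def boundaries(text, is_word, attach_marks=False):
    """Return each index i where a boundary lies between text[i-1] and text[i].

    As in the quoted definition, only positions between two characters are
    considered; the start and end of the string are not reported.
    """
    found = []
    previous = None  # word-ness of the most recent base character
    for i, ch in enumerate(text):
        is_mark = unicodedata.category(ch).startswith("M")
        if attach_marks and is_mark and previous is not None:
            continue  # a mark is never split from its base character
        current = is_word(ch)
        if previous is not None and current != previous:
            found.append(i)
        previous = current
    return found


# Two Hindi words separated by a space, written as escapes for portability.
HINDI = "\u0928\u092e\u0938\u094d\u0924\u0947 \u0926\u0941\u0928\u093f\u092f\u093e"

ascii_result = boundaries("foo bar_1!", is_ascii_word)
hindi_ascii_result = boundaries(HINDI, is_ascii_word)
hindi_unicode_result = boundaries(HINDI, is_unicode_word, attach_marks=True)

print(ascii_result)          # [3, 4, 9]
print(hindi_ascii_result)    # []
print(hindi_unicode_result)  # [6, 7]
```

With the ASCII-only class the Hindi sample has no boundaries at all, which is the concern Gunther raises; with the approximate Level 1 rules the only boundaries are on either side of the space.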