Subject: Re: [xsl] [XSLT2.0] xsl:analyze-string@regex syntax too limited
From: Gunther Schadow <gunther@xxxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 15 Dec 2004 17:25:48 -0500

Thanks, Michael, for the "warning".

> It doesn't seem to be there yet...

... was held up by the confirmation system they now have in place.
Should be there now.

> Please note there's no need to comment separately on the two documents. XSLT
> will automatically pick up any changes made to the XPath functions.

O.K., too late now. I figured: make more noise so you will be heard :-)

>>Michael Kay had to add a pretty complex piece of code to his
>>Saxon processor just to cripple the available regex syntax which
>>was previously supported. That's ridiculous.
>
> It's very unlikely that XPath will support the whole of the Java regex
> syntax, for example the POSIX character classes won't get past the I18N
> scrutineers.

I never use these character classes; they are unnecessary syntactic
sugar. I always use generic character classes [...] instead and never
saw the point of remembering all those \w \W \s \p{Quark} things.

> Also, Java regexes match 16-bit UTF16 values, not Unicode
> characters: so given a character outside the BMP, it counts as two
> characters in a Java regex but as one character in an XPath regex - a lot of
> the regex translation code in Saxon is designed to handle such differences,
> not to remove functionality.

O.K., can you actually overcome that? It sounds to me like that's an
extension request that needs to go to Java, because I bet that many of
the present Java XML processing gizmos would fail on Unicode above the
BMP range.

Regarding the specifications, I see the problem. All I am asking for is
to put back \b, (?:...), (?=...) and (?!...). There seems to be no
formal regex specification (but there isn't a formal specification for
many other things either). So, all that needs to be done is to add
specifications of these 4 elements that are as formal as the current
XPath F&O specification for regex. It doesn't seem too hard to meet
that standard, though. See the specification on the reluctant quantifiers.
All it really says is "matches the shortest possible substring
consistent with the match as a whole succeeding". So, for boundary we
can just say:

<quote>
Boundary matcher is supported. This is indicated by a "\b". The
boundary matcher matches a zero-width substring between a character
matching the character class [A-Za-z_0-9] and a character matching
the character class [^A-Za-z_0-9] or vice versa.
</quote>

This is pretty clear. It may not make the internationalization people
very happy, because I can't do word-boundary matches on Hindi text.
That's a real concern. Again, it is something that needs to be taken up
with the Java specification as well. As a fallback, positive look-ahead
and look-behind may help in that situation. So, let's address that:

<quote>
Positive look-ahead is supported. This is indicated by a parenthesis
beginning with "(?=" and ending with the matching ")". Positive
look-ahead matches if the present matching substring M is followed by
a substring L matching the positive look-ahead but without L being
part of M.
</quote>

That way a \b could be emulated as

  "[A-Za-z_0-9]+\b" -> "[A-Za-z_0-9]+(?=[^A-Za-z_0-9])"
  "\b[A-Za-z_0-9]+" -> "(?<=[^A-Za-z_0-9])[A-Za-z_0-9]+"

and I could now use Devanagari (or Thai for James Clark :-) instead of
the US ASCII word characters.

As for WG time, this could be prepared offline by email before the
meeting so that it doesn't chew up WG time.

regards,
-Gunther

-- 
Gunther Schadow, M.D., Ph.D.
gschadow@xxxxxxxxxxxxxxx
Associate Professor, Indiana University School of Informatics
Regenstrief Institute, Inc., Indiana University School of Medicine
tel:1(317)630-7960
http://aurora.regenstrief.org
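
The \b emulation proposed in the message can be tried out with a short
sketch. It uses Python's re module purely as a stand-in regex engine
(an assumption for illustration; the thread concerns XPath 2.0 and Java
regexes, whose behaviour may differ). The sample text and all variable
names are hypothetical, and the two "negative" patterns are a variation
that does not appear in the message.

```python
import re

# ASCII word class quoted in the message. Python's re module stands in for
# an XPath or Java regex engine here (an assumption made for illustration).
WORD = "A-Za-z_0-9"
TEXT = "foo bar, baz"  # hypothetical sample input

# Native \b, restricted to ASCII word characters via re.ASCII.
trailing_b = re.compile(rf"[{WORD}]+\b", re.ASCII)
leading_b = re.compile(rf"\b[{WORD}]+", re.ASCII)

# The rewrites given in the message: positive look-ahead and look-behind.
trailing_lookahead = re.compile(rf"[{WORD}]+(?=[^{WORD}])")
leading_lookbehind = re.compile(rf"(?<=[^{WORD}])[{WORD}]+")

# Variation not in the message: negative lookarounds, which also succeed
# at the very start and end of the input.
trailing_negative = re.compile(rf"[{WORD}]+(?![{WORD}])")
leading_negative = re.compile(rf"(?<![{WORD}])[{WORD}]+")


def words(pattern):
    """Return all non-overlapping matches of pattern in TEXT."""
    return pattern.findall(TEXT)


if __name__ == "__main__":
    for name, pattern in [
        ("trailing \\b", trailing_b),
        ("trailing look-ahead", trailing_lookahead),
        ("trailing negative look-ahead", trailing_negative),
        ("leading \\b", leading_b),
        ("leading look-behind", leading_lookbehind),
        ("leading negative look-behind", leading_negative),
    ]:
        print(f"{name}: {words(pattern)}")
```

In this sketch the positive lookaround forms agree with \b in the middle
of the text but skip the last word (look-ahead) or the first word
(look-behind), because they require an actual neighbouring character.
The negative forms avoid that edge case, which may be one more argument
for having (?!...) available alongside (?=...).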