Subject: Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft) From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx> Date: Sat, 12 Jan 2002 16:07:15 +0000 |
Hi Marc, > thinking about doing it... the reality of the regex engine is that > while you would like the tree to appear somewhere in memory to > select on, the way the regex engine will offer you the matchresult > will be more in an array like format. Yes, but when you talk about "the regexp engine" you're talking about a specific implementation with a specific API. I don't see any reason why a different implementation with a different API shouldn't offer you a different view. And I'd argue that we're going to need regular expression engines specific to XSLT anyway (for support in XSLT, I know that your concerns aren't necessarily the same) - regular expression engines for XML Schema regular expressions aren't going to cover everything we need for XSLT (e.g. start-string and end-string meta characters), and other regexp engines won't support the XML Schema stuff. But even having said that, perhaps you can get around it with some jiggery-pokery. Say you had $number defined as the regular expression: ([0-9]+)(\.([0-9]+))? and you had a regular expression of: [+-]?($number)([Ee]([+-][0-9]+))? Matching against: -12.34E+5 If you simply substitute in the $number expression to get a result expression: [+-]?(([0-9]+)(\.([0-9]+))?)([Ee]([+-][0-9]+))? Then run the match on that. The result of the match could come in pairs of start/end integers, giving the position of the start and end of the match as a whole and each subexpression, something akin to (assuming counting starts from 0): ( (0, 9), (1, 6), (1, 3), (4, 6), (6, 9), (7, 9), (8, 9) ) You can tell whether one subexpression is 'inside' another subexpression by the fact that its start position is equal to or greater than the start position of the parent position, and its end position is equal to or less than the start position of the parent position. So from that you should be able to construct the tree: <rxp:match> - <rxp:match> <rxp:match>12</rxp:match>.<rxp:match>34</rxp:match> </rxp:match> <rxp:match>E<rxp:match>+5</rxp:match></rxp:match> </rxp:match> You can work out which of these rxp:match elements should actually be a 'number' element by counting the number of open-brace characters before the variable reference in the regular expression. In the case of: [+-]?($number)([Ee]([+-][0-9]+))? it's the first one, so the first rxp:match element (inside the global match) should be a number element instead: <rxp:match> - <number> <rxp:match>12</rxp:match>.<rxp:match>34</rxp:match> </number> <rxp:match>E<rxp:match>+5</rxp:match></rxp:match> </rxp:match> I'll note that the *big* drawback with this algorithm is that it doesn't take into account nested subexpressions. Say you had something like: $para => \\para\{($phrase)\} $phrase => (($bold)|($italic)|([^\\]*))* $bold => \\bold\{($phrase)\} $italic => \\italic\{($phrase)\} to match a string like: \para{\italic{this} is \bold{bold \italic{and italic}} text.} This is essentially David's problem. You can't just expand all the variable subexpressions because they'd go on for ever. But without expanding all the variable subexpressions, you can't successfully backtrack if you suddenly find that a variable subexpression isn't in fact matched. For example, if you had: $baz => \\baz\{(($barfoo)|($barbar))\} $barbar => \\bar\{($bar)\} $barfoo => \\bar\{($foo)\} $foo => foo And a string like: \baz{\bar{\bar{foo}} We want to get: <baz> \baz{ <barbar>\bar{<barfoo>\bar{<foo>foo</foo>}</barfoo>}</barbar> } </baz> We can start off by matching the string against $baz and working out that to match it, then the regular expression: (($barfoo)|($barbar))\} must match the string: \bar{\bar{foo}}} and that means that the following regular expression must match the string: ((\\bar\{($foo)\})|(\\bar\{($bar)\}))\} but there's no way to tell whether the \bar at the beginning of the string should match the \\bar at the beginning of $barfoo or of $barbar. That might be an acceptable limitation, or it might be something that can be worked around, I'm not sure. I need to think about it some more. > the feeling I currently have is that this nested-regex to tree > vision directly has the power of circumventing the need for the > nested matcher structure alltogether... meaning with this kind of > tree view on the regex-match-result one can hand it over to the > regular-and-known xsl-node-juggling, I guess. Quite - that's what I think makes it so powerful. If you get a tree like the above, you could apply templates to it and so on. You have one big regular expression that gets directly translated into a tree, which you can then process using XSLT as usual. >> I should note that nothing existing in XPath or XSLT automatically >> creates a tree in this way. > > And this would be an argument to keep the nested-matcher approach. > They behave more like the known templates that do create trees I > guess? Hey, why limit yourself to only doing things in the way that they're done now?!? :) But I think I'm still not very clear on your nested matcher approach. Do you think you could use the examples above to illustrate how your nested matcher would work? Cheers, Jeni --- Jeni Tennison http://www.jenitennison.com/ XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
Current Thread |
---|
|
<- Previous | Index | Next -> |
---|---|---|
RE: Regular expression functions (W, Marc Portier | Thread | Re: Regular expression functions (W, Jeni Tennison |
Re: [xsl] xalan error with plain t, Jeni Tennison | Date | [xsl] Reverse order axis operators., lists |
Month |