Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)

Subject: Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Sat, 12 Jan 2002 16:07:15 +0000
Hi Marc,

> thinking about doing it... the reality of the regex engine is that
> while you would like the tree to appear somewhere in memory to
> select on, the way the regex engine will offer you the matchresult
> will be more in an array like format.

Yes, but when you talk about "the regexp engine" you're talking about
a specific implementation with a specific API. I don't see any reason
why a different implementation with a different API shouldn't offer
you a different view.

And I'd argue that we're going to need regular expression engines
specific to XSLT anyway (for support in XSLT, I know that your
concerns aren't necessarily the same) - regular expression engines for
XML Schema regular expressions aren't going to cover everything we
need for XSLT (e.g. start-string and end-string meta characters), and
other regexp engines won't support the XML Schema stuff.

But even having said that, perhaps you can get around it with some
jiggery-pokery. Say you had $number defined as the regular expression:

  ([0-9]+)(\.([0-9]+))?

and you had a regular expression of:

  [+-]?($number)([Ee]([+-][0-9]+))?

Matching against:

  -12.34E+5

If you simply substitute in the $number expression to get a result
expression:

  [+-]?(([0-9]+)(\.([0-9]+))?)([Ee]([+-][0-9]+))?

Then run the match on that. The result of the match could come in
pairs of start/end integers, giving the position of the start and end
of the match as a whole and each subexpression, something akin to
(assuming counting starts from 0):

  ( (0, 9), (1, 6), (1, 3), (4, 6), (6, 9), (7, 9), (8, 9) )

You can tell whether one subexpression is 'inside' another
subexpression by the fact that its start position is equal to or
greater than the start position of the parent position, and its end
position is equal to or less than the start position of the parent
position. So from that you should be able to construct the tree:

  <rxp:match>
    -
    <rxp:match>
      <rxp:match>12</rxp:match>.<rxp:match>34</rxp:match>
    </rxp:match>
    <rxp:match>E<rxp:match>+5</rxp:match></rxp:match>
  </rxp:match>

You can work out which of these rxp:match elements should actually be
a 'number' element by counting the number of open-brace characters
before the variable reference in the regular expression. In the case
of:

  [+-]?($number)([Ee]([+-][0-9]+))?

it's the first one, so the first rxp:match element (inside the global
match) should be a number element instead:

  <rxp:match>
    -
    <number>
      <rxp:match>12</rxp:match>.<rxp:match>34</rxp:match>
    </number>
    <rxp:match>E<rxp:match>+5</rxp:match></rxp:match>
  </rxp:match>

  
I'll note that the *big* drawback with this algorithm is that it
doesn't take into account nested subexpressions. Say you had something
like:

  $para   =>  \\para\{($phrase)\}
  $phrase =>  (($bold)|($italic)|([^\\]*))*
  $bold   =>  \\bold\{($phrase)\}
  $italic =>  \\italic\{($phrase)\}

to match a string like:

  \para{\italic{this} is \bold{bold \italic{and italic}} text.}

This is essentially David's problem. You can't just expand all the
variable subexpressions because they'd go on for ever. But without
expanding all the variable subexpressions, you can't successfully
backtrack if you suddenly find that a variable subexpression isn't in
fact matched. For example, if you had:

  $baz     =>  \\baz\{(($barfoo)|($barbar))\}
  $barbar  =>  \\bar\{($bar)\}
  $barfoo  =>  \\bar\{($foo)\}
  $foo     =>  foo

And a string like:

  \baz{\bar{\bar{foo}}

We want to get:

  <baz>
    \baz{
    <barbar>\bar{<barfoo>\bar{<foo>foo</foo>}</barfoo>}</barbar>
    }
  </baz>

We can start off by matching the string against $baz and working out
that to match it, then the regular expression:

  (($barfoo)|($barbar))\}

must match the string:

  \bar{\bar{foo}}}

and that means that the following regular expression must match the
string:

  ((\\bar\{($foo)\})|(\\bar\{($bar)\}))\}

but there's no way to tell whether the \bar at the beginning of the
string should match the \\bar at the beginning of $barfoo or of
$barbar.

That might be an acceptable limitation, or it might be something that
can be worked around, I'm not sure. I need to think about it some
more.

> the feeling I currently have is that this nested-regex to tree
> vision directly has the power of circumventing the need for the
> nested matcher structure alltogether... meaning with this kind of
> tree view on the regex-match-result one can hand it over to the
> regular-and-known xsl-node-juggling, I guess.

Quite - that's what I think makes it so powerful. If you get a tree
like the above, you could apply templates to it and so on. You have
one big regular expression that gets directly translated into a tree,
which you can then process using XSLT as usual.

>> I should note that nothing existing in XPath or XSLT automatically
>> creates a tree in this way.
>
> And this would be an argument to keep the nested-matcher approach.
> They behave more like the known templates that do create trees I
> guess?

Hey, why limit yourself to only doing things in the way that they're
done now?!? :)

But I think I'm still not very clear on your nested matcher approach.
Do you think you could use the examples above to illustrate how your
nested matcher would work?

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


Current Thread