Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)

From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Tue, 8 Jan 2002 15:13:59 +0000

David,

>> Most regular expression languages don't find overlapping matches,
>> do they? It seems to add a lot of extra complexity if they do.
>
> No, but then they don't return a list of all matches either.

Some do, if it's a global match. From some JScript documentation:

 "If the global flag (g) is not set, Element zero of the array
  contains the entire match, while elements 1 - n contain any
  submatches that have occurred within the match.... If the global
  flag is set, elements 0 - n contain all matches that occurred."
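
To make the two behaviours concrete, here is a rough sketch in Python
rather than JScript or XPath; the pattern and sample strings are made
up for illustration, and Python's re module is only standing in for
whatever the F&O draft ends up specifying:

```python
import re

pair = re.compile(r"(\d+)-(\d+)")
text = "10-20 30-40"

# Without a global flag: one match, its whole text plus its submatches.
first = pair.search(text)
single = [first.group(0), *first.groups()]

# With a global flag: every match in the string, left to right.
every = [m.group(0) for m in pair.finditer(text)]

# Matches never overlap: scanning resumes after the end of each match,
# so the "aa" starting at the second character is not reported.
non_overlapping = [m.start() for m in re.finditer(r"aa", "aaaa")]
```

Here single holds the match followed by its two submatches, every
holds both whole matches, and non_overlapping holds the offsets 0 and
2 only.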

> In Xpath you can't do that. So a replace function that only lets you
> replace one set of unstructured input by some more unstructured
> output is not particularly useful.

I agree with your analysis about regexp replace in general, though
it's not altogether useless - when global, at least it goes some way
towards helping with the classic multi-string-replacement problem. For
example, to escape newline characters with "\n", tabs with "\t" and
carriage returns with "\r":

  replace(replace(replace($string, '&#xA;', '\n'),
                  '&#x9;', '\t'),
          '&#xD;', '\r')

(or more manageably with a simple mapping operator:

  $string -> replace(., '&#xA;', '\n')
          -> replace(., '&#x9;', '\t')
          -> replace(., '&#xD;', '\r')

Sorry, couldn't resist.)
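
For comparison, the same escaping can be sketched in Python, once as
a chain of global replacements and once as a single pass with a
lookup table. The function names and the ESCAPES table are invented
for this example:

```python
import re

# Hypothetical mapping from raw character to its escaped spelling.
ESCAPES = {"\n": "\\n", "\t": "\\t", "\r": "\\r"}

def escape_chained(s: str) -> str:
    """One global replacement per character, like the nested replace() calls."""
    for raw, escaped in ESCAPES.items():
        s = s.replace(raw, escaped)
    return s

def escape_single_pass(s: str) -> str:
    """One scan over the string, looking each matched character up in the table."""
    return re.sub(r"[\n\t\r]", lambda m: ESCAPES[m.group(0)], s)
```

The two agree here because none of the replacement strings contains a
character that a later step would replace; where that is not true,
only the single pass is safe.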

But as you've illustrated this doesn't help with the other classic in
this genre, which is replacing &#xA; characters with <br /> elements.

> If however the match function returned the sequence of substrings
> matched or equivalently a sequence of the match positions, then the
> string could be broken up and nodes added as required.

I think that you need a sequence of match positions *and lengths* in
the latter case, to make it possible to pull out the matched string?
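
As a rough sketch of what that would allow, here it is in Python with
ElementTree standing in for the result tree. match_spans and
lines_to_paragraph are names invented for this example, and positions
are 1-based to mirror XPath's substring():

```python
import re
import xml.etree.ElementTree as ET

def match_spans(pattern, string):
    """Return (position, length) pairs for each match, positions 1-based."""
    return [(m.start() + 1, m.end() - m.start())
            for m in re.finditer(pattern, string)]

def lines_to_paragraph(string):
    """Break the string at each newline and put a <br/> element in its place."""
    p = ET.Element("p")
    last_br = None   # most recent <br/>; its tail holds the text that follows it
    cursor = 1       # 1-based position of the first character not yet copied
    for pos, length in match_spans("\n", string):
        chunk = string[cursor - 1:pos - 1]
        if last_br is None:
            p.text = chunk
        else:
            last_br.tail = chunk
        last_br = ET.SubElement(p, "br")
        cursor = pos + length
    rest = string[cursor - 1:]
    if last_br is None:
        p.text = rest
    else:
        last_br.tail = rest
    return p
```

Without the lengths, the step that advances cursor past the match
could not be written for anything other than a fixed-width pattern.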

Hmm... can't help thinking that these flat sequences are going to
make processing quite difficult - extracting a list of the matched
strings from the sequence would mean:

  for $i in (1 to count($matches) div 2)
  return substring($string, $matches[2 * $i - 1], $matches[2 * $i])

or a recursive function, neither of which is particularly practical.
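
The pairing logic is perhaps easier to see outside XPath. A minimal
Python sketch, assuming the flat sequence alternates 1-based position
and length (matched_strings is a made-up helper name):

```python
def matched_strings(string, matches):
    """Extract matched substrings from a flat [pos1, len1, pos2, len2, ...] list.

    Positions are assumed to be 1-based, as with XPath's substring().
    """
    if len(matches) % 2 != 0:
        raise ValueError("expected alternating position/length values")
    positions = matches[0::2]   # items 1, 3, 5, ... of the flat sequence
    lengths = matches[1::2]     # items 2, 4, 6, ... of the flat sequence
    return [string[pos - 1:pos - 1 + length]
            for pos, length in zip(positions, lengths)]
```

For example, matched_strings("hello world", [1, 5, 7, 5]) gives the
two words back as separate strings.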

On the other hand, I think it's impossible to reliably go from the
matched subexpression string to the location of the subexpression
within the original string.
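
A small Python illustration of why, using a made-up pattern and
string: searching for the submatch text finds its first occurrence,
which need not be the one the pattern actually matched.

```python
import re

string = "aba"
m = re.search(r"b(a)", string)

submatch = m.group(1)                  # the text of the subexpression: "a"
actual_offset = m.start(1)             # where the engine really matched it
guessed_offset = string.find(submatch) # where a plain search for "a" lands
```

Here actual_offset is 2 but guessed_offset is 0 (both 0-based), so
the location cannot be recovered from the string alone.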
  
> Actually it might be interesting (and more in the xpath style) to
> allow omnimark style named variable binding (the found-text in the
> above) within the search string which would then be accessed by
> normal xpath variable reference, $found-text, in any functions
> triggered by the replacement code.

You *could* do this implicitly by setting the variables $1..$N, since
authors cannot set these variables themselves (invalid names). But
either seems a bit messy to me - how do you define the scope, for one
thing?
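
For what it's worth, one possible answer to the scope question is the
one Python takes with named groups: the bound names exist only on the
match object handed to the replacement callback. This is a sketch of
that design, not a proposal for XPath syntax; the pattern and the
function name are invented, and found_text stands in for $found-text:

```python
import re

# Square-bracketed text is bound to the name "found_text".
BRACKETED = re.compile(r"\[(?P<found_text>[^\]]*)\]")

def upper_case_bracketed(s: str) -> str:
    """Replace each [text] with TEXT, reading the named binding per match."""
    def replacement(m):
        # The binding is visible only here, for this one match.
        return m.group("found_text").upper()
    return BRACKETED.sub(replacement, s)
```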

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

