RE: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)

Subject: RE: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)
From: "Marc Portier" <mpo@xxxxxxxxxxxxxxxx>
Date: Sat, 12 Jan 2002 12:38:38 +0100
Hi Jeni,

> -----Original Message-----
> From: owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> [mailto:owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx]On Behalf Of Jeni Tennison
> Sent: Friday, 11 January 2002 12:25
> To: Marc Portier
> Cc: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: Regular expression functions (Was: Re: [xsl] comments on
> December F&O draft)
>
>
> Hi Marc,
>
> > assume we have some :z: == (c1)(:x:){2} then the selection of index
> > x[2] would have no meaning, since there is only one x noted in the
> > regex
> >
> > and in normal regex behavior the numbered index 2 (2nd parenthesis) will
> > only hold the second occurrence of the :x: matching part of the string...
> > it is like writing (c1):x:(:x:)
>
> That's an interesting point. Assuming x matched 'c2', then that would
> mean a structure of:
>
>   <z>
>     <rxp:match>c1</rxp:match>
>     c2
>     <x>c2</x>
>   </z>
>

(referring to the nested-regex vs nested-matcher discussion)
I should check it out, but I'm really afraid that for (c1)(:x:){3}, with :x:
matching c2, the match result groups[] would actually be:
[0] c1c2c2c2
[1] c1
[2] c2	(the last of the 3)

and even the start-end positions would not be of more help... it's the regex
engine's way of saying you should write it differently if you want it to
behave differently
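
To check this rather than guess, here is a minimal sketch in Python's re module (my choice of engine for illustration, not anything from the draft). The literal c2 stands in for the :x: subexpression, and the group numbers are Python's own.

```python
import re

# The literal "c2" stands in for the :x: subexpression (an assumption
# made purely for illustration).
m = re.fullmatch(r"(c1)(c2){3}", "c1c2c2c2")

print(m.group(0))  # c1c2c2c2  (the whole match)
print(m.group(1))  # c1        (first parenthesis)
print(m.group(2))  # c2        (second parenthesis: last repetition only)
print(m.span(2))   # (6, 8)    (offsets of that last repetition)
```

The span of the repeated group only covers the final 'c2', so the positions of the first two repetitions cannot be read off the match result.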

getting it into
  <z>
    <rxp:match>c1</rxp:match>
    c2c2
    <x>c2</x>
  </z>

leaving little XPath-natural feeling for getting to the 1st or 2nd 'c2'...
which might go against natural XSLT feelings?

and it only gets worse when adding {n,m} kind of things in there :-(

somewhere internally the regex engine needs to know about the earlier
matches though... the different notations only tell it whether it can forget
about them...
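
A rough sketch of the {n,m} problem, again in Python's re as a stand-in engine and with c2 standing in for :x:. The extra outer group and the second findall pass are my own workaround for illustration, not something proposed in this thread.

```python
import re

# Outer group 2 captures the whole repetition; inner group 3 is the
# repeated part. "c2" is a stand-in for :x: (illustrative assumption).
pattern = re.compile(r"(c1)((c2){1,3})")

for text in ("c1c2", "c1c2c2c2"):
    m = pattern.fullmatch(text)
    # Group 3 holds only the last repetition, whatever the count was.
    last = m.group(3)
    # A second pass over the outer capture recovers every occurrence.
    occurrences = re.findall(r"c2", m.group(2))
    print(text, last, occurrences)
```

Both inputs report the same value for the inner group, so the number of repetitions is only visible through the outer capture.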

> > this is how regexes are working I'm afraid... (other hand, the
> > notations :z: == (c1)(:x:)(:x:) and/or :z: == (c1)((:x:){2}) would
> > possibly tackle what you really need)
>
> Yes - with the second of these, you would get something like:
>
>   <z>
>     <rxp:match>c1</rxp:match>
>     <rxp:match>
>       c2
>       <x>c2</x>
>     </rxp:match>
>   </z>
>
> which would at least allow you to get the result of the two xs
> combined.

yep.
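
For comparison, a small sketch of the two alternative notations in Python's re (an engine picked for illustration only), with c2 once more standing in for :x::

```python
import re

text = "c1c2c2"

# Notation 1: write each occurrence as its own group.
separate = re.fullmatch(r"(c1)(c2)(c2)", text)
print(separate.groups())  # ('c1', 'c2', 'c2')

# Notation 2: wrap the repetition in an outer group.
wrapped = re.fullmatch(r"(c1)((c2){2})", text)
print(wrapped.groups())   # ('c1', 'c2c2', 'c2')
```

The first form keeps every occurrence addressable by number; the second gives the combined text plus the last occurrence.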

>
> > oh and by the way, I started off this :subregex: notation, based on bad
> > memory of long-past perl days
> > just opened some doc again, and understand now that it used to be the
> > [:name:] notation for the posix characters... with added possible stuff
> > like [:^name:] and the like
>
> Hmm... Perl uses that notation for named character classes. The
> equivalent in the XML Schema regular expression language is roughly:
>
>   \p(name)     (characters in the named class)
>   \P(name)     (characters not in the named class)
>
> That's a different kind of thing to what we're doing here (where the
> named expressions are complete regular expressions rather than
> character classes). I'd be tempted to introduce a different escape
> character to do it, for example e (for expression):
>
>   \e(name)     (the named subexpression)
>   \E(name)     (not the named subexpression, if that's appropriate?)
>

wow, great idea, sounds like something to propose/bounce off some Perl
mailing list as well...
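
To make the \e(name) idea concrete, here is a minimal sketch of how such references could be expanded before handing the pattern to an ordinary engine. Everything in it is an assumption for illustration: the NAMED table, the mantissa and exponent definitions, and the expand() helper are hypothetical, and Python's re is used only as a convenient engine.

```python
import re

# Hypothetical named subexpressions; the definitions are illustrative only.
NAMED = {
    "mantissa": r"[0-9]+(?:\.[0-9]+)?",
    "exponent": r"[eE][+-]?[0-9]+",
}

# Matches a reference of the form \e(name).
REF = re.compile(r"\\e\((\w+)\)")

def expand(pattern):
    """Replace each \\e(name) with its definition in a non-capturing group."""
    return REF.sub(lambda m: "(?:" + NAMED[m.group(1)] + ")", pattern)

expanded = expand(r"\e(mantissa)\e(exponent)?")
print(expanded)
print(bool(re.fullmatch(expanded, "1.5e10")))  # True
print(bool(re.fullmatch(expanded, "1.5")))     # True (exponent is optional)
print(bool(re.fullmatch(expanded, "e10")))     # False (mantissa is required)
```

The sketch does a single pass, so a named expression that itself contains \e(...) would not be expanded; wrapping each definition in a non-capturing group keeps a trailing ? or {n,m} applied to the whole subexpression.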

> So something like:
>
>   \e(mantissa)\e(exponent)?
>
> > revoking my own introduction: maybe $name makes more sense in any
> > case?
>
> Using $name in the regular expression might be confusing - you'd need
> to make sure you could detect the end of the name, so probably ($name)
> would be better. (I think that if $ is introduced as matching the end
> of the string then you could safely state that it only matched the end
> of the string if it was at the end of the regular expression.)
>
> So something like:
>
>   ($mantissa)($exponent)
>
> I'd suggest {$name}, but only if regular expression support wasn't
> ever available through functions (because {$name} looks a lot like an
> AVT, and would make people think that they could put AVTs in
> attributes that held expressions).
>
> If the references look like variable references then they should
> probably be set with variable-binding elements (e.g. xsl:variable).

yep, also assuming you read and go along with the remark on parentheses in
these variables being literally matched as \( and \)?
and thus keeping these alongside the regex nesting with \e()
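
One possible reading of that remark, sketched in Python under my own assumptions: a ($name) reference inserts the variable's value as literal text (so any parentheses in the value are escaped), while \e(name) would stay reserved for nesting real regexes. The expand_variables() helper and the "unit" variable are hypothetical.

```python
import re

# Matches a variable reference of the form ($name).
VAR_REF = re.compile(r"\(\$(\w+)\)")

def expand_variables(pattern, variables):
    """Substitute ($name) with the variable's value, escaped to match literally."""
    return VAR_REF.sub(
        lambda m: "(" + re.escape(variables[m.group(1)]) + ")", pattern
    )

variables = {"unit": "f(x)"}  # hypothetical variable holding literal text
expanded = expand_variables(r"($unit)=[0-9]+", variables)
print(expanded)               # (f\(x\))=[0-9]+

m = re.fullmatch(expanded, "f(x)=42")
print(m.group(1))             # f(x)
```

The parentheses inside the value are matched as \( and \), while the outer pair from ($unit) still acts as a capturing group.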

>
> Cheers,
>
> Jeni
>
> ---
> Jeni Tennison
> http://www.jenitennison.com/
>
>
>  XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
>



