Re: [xsl] Re: A question about the expressive power and limitations of XPath 2.0

Subject: Re: [xsl] Re: A question about the expressive power and limitations of XPath 2.0
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Sun, 13 Jan 2002 15:14:07 +0000

Hi David,

> I think that there are three separate problems that might be addressed:
>
> 1) deficiencies in the regular expression syntax/semantics.
>    This may or may not include lack of ^ and $ to match start and end of
>    expression or perl style {2} repeat clauses. (Mainly it's hard to
>    know what's there now as the text is a bit underspecified, hence my
>    "overlapping regexp" question) 

There are Perl-style {2} repeat clauses in the XML Schema regular
expression language (http://www.w3.org/TR/xmlschema-2/#regexs), which
I think is what XPath 2.0 will be using (and which is why I've been
escaping my {s).

In the suggestion that I made, you wouldn't really need ^ and $,
actually, because the static regexps always test the *entire string*
(there's an implicit ^ and $) and if you use the tokenize() function
you can always test whether the first string is '' (in which case it
starts with your regular expression) and/or if there are only two
items in your list (in which case it ends with the regular
expression).

However, I suspect that test(), match() and replace() functions will
still be specified, and those do need ^ and $ to make them useful, I
think.
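
To make the difference concrete, here is a rough sketch in Python
rather than XPath, since the functions discussed here are only
proposals. The helper names (matches_whole, pieces, starts_with,
ends_with) are made up for illustration. re.fullmatch() stands in for
the implicitly anchored test, and re.split() with a capturing group
stands in for a tokenize() that keeps the matched pieces. One
difference to note: Python returns a trailing '' when the string ends
with a match, so the "ends with" check looks at the last piece instead
of counting items. The sketch assumes the pattern has no capturing
groups of its own and never matches the empty string.

```python
import re

def matches_whole(pattern, s):
    # Anchored at both ends, like an implicit ^ and $.
    return re.fullmatch(pattern, s) is not None

def pieces(pattern, s):
    # Wrapping the pattern in a group makes re.split keep the matches,
    # so unmatched and matched pieces alternate in the result.
    return re.split(f"({pattern})", s)

def starts_with(pattern, s):
    parts = pieces(pattern, s)
    # More than one piece means there was a match; an empty first
    # piece means that match sits at the very start of the string.
    return len(parts) > 1 and parts[0] == ""

def ends_with(pattern, s):
    parts = pieces(pattern, s)
    # Python leaves a trailing '' when the last match ends the string.
    return len(parts) > 1 and parts[-1] == ""
```

An unanchored search such as re.search('a{2}', 'aaa') succeeds where
matches_whole('a{2}', 'aaa') does not, which is the case where explicit
^ and $ start to matter.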

> 2) Possibilities for doing tree generation as well as string generation
>    once the match is found. (Note this is purely an XSLT construction
>    issue it doesn't affect the languages you accept, only what you can
>    do with them). This is where I came in with the regexp matching
>    template mechanism, and you've extended in various ways with named
>    subexpression possibilities.

Yah. I think the named subexpressions are overkill :) I like the stuff
that I wrote this morning a lot better.

But the current-match() function could still give a tree
representation of the match using rxp:match or whatever elements, as I
suggested in a message to Marc recently.

I don't know whether it's worth it - I kinda like the tree access 'cos
it's easy to address and process trees, it'd be nicely expandable into
named subexpressions some day, and I think it helps with regular
expressions where there are lots of brackets that you really don't
care about. On the other hand, it's a departure from what you get in
other environments, so people used to Perl/emacs/sed or whatever might
not like it.

What do you think?

We would also have to address the problem Marc pointed out about how
repeated subexpressions are captured...

> 3) possibilities for accepting non regular languages in input strings.
>    three examples given so far in this thread, nested {} pairs,
>    html nested elements tag syntax, the classic non regular example
>    of a string consisting of a and b with as many a as b.

I talked about the first two in what I wrote this morning. For the
last one, I think you could use tokenize() to split the string up in
two different ways: once on a and once on b. In each result, keep only
the even-positioned items (the matched a's or b's) and drop the
odd-positioned ones (the text in between), and then compare the
lengths of the two sequences:

  tokenize('abbaab', 'a')
    => ('', 'a', 'bb', 'a', '', 'a', 'b')

  ('', 'a', 'bb', 'a', '', 'a', 'b')[position() mod 2 = 0]
    => ('a', 'a', 'a')
    
  tokenize('abbaab', 'b')
    => ('a', 'b', '', 'b', 'aa', 'b')

  ('a', 'b', '', 'b', 'aa', 'b')[position() mod 2 = 0]
    => ('b', 'b', 'b')

  count(('a', 'a', 'a')) = count(('b', 'b', 'b'))
    => true

I think that's a reasonable series of hoops to go through - the full
expression is only:

  count(tokenize($string, 'a')[position() mod 2 = 0]) =
  count(tokenize($string, 'b')[position() mod 2 = 0])

It'd be nice to have even() and odd() functions :)
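
For anyone who wants to try the idea out, here is a minimal sketch in
Python rather than XPath, since tokenize() as used above is only a
proposal. It assumes that re.split() with a capturing group is a fair
stand-in for a tokenize() that returns the matched pieces as well as
the text between them. The names tokenize, even_items and same_count
are just illustrative.

```python
import re

def tokenize(s, pattern):
    # The capturing group makes re.split keep each match, so the result
    # alternates: unmatched text, match, unmatched text, match, ...
    return re.split(f"({pattern})", s)

def even_items(seq):
    # Items at even 1-based positions (2, 4, 6, ...), i.e. the matches.
    return seq[1::2]

def same_count(s):
    # True when the string has as many a's as b's.
    return (len(even_items(tokenize(s, "a"))) ==
            len(even_items(tokenize(s, "b"))))
```

With this stand-in, same_count('abbaab') is true and same_count('aab')
is false. Python adds a trailing '' when the string ends with a match,
but that lands in an odd position, so the counts are unaffected.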

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

