RE: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)

Subject: RE: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)
From: "Chris Bayes" <chris@xxxxxxxxxxx>
Date: Mon, 14 Jan 2002 15:02:48 -0000
 
> 
> Chris,
> 
> > I've been a bit tied up with one thing and another (and I think you 
> > might have discussed this before) but aren't regex matches just 
> > predicates on text nodes, a la
> > <xsl:template match="text()['\(.*\)']">
> >         <x><xsl:apply-templates select=".[1]" /></x>
> > </xsl:template>
> > Which applies templates to whatever is not matched (child texts) (but
> > which matches the template).
> 
> Not all strings that you might deal with are text nodes, so I 
> think that you need to provide something that allows you to 
> match other strings as well. Indeed, your example above 
> demonstrates this - when you do .[1], then presumably you're 
> applying templates to the matched substring of the current 
> text node. I think that there are three possibilities:
> 
>   - assume that when you apply templates to a string, it's
>     automatically converted to a text node, and apply templates to
>     that
>   - open up normal templates so that they can match things other than
>     nodes

What is wrong with that? A template that matches text is pretty much the
end of the line anyway.

>   - introduce specific regexp templates
> 
> > So that template on a text node
> > "(a(b(c)d)e)" (assuming greedy) would produce
> > <x>
> >   a 
> >   <x>
> >     b
> >     <x>
> >      c
> >     </x>
> >     d
> >   </x>
> >   e
> > </x>
> 
> Unfortunately, assuming greedy, (a)(b) would produce:
> 
>   <x>a)(b</x>
> 
Yeah, but it doesn't have to be greedy.

<xsl:template match="\((.*?)\)(.*)">
	<x><xsl:apply-templates select=".[1]" /></x>
	<xsl:apply-templates select=".[2]" />
</xsl:template>
Or
<xsl:template match="\((.*?)\)">
	<x><xsl:apply-templates select=".[1]" /></x>
	<xsl:apply-templates select="$'" />
</xsl:template>
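
To make the greedy/reluctant difference concrete, here is a minimal sketch
in Python, with its re module standing in for whatever regexp flavour XSLT
might adopt. The wrap() helper is hypothetical: it only mimics the second
template above, assuming $' means the text after the match (as in Perl)
and that text before a match is copied through unchanged.

```python
import re

text = "(a)(b)"

# Greedy: .* runs on to the last ")" in the string.
greedy = re.search(r"\((.*)\)", text)
print(greedy.group(1))        # a)(b

# Reluctant: .*? stops at the first ")" that lets the match succeed.
lazy = re.search(r"\((.*?)\)", text)
print(lazy.group(1))          # a
print(text[lazy.end():])      # (b)   the text after the match, like $'


def wrap(s):
    """Hypothetical stand-in for the second template: wrap subexpression 1
    in <x>, then carry on with the text after the match."""
    m = re.search(r"\((.*?)\)", s)
    if m is None:
        return s              # no match: plain text is copied through
    before = s[:m.start()]    # assumption: text before the match is kept
    return before + "<x>" + wrap(m.group(1)) + "</x>" + wrap(s[m.end():])


print(wrap(text))             # <x>a</x><x>b</x>
```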

> which is probably not what you want. This is why I suggested 
> the bracket-balancing tokenize() function. For example, you'd have:
> 
>   <xsl:apply-regexp-templates select="'(a(b(c)(d))e)'" />
> 
> and then:
>   
> <xsl:regexp-template match="\((.*)\)">
>   <x>
>     <xsl:apply-regexp-templates
>       select="tokenize(current-match()[1], '\(', '\)')" />
>   </x>
> </xsl:regexp-template>
> 
> would give:
> 
>  <x>a<x>b<x>c</x><x>d</x></x>e</x>
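
A minimal Python sketch of the bracket-balancing idea quoted above, for
illustration only. tokenize_balanced() and render() are hypothetical names;
the proposed tokenize() has no settled definition, so the behaviour assumed
here (split a string into top-level plain runs and balanced groups) is a
guess at the intent rather than a specification.

```python
def tokenize_balanced(text, open_ch="(", close_ch=")"):
    """Split text into top-level tokens: plain runs and balanced groups."""
    tokens, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == open_ch:
            if depth == 0:
                if i > start:
                    tokens.append(text[start:i])   # plain run before the group
                start = i
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced closing bracket")
            if depth == 0:
                tokens.append(text[start:i + 1])   # one complete balanced group
                start = i + 1
    if depth != 0:
        raise ValueError("unbalanced opening bracket")
    if start < len(text):
        tokens.append(text[start:])                # trailing plain run
    return tokens


def render(text):
    """Wrap each balanced group in <x>...</x>, recursing into its contents."""
    out = []
    for tok in tokenize_balanced(text):
        if tok[0] == "(":
            out.append("<x>" + render(tok[1:-1]) + "</x>")
        else:
            out.append(tok)
    return "".join(out)


print(tokenize_balanced("a(b(c)(d))e"))   # ['a', '(b(c)(d))', 'e']
print(render("(a(b(c)(d))e)"))            # <x>a<x>b<x>c</x><x>d</x></x>e</x>
```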
> 
> > Maybe it's rubbish but it doesn't look too alien to me. What other 
> > useful predicates can you put on a text node?
> 
> Commonly, I'd guess:
> 
>   text()[1]
>   text()[normalize-space()]
>   text()[starts-with(., 'foo')]
>   text()[contains(., 'foo')]
> 
> The second one is the one that would clash with what you're 
> suggesting (where any string used as the predicate to a text 
> node acts as an implicit regexp test on the value of the text node).

Yeah, but they are integers or booleans, except the second, which would be
false for <x>a  b</x>. Hmmmm.
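
One way to read that clash, sketched in Python. This assumes the "implicit
regexp" rule would treat any string-valued predicate as a pattern tested
against the text node; normalize_space() is a hand-written approximation of
the XPath function, not a library call.

```python
import re


def normalize_space(s):
    """Approximate XPath normalize-space(): collapse whitespace runs, trim ends."""
    return re.sub(r"[ \t\r\n]+", " ", s).strip(" ")


value = "a  b"                     # text node of <x>a  b</x>, two spaces
result = normalize_space(value)    # "a b", one space

# XPath 1.0 reading: a string predicate is true when the string is non-empty.
as_boolean = len(result) > 0

# Assumed implicit-regexp reading: the string is a pattern tested on the node.
as_regexp = re.search(result, value) is not None

print(as_boolean)   # True
print(as_regexp)    # False
```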
> 
> But you could always have a test() function that does the 
> test explicitly instead:
> 
>   text()[test('\(.*\)')]
> 
> Or the other option is to have a special syntax to refer to a 
> regular expression, 

You mean like text()['regexp']?
Which can't be confused with text()[normalize-space()].

> or even to make regular expressions first 
> class objects.
> 
> > Surely it isn't going to clash with anything. There are nearly 1000 
> > pages of wd's to look at here so looking at it another way is there 
> > anything that says that . can't be a sequence and that I can't index
> > into it with .[x]?
> 
> . is defined as being the context item (or a singleton 
> sequence containing the context item, 

Which it would be for a node but for a regex it wouldn't be.

> depending on how you 
> want to view it), so logically .[2] should never return 
> anything. Currently, as in XPath 1.0, . is an abbreviated 
> step and cannot take any StepQualifiers (which includes predicates).
> 
> The way I (and I think David) was thinking, you'd use 
> current-match() or some other function to get information 
> about the subexpression matches when you were inside the 
> template. So perhaps:
> 
>   current-match()[x]
> 
> rather than .[x].

Well if you like typing ;-)
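
For comparison, a rough Python model of the two views. It assumes a
hypothetical current-match() that returns the subexpression matches as a
sequence indexed from 1, and it treats "." as a singleton sequence;
predicate() is an illustrative helper, not anything from the drafts.

```python
import re


def predicate(seq, position):
    """1-based positional predicate over a sequence, as in seq[position]."""
    return seq[position - 1:position] if position >= 1 else []


context = ["(a)(b)"]                       # "." viewed as a singleton sequence
m = re.match(r"\((.*?)\)(.*)", context[0])
current_match = list(m.groups())           # assumed current-match() contents

print(predicate(context, 2))               # []
print(predicate(current_match, 1))         # ['a']
print(predicate(current_match, 2))         # ['(b)']
```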

Ciao Chris

XML/XSL Portal
http://www.bayes.co.uk/xml


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

