Re: Regular expression functions (Was: Re: [xsl] comments on December F&O draft)

From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Wed, 9 Jan 2002 22:31:46 +0000

Hi Steven,

Very interesting :)

Could you explain a little more about how the matchers work? You call
them by name - does each of them search over the entire string, or do
later matchers only match on what's left after matching the earlier
ones? Did you try any other designs? What made you choose this one?

> One of the things which doesn't work well currently is the
> specification of the regex as an attribute to the <matcher> element.
> We will avoid this by putting the regex inside a CDATA section of a
> <regex> subelement (will be optional, we are testing this right
> now). Not sure whether this is good practice, advice welcome. It is
> only partially related to this discussion of course.

I can see why you'd want to do that, given that you're matching HTML
tags. Note that you're doing more escaping than you have to in the
attribute value, though. Consider:

<matcher
regex="CLASS=&quot;story3&quot;&gt;([^&lt;]+)&lt;BR&gt;&lt;/SPAN&gt;&lt;
/FONT&gt;&lt;/STRONG&gt;&lt;FONT\sCOLOR=&quot;#333333&quot;\sFACE=&quot;
sans-serif,\sarial&quot;&gt;&lt;SPAN\sCLASS=&quot;story&quot;&gt;([^&lt;
]+)&amp;nbsp;(.+)&lt;A\sHREF=&quot;([^&quot;]+)&quot;&gt;More"
name="items">

The greater-than signs don't have to be escaped in attribute values
(they only have to be escaped if they occur in the sequence ]]> in
element content). And you could avoid escaping double-quotes if you
delimited the attribute with single-quotes. So you could have:

<matcher
regex='CLASS="story3">([^&lt;]+)&lt;BR>&lt;/SPAN>&lt;/FONT>&lt;/STRONG>
&lt;FONT\sCOLOR="#333333"\sFACE="sans-serif,\sarial">&lt;SPAN\sCLASS="s
tory">([^&lt;]+)&amp;nbsp;(.+)&lt;A\sHREF="([^"]+)">More'
name="items">
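
As a quick check of the escaping point, here is a minimal sketch in Python (my choice of language for illustration; it is not part of the tool under discussion). It parses a shortened version of the regex above in both forms with the standard library's xml.etree.ElementTree and compares the attribute values the parser reports. The shortened pattern and the self-closing <matcher/> are assumptions made to keep the example small.

```python
import xml.etree.ElementTree as ET

# Fully escaped form in a double-quoted attribute
# (a shortened, hypothetical version of the regex above).
escaped = '<matcher regex="CLASS=&quot;story3&quot;&gt;([^&lt;]+)&lt;BR&gt;" name="items"/>'

# Minimal escaping: single-quoted attribute, only the less-than signs escaped.
minimal = """<matcher regex='CLASS="story3">([^&lt;]+)&lt;BR>' name="items"/>"""

regex_escaped = ET.fromstring(escaped).get("regex")
regex_minimal = ET.fromstring(minimal).get("regex")

print(regex_minimal)                   # CLASS="story3">([^<]+)<BR>
print(regex_escaped == regex_minimal)  # True

# A literal less-than sign in an attribute value is still not well-formed.
try:
    ET.fromstring('<matcher regex="[^<]+" name="items"/>')
    unescaped_lt_accepted = True
except ET.ParseError:
    unescaped_lt_accepted = False

print(unescaped_lt_accepted)           # False
```

Both forms give the application the same regex string; only the less-than signs (and ampersands) really need escaping in the single-quoted version.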

But I agree - if you've got regular expressions like this, it's best
to put them in an element where you can use CDATA sections to at least
make it look like the stuff you're matching.
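
For comparison, a small sketch of the element-content approach, again in Python with xml.etree.ElementTree. The <regex> subelement follows your description, but the exact layout and the shortened pattern here are my assumptions. Inside the CDATA section nothing needs escaping, so the regex reads like the HTML it matches.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout: the regex held in a CDATA section of a <regex> child.
doc = (
    '<matcher name="items">'
    '<regex><![CDATA[CLASS="story3">([^<]+)<BR></SPAN>]]></regex>'
    '</matcher>'
)

# The parser hands back the CDATA content as ordinary text.
regex_text = ET.fromstring(doc).findtext("regex")

print(regex_text)  # CLASS="story3">([^<]+)<BR></SPAN>
```

The one thing a CDATA section can't contain is the sequence ]]>, so a regex that includes it would have to be split across two CDATA sections or escaped in the usual way.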

For XSLT, I think that attributes are more natural because attributes
are used for this kind of thing elsewhere (matching nodes, for
instance). It would be handy if the regular expressions could be held
in (global) variables because then they could be defined in content
(with CDATA sections) rather than in an attribute. However, that would
run up against the dynamic regular expression problem that David and I
talked about yesterday. I don't think it'll be too big a problem,
though - the regular expressions in XSLT are likely to be a lot
smaller than these, and not include tags (hopefully!).

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list