RE: [xsl] Regex-Enabled XSLT is Possible -- Preliminary Results and Desiderata for future revisions of XSLT

Subject: RE: [xsl] Regex-Enabled XSLT is Possible -- Preliminary Results and Desiderata for future revisions of XSLT
From: "Michael Kay" <michael.h.kay@xxxxxxxxxxxx>
Date: Tue, 3 Dec 2002 10:42:44 -0000
Comments attached...
> 
> 
> EXECUTIVE SUMMARY
> 
> I start with my conclusions:
> 
> 1) regex-enabled templates are possible in XSLT 1.0 today (with
>     the use of Java extensions, as possible in Saxon or Xalan.)
> 
> 2) the features we need in future XSLT which would make this
>     more useful are:
> 
>     a) variables in xsl:template/@match patterns (which is
>        currently not allowed.)

They are allowed in XSLT 2.0.
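
As a rough illustration only (not part of the original exchange), a pattern
that refers to a variable might look like the sketch below in XSLT 2.0. The
parameter name wanted-type and the item/@type vocabulary are invented for
the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical global parameter; name and default value are assumptions. -->
  <xsl:param name="wanted-type" select="'note'"/>

  <!-- XSLT 2.0 accepts a reference to a global variable or parameter
       inside the predicate of a match pattern; XSLT 1.0 rejects it. -->
  <xsl:template match="item[@type = $wanted-type]">
    <selected><xsl:value-of select="."/></selected>
  </xsl:template>

</xsl:stylesheet>
```

Only global variables and parameters are in scope here, since the pattern is
evaluated outside any template body.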
> 
>     b) a mechanism to fail a template and try the next
>        eligible template. (This turns out to be the most
>        critical feature for making XSLT work for a reasonably
>        powerful "up-translator".)

This is a "could" in the XSLT 2.0 requirements list and we've just
started reviewing whether to do anything about this, so any use cases
will be welcome - send them please to public-qt-comments@xxxxxx
> 
>     c) extend the XSLT processing model with some tail recursion
>        elimination or add a built-in feature for tokenizing
>        text nodes. (May already be provided in Saxon, may be
>        just an implementation issue.)

My feeling is that tail-recursion is an implementation issue, though I
know that some FP languages essentially mandate that implementations
support it.

Saxon (incidentally) never optimizes tail recursion in an apply-templates
call; it does so only for call-template. There is no good reason for
this; I just never thought of doing it.
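
For readers unfamiliar with the distinction, here is a minimal XSLT 1.0
sketch of a named template whose recursive xsl:call-template is in tail
position. The template name "tokenize" and the token element are
assumptions made for the example, and whether a given processor actually
turns the call into a loop remains implementation-dependent.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical named template: emits one <token> per
       whitespace-separated word of $text. -->
  <xsl:template name="tokenize">
    <xsl:param name="text"/>
    <xsl:variable name="rest" select="normalize-space($text)"/>
    <xsl:if test="$rest != ''">
      <token>
        <!-- The appended space guarantees the last word is found too. -->
        <xsl:value-of select="substring-before(concat($rest, ' '), ' ')"/>
      </token>
      <!-- Nothing is evaluated after this call, so a processor is free
           to reuse the current stack frame instead of nesting deeper. -->
      <xsl:call-template name="tokenize">
        <xsl:with-param name="text" select="substring-after($rest, ' ')"/>
      </xsl:call-template>
    </xsl:if>
  </xsl:template>

  <!-- Entry point: hand each text node to the named template. -->
  <xsl:template match="text()">
    <xsl:call-template name="tokenize">
      <xsl:with-param name="text" select="."/>
    </xsl:call-template>
  </xsl:template>

</xsl:stylesheet>
```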

> 
> 3) the new xsl:analyze-string functions and the XPath regex
>     support that has been developed in parallel may not be a
>     sufficient substitute for the method I am describing here.
>
It would be interesting to see use cases that demonstrate what the
limitations are. 
> 
> OVERVIEW OF THE APPROACH
> 
<snip/>
> 
> The typical processing model for parsing a text node after 
> xsl:apply-templates with a text node selected is to match the 
> head of the text node to a regular expression, consume the 
> matching head and generate a new text node that is the 
> unmatched tail. The tail is then selected in a recursive 
> xsl:apply-templates statement.

Interesting approach. Generally, creating nodes is expensive. It also
requires a lot of specification work to sort out the detail, e.g. what
is the parent of the node, what is its base URI, do you get a new text
node each time or can the system reuse them? I think a mechanism based
on strings (like xsl:analyze-string) is more flexible than one based on
text nodes.
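
To make the string-based alternative concrete, here is a small sketch of
xsl:analyze-string, written against XSLT 2.0 as it was eventually
published; working-draft syntax may differ in detail. The number element
and the choice of regex are invented for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="text()">
    <!-- The input is treated as a string and split into matching and
         non-matching substrings; no intermediate text nodes are needed. -->
    <xsl:analyze-string select="." regex="[0-9]+">
      <xsl:matching-substring>
        <!-- Inside this branch, "." is the matched substring. -->
        <number><xsl:value-of select="."/></number>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <!-- Text between matches is copied through unchanged. -->
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```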
> 
> HOW DOES XSLT/XPath 2.0 REGEX SUPPORT HELP HERE?
> 
> On the surface, the new XPath regex support would obsolete
> the ORO-Matcher and my regex wrapper object. However, the
> three functions that my wrapper served were:
> 
>     - keep a symbol table of regexes to avoid recompiling them

I think that's something that an implementation can easily do behind the
scenes.
> 
>     - keep a regex with an internal state (caching the last match)
>       to avoid frequent re-matching of the same text or pieces
>       of it

I'm not at all sure that this fits well into the functional programming
model.
> 
>     - allow these regex objects to appear in xsl:template/@match
>       patterns
> 
> Particularly if you add the new xsl:analyze-string form into 
> the mix, the need for these kinds of things may be entirely gone.
> 
> But, I keep coming back to the analogy of xsl:template 
> matching to regex pattern matching. Having the matching rules 
> handled by real XSLT templates with regex in the @match 
> pattern is quite intuitive and much more generally useful 
> than the simple tokenization that happens in the 
> analyze-string form. The analyze-string form can only test a 
> single regex, but in text parsing you need to try many 
> patterns against the current head of the unparsed text.

Can't this be handled fairly intuitively by using the fn:matches()
function in conjunction with xsl:analyze-string?
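
One possible shape for that combination, offered only as a sketch under
assumptions: matches() picks which pattern applies to the head of the
remaining text, and xsl:analyze-string then splits off the matched head
and the tail. The template name "parse", the token classes and the output
elements are all invented for the example.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical named template that consumes $text from the left. -->
  <xsl:template name="parse">
    <xsl:param name="text"/>
    <xsl:choose>
      <!-- Nothing left to parse. -->
      <xsl:when test="$text = ''"/>
      <!-- First candidate pattern: a run of digits at the head. -->
      <xsl:when test="matches($text, '^[0-9]+')">
        <!-- flags="s" lets "." match newlines so group 2 is the whole tail. -->
        <xsl:analyze-string select="$text" regex="^([0-9]+)(.*)$" flags="s">
          <xsl:matching-substring>
            <number><xsl:value-of select="regex-group(1)"/></number>
            <xsl:call-template name="parse">
              <xsl:with-param name="text" select="regex-group(2)"/>
            </xsl:call-template>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:when>
      <!-- Second candidate pattern: a run of ASCII letters at the head. -->
      <xsl:when test="matches($text, '^[A-Za-z]+')">
        <xsl:analyze-string select="$text" regex="^([A-Za-z]+)(.*)$" flags="s">
          <xsl:matching-substring>
            <word><xsl:value-of select="regex-group(1)"/></word>
            <xsl:call-template name="parse">
              <xsl:with-param name="text" select="regex-group(2)"/>
            </xsl:call-template>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:when>
      <!-- Fallback: pass one character through and continue with the rest. -->
      <xsl:otherwise>
        <xsl:value-of select="substring($text, 1, 1)"/>
        <xsl:call-template name="parse">
          <xsl:with-param name="text" select="substring($text, 2)"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Entry point: parse the string value of each text node. -->
  <xsl:template match="text()">
    <xsl:call-template name="parse">
      <xsl:with-param name="text" select="string(.)"/>
    </xsl:call-template>
  </xsl:template>

</xsl:stylesheet>
```

The order of the xsl:when branches plays the role that template priority
would play in the template-based design, and each regex requires at least
one character so that it can never match a zero-length string.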

What I think would be really useful is if you wrote up your example use
case using the XSLT 2.0 / XPath 2.0 facilities, so that we could see
where the difficulties really are. At present, your note reads as if you
have decided on one design approach, and you are not really prepared to
consider reworking it to use the XSLT 2.0 constructs as they were
designed to be used.

I would also add that general-purpose parsing (like, writing a COBOL
compiler in XSLT) was not really the application we had in mind. The
real test is whether the facilities are adequate to analyze the
structure found in the text of typical data files. I've used them for
"screen-scraping" data downloaded in HTML and found them quite workable,
though it needed several passes.

Michael Kay
Software AG
home: Michael.H.Kay@xxxxxxxxxxxx
work: Michael.Kay@xxxxxxxxxxxxxx 


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

