Re: [xsl] Regex-Enabled XSLT is Possible -- Preliminary Results and Desiderata for future revisions of XSLT

Subject: Re: [xsl] Regex-Enabled XSLT is Possible -- Preliminary Results and Desiderata for future revisions of XSLT
From: Gunther Schadow <gunther@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 03 Dec 2002 12:30:59 -0500
Hi Michael,

thanks for your interest. Yes, I am moving the whole approach over
to XSLT/XPath and I'm almost done. Great thing is that you do in
fact allow variables in xsl:template/@match patterns now, with
that and the JDK 1.4 standard java.util.regex features I could get
rid of my regex wrapper around ORO-matcher entirely.

I still do not see how the XSLT analyze-string and the XPath
regex functions fit into my scheme of things, and I'll think
about this some more. So far I still believe I want to use a
little stateful matcher, which java.util.regex.Matcher
fortunately gives me.
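
A minimal sketch of what "stateful" means here, written as plain
Java rather than XSLT. The class and method names (HeaderScan, scan)
are made up for illustration; only the java.util.regex calls are
real. One Matcher is reset onto the remaining input each time, and
after a successful lookingAt() it remembers the groups and the end
offset of the match:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: class and method names are invented for this sketch.
class HeaderScan {
    static final Pattern HEADER =
        Pattern.compile("^([fF]rom|[tT]o|[cC]c|[sS]ubject): (.*)\\n");

    // Returns "name=value" entries for the leading header lines,
    // followed by one "rest=..." entry for whatever was not consumed.
    static List<String> scan(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = HEADER.matcher("");   // one matcher, reused via reset()
        CharSequence rest = text;
        // lookingAt() only succeeds if the pattern matches at the start.
        while (m.reset(rest).lookingAt()) {
            out.add(m.group(1).toLowerCase() + "=" + m.group(2));
            // end() is a 0-based offset just past the match.
            rest = rest.subSequence(m.end(), rest.length());
        }
        out.add("rest=" + rest);
        return out;
    }
}
```

Because the matcher carries state between the test and the use of
the groups, it must not be shared between two rules that could be
evaluated at the same time.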

I pair up match patterns and templates one for one. No two
templates use the same matcher object -- that way I believe
I'm safe w/r/t side-effects and parallelism.

Here is a little example that parses email headers:

  <xsl:variable name="header-pattern"
    select="'^([fF]rom|[tT]o|[cC]c|[sS]ubject): (.*)\n'"/>

  <xsl:variable name="header-matcher"
    xmlns:p="java:java.util.regex.Pattern"
    select="p:matcher(p:compile($header-pattern),'')"/>

  <xsl:template xmlns:m="java:java.util.regex.Matcher"
      match="text()[m:looking-at(m:reset($header-matcher,.))]">
    <xsl:element name="{lower-case(m:group($header-matcher, 1))}"
         namespace="">
      <xsl:value-of select="m:group($header-matcher, 2)"/>
    </xsl:element>
    <xsl:variable name="rest">
      <xsl:value-of select="substring(.,m:end($header-matcher)+1)"/>
    </xsl:variable>
    <xsl:apply-templates select="$rest/text()"/>
  </xsl:template>

  <xsl:template match="text()">
    <rest>
      <xsl:value-of select="."/>
    </rest>
  </xsl:template>

It only uses one pattern-template pair. I have a few diffs to
SAXON that I'll send to you under separate cover to make this
possible. Basically they add the new CharSequence of Java into
the set of Java types considered for conversion. It's easy,
but it raises some opportunities for performance improvement
with string and text handling in general.

That ties in with your issue with text nodes:

> Interesting approach. Generally, creating nodes is expensive. It also
> requires a lot of specification work to sort out the detail, e.g. what
> is the parent of the node, what is its base URI, do you get a new text
> node each time or can the system reuse them? I think a mechanism based
> on strings (like xsl:analyze-string) is more flexible than one based
> on text nodes.

I share your concern. I am not comfortable with the amount of
string garbage that my method probably produces right now. But
that could be helped with some screwing under the hood :-)

Here are some ideas:

- the first reason why I construct text nodes is because I can't
  xsl:apply-templates on a string or other atomic data type. Why?
  To me it would make sense to consider apply-templates on an
  atom as implicitly on a singleton sequence of those atoms.

- the second reason why I construct text nodes is to return an
  unparsed rest from a template (return from a "parse-down")
  or to feed back into the recursion ("parse-along").

Text nodes would not have to be expensive at all, however. Here
is where CharSequences come in. Instead of String, one should
perhaps use CharSequence throughout. That way you would never
copy the string data itself, all you'd do is pass along those
little offset-length pairs. So, apart from object creation, this
type of string handling would be quite cheap.

So, I'd say that if Saxon underpinned the XPath string and text
data types with CharSequence-style offset-length pairs rather
than copying java.lang.String data, there would be no big penalty
in text node creation and hence no changes would be necessary to
the rules of what can and cannot be given to apply-templates.
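
To make the offset-length idea concrete, here is a rough sketch of
such a CharSequence. "Slice" and "sharesDataWith" are invented names,
not Saxon or JDK classes, and this is only one way it could be done.
subSequence() creates a new small object but never copies character
data; a copy happens only when toString() is called:

```java
// Illustrative only: "Slice" is a made-up name, not a Saxon or JDK class.
final class Slice implements CharSequence {
    private final char[] data;   // shared between all views, never modified
    private final int offset;
    private final int length;

    Slice(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private Slice(char[] data, int offset, int length) {
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    @Override public int length() {
        return length;
    }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return data[offset + index];
    }

    // A new view over the same array: only an offset-length pair is created.
    @Override public Slice subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException(start + ".." + end);
        }
        return new Slice(data, offset + start, end - start);
    }

    // Characters are copied only when a real String is finally needed.
    @Override public String toString() {
        return new String(data, offset, length);
    }

    // Helper for the sketch: true if both views sit on the same array.
    boolean sharesDataWith(Slice other) {
        return data == other.data;
    }
}
```

Since Pattern.matcher() and Matcher.reset() both accept any
CharSequence, such a view can be handed to the regex engine directly.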

Of course this assumes that you don't make changes to the string
data, such as with some regex replace thing. Well, if you do,
then you need a copy-on-write hook to then copy out the data
block. For parsing, you don't need to modify text at all (I
construct new text), so, I don't care too much how that's
solved.


So, I believe that given that this string/text underpinning and
the tail recursion of apply-templates are implementation issues,
and given the match pattern variable fix, there is only one
thing left that I'd need:


>> b) a mechanism to fail a template and try the next
>> eligible template.


> This is a "could" in the XSLT 2.0 requirements list and we've just
> started reviewing whether to do anything about this, so any use cases
> will be welcome - send them please to public-qt-comments@xxxxxx


... well, now that I'm redoing the whole thing again, it looks like
it could work without that. It is good to discuss these things
with people.


What remains is the question of how I could use
xsl:analyze-string. I think I can't, and here is why:
analyze-string is basically a tokenizer and it can only match
one thing against one pattern. It does not allow different
actions for different patterns (the way AWK does), and it does
not let a matching expression decide which part of the string
to consume based on decisions made outside the regex pattern.
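
For comparison, the AWK-like behaviour I have in mind can be
sketched in plain Java as follows. Everything here is hypothetical
(the RuleScan class, the two sample rules), and the "actions" are
reduced to labels to keep it short. Note that Matcher.region() is
newer than JDK 1.4. Each rule is tried at the current position, the
first one that matches consumes its input, and if none applies the
unparsed rest is handed back:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: names and the two sample rules are invented.
class RuleScan {
    // Rules are tried in insertion order; the first whose pattern matches
    // at the current position wins and decides how much input is consumed.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("header", Pattern.compile("([A-Za-z-]+): (.*)\\n"));
        RULES.put("blank", Pattern.compile("\\n"));
    }

    static List<String> scan(CharSequence input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            boolean matched = false;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Require progress so an empty match cannot loop forever.
                if (m.lookingAt() && m.end() > pos) {
                    out.add(rule.getKey() + ":" + m.group().trim());
                    pos = m.end();
                    matched = true;
                    break;
                }
            }
            if (!matched) {
                // No rule applies: hand back the unparsed rest.
                out.add("rest:" + input.subSequence(pos, input.length()));
                break;
            }
        }
        return out;
    }
}
```

In the XSLT version the loop over rules is what template matching
does for me, and the "rest" is what I pass back as a text node.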

So, thanks to your feedback, Michael, I will be able to boil
my thing down and distill the real remaining issues.

Finally you say:

> I would also add that general-purpose parsing (like, writing a COBOL
> compiler in XSLT) was not really the application we had in mind. The
> real test is whether the facilities are adequate to analyze the
> structure found in the text of typical data files. I've used them for
> "screen-scraping" data downloaded in HTML and found them quite workable,
> though it needed several passes.

I agree to an extent. Parsing a truly formal language I would do
differently. I'm sure XSLT will suit that purpose just fine, but
with a formal language I can use pure pushdown automata with a
formally specified grammar and a compiler into
state-transition-action tables and all that.


I have done screen scraping with YACC in the past and found
that, while I could make it work, it is not something you can
hand to people who work more at the IT maintenance level. They
need something that makes what they do simpler.

We have dozens of text report types from all sorts of places.
Some of these reports change overnight, some are never really
structurally controlled. This requires an approach where a
"grammar" can be specified fairly simply by people who cannot
speak BNF and who could not write even a simple parser
themselves. (I'm always surprised how little IT people know
about parsers -- which, I might point out, is not a reason to
be rude though.)


thanks much,
-Gunther


--
Gunther Schadow, M.D., Ph.D.                    gschadow@xxxxxxxxxxxxxxx
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org



XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list

