Subject: Re: [xsl] Regex-Enabled XSLT is Possible -- Preliminary Results and Desiderata for future revisions of XSLT
From: Gunther Schadow <gunther@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 03 Dec 2002 12:30:59 -0500
Thanks for your interest. Yes, I am moving the whole approach over to XSLT/XPath and I'm almost done. The great thing is that you do in fact allow variables in xsl:template/@match patterns now. With that and the standard java.util.regex features of JDK 1.4, I could get rid of my regex wrapper around the ORO matcher entirely.
I still do not see a good way to fit the XSLT analyze-string instruction and the XPath regex functions into my scheme of things, and I'll think about this some more. So far I still believe I want to use a small stateful matcher, which java.util.regex.Matcher fortunately gives me.
Here is an example. I pair up match patterns and templates one for one. No two templates use the same matcher object -- that way I believe I'm safe w/r/t side-effects and parallelism.
<xsl:variable name="header-pattern" select="'^([fF]rom|[tT]o|[cC]c|[sS]ubject): (.*)\n'"/>
<xsl:variable name="header-matcher"
              xmlns:p="java:java.util.regex.Pattern"
              select="p:matcher(p:compile($header-pattern),'')"/>

<xsl:template xmlns:m="java:java.util.regex.Matcher"
              match="text()[m:looking-at(m:reset($header-matcher,.))]">
  <xsl:element name="{lower-case(m:group($header-matcher, 1))}" namespace="">
    <xsl:value-of select="m:group($header-matcher, 2)"/>
  </xsl:element>
  <xsl:variable name="rest">
    <xsl:value-of select="substring(.,m:end($header-matcher)+1)"/>
  </xsl:variable>
  <xsl:apply-templates select="$rest/text()"/>
</xsl:template>

<xsl:template match="text()">
  <rest>
    <xsl:value-of select="."/>
  </rest>
</xsl:template>
This example only uses one pattern-template pair. I have a few diffs to SAXON that I'll send to you under separate cover to make this possible. Basically they add Java's new CharSequence interface to the set of Java types considered for conversion. It's easy, but it also opens up some opportunities for performance improvement in string and text handling in general.
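For readers who want to see the same control flow outside a stylesheet, here is a rough plain-Java sketch of what the template pair does: reset one stateful matcher on the remaining text, test it with lookingAt(), emit the two groups, and carry on with whatever follows the match. The class name HeaderScan, the scan method and the "name=value" output format are made up for illustration; only the java.util.regex calls correspond to what the stylesheet invokes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Saxon or of the stylesheet above.
final class HeaderScan {
    // Same regular expression as $header-pattern.
    static final Pattern HEADER =
        Pattern.compile("^([fF]rom|[tT]o|[cC]c|[sS]ubject): (.*)\\n");

    /** Returns "name=value" for each leading header line, then "rest=..." for what is left. */
    static List<String> scan(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = HEADER.matcher("");      // one stateful matcher, like $header-matcher
        String rest = text;
        // reset(...) returns the matcher itself, so the calls chain as in the match pattern.
        while (m.reset(rest).lookingAt()) {
            out.add(m.group(1).toLowerCase() + "=" + m.group(2));
            rest = rest.substring(m.end());  // end() is 0-based, hence the +1 in XPath substring()
        }
        if (!rest.isEmpty()) {
            out.add("rest=" + rest);         // plays the role of the fallback text() template
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("From: a@example.org\nSubject: hello\n\nbody text\n"));
    }
}
```

The loop stands in for the recursive xsl:apply-templates call; each pass through it copies the remaining text with substring, which is where the string garbage in this approach comes from.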
> Interesting approach. Generally, creating nodes is expensive. It also
> requires a lot of specification work to sort out the detail, e.g. what
> is the parent of the node, what is its base URI, do you get a new text
> node each time or can the system reuse them? I think a mechanism based
> on strings (like xsl:analyze-string) is more flexible than one based
> on text nodes.
I share your concern. I am not comfortable with the amount of string garbage that my method probably produces right now. But that could be helped with some screwing under the hood :-)
- The first reason I construct text nodes is that I can't use xsl:apply-templates on a string or other atomic value. Why not? To me it would make sense to treat apply-templates on an atomic value as applying implicitly to a singleton sequence containing that value.
- The second reason I construct text nodes is to return an unparsed rest from a template (return from a "parse-down") or to feed it back into the recursion ("parse-along").
Text nodes would not have to be expensive at all, however. Here is where CharSequences come in. Instead of String, one should perhaps use CharSequence throughout. That way you would never copy the string data itself, all you'd do is pass along those little offset-length pairs. So, apart from object creation, this type of string handling would be quite cheap.
So, I'd say that if Saxon underpinned the XPath string and text data types with CharSequence-style offset-length pairs rather than copying java.lang.String data, there would be no big penalty for text node creation, and hence no changes would be necessary to the rules of what can and cannot be given to apply-templates.
Of course this assumes that you don't make changes to the string data, such as with some regex replace thing. Well, if you do, then you need a copy-on-write hook to then copy out the data block. For parsing, you don't need to modify text at all (I construct new text), so, I don't care too much how that's solved.
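To make the offset-length idea concrete, here is a minimal sketch of such a CharSequence view. The class name Slice is hypothetical and this is not how Saxon stores strings; it only shows that a "rest of the text" value can be a window onto shared data, and that java.util.regex accepts it directly because Pattern.matcher takes any CharSequence.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical read-only view: an offset-length pair over shared character data.
final class Slice implements CharSequence {
    private final CharSequence base;  // backing data, shared and never copied
    private final int offset;
    private final int length;

    Slice(CharSequence base, int offset, int length) {
        if (offset < 0 || length < 0 || offset + length > base.length()) {
            throw new IndexOutOfBoundsException();
        }
        this.base = base;
        this.offset = offset;
        this.length = length;
    }

    @Override public int length() { return length; }

    @Override public char charAt(int index) {
        if (index < 0 || index >= length) throw new IndexOutOfBoundsException();
        return base.charAt(offset + index);
    }

    @Override public CharSequence subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) throw new IndexOutOfBoundsException();
        // A slice of a slice still points at the original backing data.
        return new Slice(base, offset + start, end - start);
    }

    @Override public String toString() {
        // The only place where characters are actually copied out.
        return base.subSequence(offset, offset + length).toString();
    }

    public static void main(String[] args) {
        String data = "From: a\nbody";
        Matcher m = Pattern.compile("^From: (.*)\\n").matcher(data);
        if (m.lookingAt()) {
            CharSequence rest = new Slice(data, m.end(), data.length() - m.end());
            System.out.println(rest);  // the unparsed rest, as a view until printed
        }
    }
}
```

Because the view is read-only, the copy-on-write question does not come up here; anything that modifies text would have to copy the block out first.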
So, I believe that given that this string/text underpinning and the tail recursion of apply-templates are implementation issues, and given the match pattern variable fix, there is only one thing left that I'd need:
b) a mechanism to fail a template and try the next eligible template.
> This is a "could" in the XSLT 2.0 requirements list and we've just
> started reviewing whether to do anything about this, so any use cases
> will be welcome - send them please to public-qt-comments@xxxxxx
... well, now that I'm redoing the whole thing again, it looks like it could work without that. It is good to discuss these things with people.
What remains is the question how I could use xsl:analyze-string. I think I can't and here is why: analyze-string is basically a tokenizer and it can only match one thing against one pattern. It does not allow different actions for different patterns (the way AWK does) and it does not provide for a matching expression to decide which part of the string to consume based on decisions outside the regex pattern.
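To illustrate the AWK-like scheme I am contrasting with a tokenizer, here is a rough Java sketch with several pattern/action pairs. Each rule owns its own matcher, the first rule that matches at the current position gets to act, and an action may decline based on a decision outside the regex, in which case the next rule is tried. The names RuleParser, Rule, Action, KNOWN and demoRules are invented for the example and are not part of XSLT or Saxon.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical rule-driven scanner, for illustration only.
final class RuleParser {
    /** Returns the number of characters consumed, or -1 to decline so the next rule is tried. */
    interface Action { int apply(Matcher m, List<String> out); }

    static final class Rule {
        final Matcher matcher;   // one matcher per rule, never shared between rules
        final Action action;
        Rule(String regex, Action action) {
            this.matcher = Pattern.compile(regex).matcher("");
            this.action = action;
        }
    }

    static final List<String> KNOWN = Arrays.asList("from", "to", "cc", "subject");

    static List<Rule> demoRules() {
        return Arrays.asList(
            new Rule("([A-Za-z]+): (.*)\\n", (m, out) -> {
                String name = m.group(1).toLowerCase();
                if (!KNOWN.contains(name)) return -1;   // decision made outside the regex
                out.add("header:" + name + "=" + m.group(2));
                return m.end() - m.start();
            }),
            new Rule("\\n", (m, out) -> { out.add("blank"); return 1; }));
    }

    static List<String> parse(String text, List<Rule> rules) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        outer:
        while (pos < text.length()) {
            for (Rule r : rules) {
                Matcher m = r.matcher.reset(text).region(pos, text.length());
                if (m.lookingAt()) {
                    int consumed = r.action.apply(m, out);
                    if (consumed > 0) { pos += consumed; continue outer; }
                }
            }
            out.add("rest:" + text.substring(pos));     // no rule accepted: stop here
            break;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parse("From: a\n\nFoo: b\n", demoRules()));
    }
}
```

In the sample input the line "Foo: b" matches the header regex, but the action declines because the name is not in KNOWN, so the text falls through to the rest. That kind of decline is what the "fail a template and try the next" request would express in a stylesheet.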
So, thanks to your feedback, Michael, I will be able to boil my thing down and distill the real remaining issues.
> I would also add that general-purpose parsing (like, writing a COBOL
> compiler in XSLT) was not really the application we had in mind. The
> real test is whether the facilities are adequate to analyze the
> structure found in the text of typical data files. I've used them for
> "screen-scraping" data downloaded in HTML and found them quite
> workable, though it needed several passes.
thanks much, -Gunther
--
Gunther Schadow, M.D., Ph.D.                    gschadow@xxxxxxxxxxxxxxx
Medical Information Scientist      Regenstrief Institute for Health Care
Adjunct Assistant Professor        Indiana University School of Medicine
tel:1(317)630-7960                         http://aurora.regenstrief.org