Subject: [xsl] Regex-Enabled XSLT is Possible -- Preliminary Results and Desiderata for future revisions of XSLT From: Gunther Schadow <gunther@xxxxxxxxxxxxxxxxxxxxxx> Date: Mon, 02 Dec 2002 14:04:11 -0500 |
Then I was actually trying to publish this somewhere, and I picked Dr. Dobbs, and that was a mistake. I held the discussion away from the list worried about prior publication issues, but the whole DDJ thing was a complete waste of time. The editors seem to be very disorganized, I never received a response, when I called the editor on the phone he said he was quite interested and would respond in a few days and that was the last thing I heard in months. Enough of that!
Now I am converting my stuff to XSLT 2.0 / XPath 2.0 and found that there had been some progress made on the regex front. So, may be it is too late now, but please consider the following description on what I did. It's a bit long, but it isn't an easy thing. Nevertheless XML up-translation keeps being an important issue that deserves the discussion. The good news is that XSLT is so very cool that one can do these things very nicely.
1) regex-enabled templates are possible in XSLT 1.0 today (with the use of Java extensions, as possible in Saxon or Xalan.)
2) the features we need in future XSLT which would make this more useful are:
a) variables in xsl:template/@match patterns (which is currently not allowed.)
b) a meachanism to fail a template and try the next eligible template. (This turns out to be the most critical feature for making XSLT work for a reasonably powerful "up-translator".)
c) extend the XSLT processing model with some tail recursion elimination or add a built-in feature for tokenizing text nodes. (May already be provided in Saxon, may be just an implementation issue.)
3) the new xsl:analyze-string funcitions and the XPath regex support that has been developed in parallel may not be a sufficient substitute for the method I am describing here.
In XSLT 1.0, I had successfully devised a method to do some powerful regular expression based parsing of free text in XSLT using an XSLT processor that allows Java extension function to be called (e.g., Saxon). I created a very simple wrapper around the Apache/jakarta ORO regular expression package, imported this wrapper and directly some ORO classes into XSLT. Given this modest prep-work I can use regexes in xsl:template/ @match patterns. Once XSLT selects a template, in the body of the template I can refer to parenthesis-subexpressions (aka "groups") without having to call the regex matcher twice.
The typical processing model for parsing a text node after xsl:apply-templates with a text node selected is to match the head of the text node to a regular expression, consume the matching head and generate a new text node that is the unmatched tail. The tail is then selected in a recursive xsl:apply-template statement.
Here I show actual code examples. Notice that all code presented here is Copyright (c) 2002 Regenstrief Institute. All rights reserved. And I assert the GNU Public License for this code.
I start my XSL transforms with the following header, notice the xmlns:regex and xmlns:match elements.
<xsl:transform version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:regex="java:regex" xmlns:match="java:org.apache.oro.text.regex.MatchResult" xmlns:saxon="http://icl.com/saxon" xmlns:exsl="http://exslt.org/common" extension-element-prefixes="saxon exsl" exclude-result-prefixes="regex match saxon">
Appart from the ease of integrating Java extensions in saxon, I only need the saxon extension element node-set because the exsl:node-set does not allow strings to be turned into text nodes whereas saxon:node-set does.
[This node-set issue was necessary in XSLT 1.0 but is now *largely* irrelevant in XSLT 1.1 and XSLT 2.0. However, there is still sometimes the need to convert a string into a text node, which I don't know right now how to do.]
The match extension directly uses the jakarta ORO class MatchResult. The regex extension links to a simple wrapper whose purpose is to:
- Hide the prep-work for setting up the ORO matcher from XSLT to express a simple API to the XSLT.
- Provide a hash map of regex pattern names and objects that contain a regex pattern string, and ORO Pattern object, and the last MatchResult.
Since XSLT doesn't allow variables in xsl:template/@match patterns, I need to refer to templates symbolically. My template-wrapper has a symbol-table mapping regex-name to a compiled regex wrapper object. These regex objects are stateful and remember their last match.
<xsl:template mode="..." match="text()[regex:match-by-name('my-pattern',.)]">
the function regex:match-by-name(pattern-name, string) refers to a regex object that has previously been set up as follows:
<xsl:variable name="my-pattern" select="regex:new('my-pattern', '^...')"/>
where '^...' is the regex pattern string accoring to ORO's Perl5 regex (the most powerful ones.) If the object is to parse the whole text node, I usually want the pattern anchored at the start of the string.
Notice that I use the same name for the XSLT variable of that regex object and for the name of the regex (managed in my little wrapper's regex symbol table), just to keep things understandible.
I have to use that stored name in the regex:match-by-name function because XSLT template match patterns cannot have variables in them. Otherwise I might have preferred to start the template like this:
<xsl:template mode="..." match="text()[regex:match($my-pattern,.)]">
I still want to create these regex objects that store the regex pattern and the last match, in order to avoid double execution of the matcher when we get into the template body.
Once I have a match, in the template-body I usually want to extract sub-strings matched by the regex patterns. To reduce clutter, the first thing I do is put an XSLT variable handle on the match:
<xsl:variable name="this-match" select="regex:get-match($my-pattern)"/>
The regex:get-match function returns an ORO MatchResult object that we prefix with 'match:' here. So, I can now use the whole matching text as an XSLT text node by:
<xsl:variable name="this-text" select="saxon:node-set(match:group($this-match, 0))"/>
<xsl:variable name="group-1" select="saxon:node-set(match:group($this-match, 1))"/> <xsl:variable name="group-2" select="saxon:node-set(match:group($this-match, 2))"/> ... <xsl:variable name="group-n" select="saxon:node-set(match:group($this-match, $n))"/>
Of course I give names to the group variables that are more descriptive about my interpretation of this text. Those variables can then be conveniently used to construct the XSL structure describing the parsed thing.
After one match is found, we typically want to continue parsing the rest. It would be kind-of neat if XSLT has some primitive that would allow me to send the unmatched rest of the text node back to the main apply-template loop. However, that would simply be an optimization of tail recursion, and tail-recursion I can do today in XSLT (and Saxon does that at acceptable speed and reliability even for longer input text.)
I can do tail recursion by including the following near the end of the xsl:template body:
<xsl:apply-templates mode="..." select="saxon:node-set(substring(., match:end-offset($this-match,0)+1)"/>
as you can see, I extract the trailing substring that was not matched by the regex from the text-node, convert this trailing substring to an text-node again, and then I apply the same regex- enabled templates on that unparsed rest.
If all I had was some record-per-line flat data file, like a "csv" dump from your favorite database or spreadsheet app, this is all I would need. I could even eliminate tail recursion by using some input filter that splits a text node per line.
Many up-translation cases however are more complex and require some more sophisticated parser. One accepted model of a parser is that of a "Push-down Automaton", i.e., an automaton described by a state-transition diagram but with the ability to push part of he parse on the stack and continue parsing of substructures.
Another model of a parser is a top-down recursive descent parser, where you try matching your start symbol and in order to get that the parser will parse it's pieces, and the pieces of the pieces and so forth.
I guess am using some hybrid form between those two and I heavily rely on XSLT modes to describe the state of my parser at every given point in the execution. This works well for hierarchical data, such as can be found in traditional EDI encodings or some texts with itemized lists. There I have one mode per level in the hierarchy. For example, I may have a batch of reports where each report has a subject and a list of events and each events has a list of details. Then I have the modes: report-batch, report, and event (every non-terminal symbol becomes a mode.)
REPORT SUBJECT EVENT DETAIL DETAIL ... EVENT DETAIL DETAIL ... ... ...
Now, each template will use the apply-template trick on the unmatched text (the $rest) at least two times:
- when descending into the deep-structure of the text (I call that "parse-down")
- when doing the tail recursion as shown above (I call that "parse-along")
Every regex-based template will first do the parse-down on its own unmatched text, and then does the parse-along on the rest of the parse-down. In order to get to the rest we will have to catch the output of the parse-down in a variable as a result tree fragment (RTF) and then convert the RTF to a node-set, make a copy-of everything but the last text node and then do parse-along on that last text node.
<xsl:template mode="report-batch" match="text()[regex:match-by-name($report-header,.)]">
<xsl:variable name="this-text" select="saxon:node-set(match:group($this-match, 0))"/>
<xsl:variable name="parse-down-RTF"> <xsl:apply-template mode="report" select="saxon:node-set(substring(., match:end-offset($this-match,0)+1)"/> </xsl:variable> <xsl:variable name="parse-down" select="exsl:node-set(parse-down-RTF)"/>
<xsl:variable name="report-body" select="$parse-down/node()[not(position()=last())]/>
<!-- parse along, using same mode as ours --> <xsl:apply-templates mode="report-batch" select="$parse-down/text()[position()=last()]/>
I also define a default template for each mode that simply returns the text node:
<xsl:template mode="report-batch" match="text()"> <xsl:copy-of select="."/> </xsl:template>
This will always put the yet unparsed rest at the end of the current result tree fragment. When the down-parse terminates because no more matching patterns can be found at the beginning of the current text node, that remaining unparsed text node at the end is still used for the parse- along phase. If that fails it is handed up the parse-down stack and so forth.
It is O.K. for a parse template to return a text node, as long as there will be a final node at the end that can be passed into the parse-along chain.
Now, as you see, one can do quite a bit of text parsing using some of these patterns I described here. The nice thing about XSLT is that it is only natural to add some higer level abstractions to the XSLT language as extensions and then use an XSLT transform to generate excutable XSLT from that more abstract definition of the parser. I can go into more detail of this if you want, but my abstraction is at this point too prototypical and the real problems make it produce quite ugly XSLT code. So I want to talk about the problems first.
A typical free text parser often requires adding some semantic checking components in the rule that decide what symbol we currently have. In other words, the decision of whether a template matches is not just based on the mode and the regex pattern, but also on other circumstances. For example, in a semi-structured report, one may have to use the indention as an indication for whether we have to descend into the deep structure or if the next text stands for an element on the same or a higher level of the logical structure of the input text. For instance if we want to parse a list like this:
PROCARIONTS BACTERIAE EUCARIONTS PROTOZOA ALVEOLATES METAZOA ANNELIDA NEMATODA ARTHROPODA CHORDATA MAMMALIA FELIDES CANINES APES
we use the indention level to decide whether to parse down or to parse along or return to the next higher level. So, we would like a rule like this
<xsl:variable name="item-pattern" select="regex:new('item-pattern', "^\n?( *)([A-Z]*( [A-Z])*)(?=\n)"/>
<xsl:template mode="item" match="text()[regex:match-by-name('item-pattern',.)]">
<xsl:variable name="this-match" select="regex:get-match($item-pattern)"/> <xsl:variable name="this-indent" select="saxon:node-set(match:group($this-match, 1))"/> <xsl:variable name="item-name" select="saxon:node-set(match:group($this-match, 2))"/>
<!-- should process this item at the current level? --> <xsl:if test="string-length($this-indent) <=string-length($parent-indent)"> <!-- no, so try another template --> <xsl:next-match/> <!-- but there is no xsl:next-match instruction! --> </xsl:if>
<!-- parse down in the next lower level --> <xsl:variable name="parse-down-RTF"> <xsl:apply-template mode="item" select="saxon:node-set(substring(., match:end-offset($this-match,0)+1)"/> <xsl:with-param name="parent-indent" select="$this-indent"/> </xsl:apply-template> </xsl:variable> <xsl:variable name="parse-down" select="exsl:node-set(parse-down-RTF)"/>
<!-- generate result element --> <ITEM NAME="$item-name"> <xsl:copy-of select="$parse-down/node()[position()=1 or not(position()=last())]/> </ITEM>
<!-- parse along in the current level --> <xsl:apply-templates mode="item" select="$parse-down/text()[not(position()=1) and position()=last()]> <xsl:with-param name="parent-indent" select="$parent-indent"/> </xsl:apply-templates> </xsl:template>
The "xsl:next-match" element would have aborted the further processing of this template and try for another matching template (e.g. a catch-all that returns nothing.) This would cause the parent's parse-down apply-template statement to return, and the parent would finish the current item and then parse along. If that would fail the control would return to the next higher parent, etc.
The problem is while an element "xsl:next-match" is proposed for XSLT 2.0 as part of the requirements document, it doesn't currently exists and is not defined in the XSLT 2.0 working draft. So, this method does not work.
As a work-around, we can try moving as many of the checks into the template match pattern. This, however, becomes very messy (as in this case we would have to combine all the definitions of the $this-match and the $indent into a complicated nested call of extension funcitons.) While this could be done, especially with an XSLT-based compiler compiler, it cannot possibly work because XSLT does not allow any XSLT variables in xsl:template/@match patterns, which we would need to compare to the passed parameter $parent-indent. This is the critical point where the whole approach fails with current XSLT. We will see a feature like <xsl:next-match/> in the near future on XSLT 2.0 or 2.1 working draft and draft implementations (particularly in Saxon) will probably come up shortly. However, until then, we will have to change our strategy just to make it work now.
Instead of using XSLT templates per each regex pattern, I consolidate all regex-based templates into one template per mode that starts as follows:
so, we work on any text node. The next step is to consolidate all parameters that would be passed between the templates into one place. (This can be a hairy issue because we need to avoid name clashes.) The rest of the template is divided in two phases, (1) the regex matching and template selection phase and (2) an xsl:choose construct where each of the consolidated template bodies is one alternative. For example, we would rewrite the above list/item example as follows:
<xsl:template mode="item" match="text()"> <xsl:param name="parent-indent" select="-1"/>
<xsl:variable name="eligible-branches-RTF"> <xsl:if test="regex:match-by-name('item-pattern',.)"> <xsl:variable name="this-match" select="regex:get-match($item-pattern)"/> <xsl:variable name="this-indent" select="saxon:node-set(match:group($this-match, 1))"/>
<!-- check the conditions and, if positive, output an XML element of the name of this branch --> <xsl:if test="string-length($this-indent) >string-length($parent-indent)"> <item/> </xsl:if> </xsl:if> <!-- include other alternative templates in this mode --> </xsl:variable> <xsl:variable name="eligible-branches" select="exsl:node-set($eligible-branches-RTF)"/>
<xsl:choose> <xsl:when test="$eligible-branches/item"> <!-- the following is what was the template body only without the parameter declarations -->
<xsl:variable name="this-match" select="regex:get-match($item-pattern)"/> <xsl:variable name="this-indent" select="saxon:node-set(match:group($this-match, 1))"/> <xsl:variable name="item-name" select="saxon:node-set(match:group($this-match, 2))"/>
<!-- parse down in the next lower level --> <xsl:variable name="parse-down-RTF"> <xsl:apply-template mode="item" select="saxon:node-set(substring(., match:end-offset($this-match,0)+1)"/> <xsl:with-param name="parent-indent" select="$this-indent"/> </xsl:apply-template> </xsl:variable> <xsl:variable name="parse-down" select="exsl:node-set(parse-down-RTF)"/>
<!-- generate result element --> <ITEM NAME="$item-name"> <xsl:copy-of select="$parse-down/node()[position()=1 or not(position()=last())]/> </ITEM>
<!-- parse along in the current level --> <xsl:apply-templates mode="item" select="$parse-down/text()[not(position()=1) and position()=last()]> <xsl:with-param name="parent-indent" select="$parent-indent"/> </xsl:apply-templates> </xsl:when> <xsl:otherwise> <!-- this is the default returning the unparsed text --> <xsl:copy-of select="."/> </xsl:otherwise> </xsl:choose> </xsl:template>
As a parser is more complex the xsl:choose can become quite an unwieldy statement and putting all parameter declaration on top of the one template is certainly not optimal. However, this construct works. We now see a good use case for an XSLT parser pre-processor that generates this executable XSLT code from a more abstract and user-firendly description of the grammar.
The distinction between the eligibility-testing and then the big xsl:choose statement can be used as an advantage because it allows the parser to check if there are multiple eligible branches. This can be used to issue a warning to the user who might then do something about conflict resolution.
Conflict resolution in regular expressions turns out to be an important problem that perhaps can best be solved by the user with the feedback from the parser. Ideally a parser transform would check all regex pattern analytically finding out if there are strings that can be matched by more than one pattern. And if so, could find out which of the two is more specific and automatically assign higher priority to the more specific pattern. Solving this question analytically, however, is a hard problem in computer science, and hence we will have to fall back to heuristics or ask the user for help.
One possible heuristics is to assign priority to the more complex regex patterns, i.e., the pattern that has a longer string representation. Another heuristsics is to do some parsing of the pattern and compare the size of the generated finite directed graph (FDG). In addition we would want to assert priority to the rules with more additional semantic constraints (such as the once checking for the indent level.)
Whatever the source of the priority assignments is, it will] ultimately reflect in the sequence of the xsl:when clauses in the big xsl:choose statement.
On the surface, the new XPatch regex support would obsolete the ORO-Matcher and my regex wrapper object. However, the two functions that my wrapper served were:
- keep a regex with an internal state (caching the last match) to avoid frequent re-matching of the same text or pieces of it
- allow these regex objects to appear in xsl:template/@match patterns
Particularly if you add the new xsl:analyze-string form into the mix, the need for these kinds of things may be entirely gone.
But, I keep coming back to the analogy of xsl:template matching to regex pattern matching. Having the matching rules handled by real XSLT templates with regex in the @match pattern is quite intuitive and much more generally useful than the simple tokenization that happens in the analyze-string form. The analyze-string form can only test a single regex, but in text parsing you need to try many patterns against the current head of the unparsed text.
So, even though developed for XSLT v1.0 the approach to freetext up-translating described here is still valid. Particularly the new xsl:analyze-string instruction does not obsolte my approach; because it can only find a sequence of tokens of one regex. One could potentially determine an absolute order among all the possible regexes and then nest those in the xsl:non-matching-substring instruction. However, this gets very messy and it may not be easily possible to clearly bring all the rules into a total order when the number of matching rules grows. Templates with priorities. are a much more clear and flexible alternative.
At this point, we have to deal with a unwieldy XSLT code either way, however. With only two slight changes to XSLT, one could solve this problem nicely in XSLT with nice code. The most important of these changes is to allow failing a template and tell the processor to try another matching template (a la xsl:next-match proposal.)
regards -Gunther
-- Gunther Schadow, M.D., Ph.D. gschadow@xxxxxxxxxxxxxxx Medical Information Scientist Regenstrief Institute for Health Care Adjunct Assistant Professor Indiana University School of Medicine tel:1(317)630-7960 http://aurora.regenstrief.org
Current Thread |
---|
|
<- Previous | Index | Next -> |
---|---|---|
RE: [xsl] Fwd: white space in MSIE, Dion Houston | Thread | RE: [xsl] Regex-Enabled XSLT is Pos, Michael Kay |
[xsl] Fwd: white space in MSIE, Wendell Piez | Date | [xsl] Java exception handling in XS, Gunther Schadow |
Month |