Re: [xsl] Question on duplicate node elimination

Subject: Re: [xsl] Question on duplicate node elimination
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Mon, 23 Aug 2010 10:57:47 +0100
Looking at your code, I can see why you are stuck. But I'm afraid my answer is that to get to your desired destination "I wouldn't start from here". The assumption that you can parse the XPath expression from left to right and evaluate it as you parse it is just fundamentally flawed, and your problem with duplicate elimination is just the first point where you have discovered this.

That leaves the question, how would I advise tackling this problem? Well, I'd start by allocating rather more time for it than I have available. Using the interpreter design pattern the normal thing would be to start by parsing to generate an abstract syntax tree (Jeni's code seems to do much of this for you - she's a she, by the way). Then conventionally you write an interpreter that walks the nodes of this tree treating each node as an operator to be evaluated by evaluating its sub-expressions and returning a result. The problem here is that you if you want to write this in XSLT 1.0, XSLT 1.0 templates don't give you the opportunity to return node-sets (even with the exslt:node-set() function, you can only return newly constructed nodes, not nodes from the input document). To get round that problem, you can probably use the FXSL approach: pass the template a parameter representing a function to be applied to the selected nodes. But it's going to be tough, and I don't have the head for it this morning.

The other - considerably easier - approach would be to write the interpreter in Javascript.

Michael Kay
Saxonica

On 23/08/2010 09:23, Hermann Stamm-Wilbrandt wrote:
What makes you think it would be difficult?
It looks at least not obvious to me and I thought I better ask here
before reinventing the wheel.

Of course, a processor needs some way to decide whether two nodes are
identical/distinct. Given such a mechanism, it's not difficult to come
up with algorithms that eliminate duplicate nodes.
The question is how this mechanism might look like in XPath 1.0 plus
exslt:node-set function, see below.

In practice, when XPath 1.0 is used as part of XSLT 1.0, the XPath
requirement to eliminate duplicates can always be combined with the XSLT
requirement to deliver the node-set sorted in document order.
OK, in thread [1] it was asked how to dynamically evaluate XPath
expressions.
Dimitre's posting [2] of a nametest only single node result XSLT 1.0
solution and the enhancement of that to handle result sets in [3] made me
start my own XPath evaluator 10 days ago [4]. While it is complete on the
Location step side it has near zero support for predicates and cannot
handle duplicate node elimination (last sample).

In posting [5] I learned that XSLTForm project has its own built in (not
complete) XPath evaluator. Editing file xpath.xml allows to test some
XPath expressions and it turned out that this XPath evaluator is not able
to do duplicate node elimination as well.

Searching Google for "XPath Parser" did not help. But searching the list
archives of xsl-list I found the very interesting posting [6] of Jeni
Tenison on his XPath parser xpath.xsl generating a parse tree and
xpath-doc.xsl generating an explanation of a XPath statement.

His xpath.xsl seems to be pretty complete (as an indicator the call graph
for the named templates [7], generated by stylesheet [10] with dot/graphwiz
[9] seems to match the spec quite good).

So my idea was to just start with xpath-doc.xsl and instead of generating
an explanation make it do the actual XPath evaluation instead.
And this is the point where duplicate node elimination is needed.

So the
natural way to eliminate duplicates is as part of the sorting process.
Which sorting process -- a global one or the one for the incremental steps
in working on the XPath parse tree?

ANY help is really appreciated!


Take this XPath as a more complex example (the duplicate nodes may exist on any level in the input ducument): //c/ancestor::*

Jeni's xpath-doc.xsl generates this explanation:
   the root node's descendant<span class="name">c</span>  elements's
ancestor
   elements

And his xpath.xsl generates this parse tree:
<path>
   <root/>
   <step axis="descendant-or-self">
     <node/>
   </step>
   <step axis="child">
     <element name="c" namespace="default"/>
   </step>
   <step axis="ancestor">
     <element namespace="any"/>
   </step>
</path>


Having a solution would enable the major browsers (with the technique of David Carlisle [8]) to dynamically evaluate Xpath expressions without having a dyn:evaluate() implementation.


[1]
http://stackoverflow.com/questions/3015942/retrieving-xml-node-from-a-path-specified-in-an-attribute-value-of-another-node
[2]
http://stackoverflow.com/questions/3015942/retrieving-xml-node-from-a-path-specified-in-an-attribute-value-of-another-node/3017752#3017752
[3]
http://stackoverflow.com/questions/3015942/retrieving-xml-node-from-a-path-specified-in-an-attribute-value-of-another-node/3485079#3485079
[4] http://stamm-wilbrandt.de/en/xsl-list/xpath1/v0.3/
[5]
http://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/201008/msg00203.html
[6]
http://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/200212/msg00315.html
[7] http://stamm-wilbrandt.de/en/xsl-list/xpath.dot.pdf
[8] http://dpcarlisle.blogspot.com/2007/05/exslt-node-set-function.html
[9] http://www.graphviz.org/
[10]
$ cat dotnc.xsl
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform";
<xsl:output method="text"/>

   <xsl:template match="/">
digraph G { size="7.5,10"; rankdir=TB; concentrate=true;

     <xsl:for-each select="//xsl:template[@name]">
       "<xsl:value-of select="@name"/>"[shape=box];
       <xsl:if test=".//xsl:apply-templates">
         "<xsl:value-of select="@name"/>" ->  "xsl:apply-templates";
       </xsl:if>
     </xsl:for-each>

     <xsl:for-each select="//xsl:call-template">
       "<xsl:value-of select="string(ancestor::xsl:template/@name)"/>"
       ->  "<xsl:value-of select="@name"/>";
     </xsl:for-each>
}
   </xsl:template>

</xsl:stylesheet>
$
$ xsltproc dotnc.xsl xpath.xsl>xpath.dot
$ dot -Tps xpath.dot>xpath.dot.ps
$ ps2pdf xpath.dot.ps
$


Mit besten Gruessen / Best wishes,


Hermann Stamm-Wilbrandt
Developer, XML Compiler, L3
WebSphere DataPower SOA Appliances
----------------------------------------------------------------------
IBM Deutschland Research&  Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294



From:       Michael Kay<mike@xxxxxxxxxxxx>
To:         xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Date:       08/23/2010 12:36 AM
Subject:    Re: [xsl] Question on duplicate node elimination




But how could the algorithm step of "duplicate elimination" be done?
How can the duplicates be determined and removed, correctly?


What makes you think it would be difficult?

Of course, a processor needs some way to decide whether two nodes are
identical/distinct. Given such a mechanism, it's not difficult to come
up with algorithms that eliminate duplicate nodes.

In practice, when XPath 1.0 is used as part of XSLT 1.0, the XPath
requirement to eliminate duplicates can always be combined with the XSLT
requirement to deliver the node-set sorted in document order. So the
natural way to eliminate duplicates is as part of the sorting process.

For performance, the most important technique is static analysis to
identify those path expressions where the sort (and duplicate
elimination) are unnecessary. For example, this is the case for the
expression /a/b/c if it is evaluated either (a) using nested loops, or
(b) by scanning the entire source document looking for nodes that match
this pattern. For the expression //x//y, a sort is necessary if the
evaluation uses nested loops, but not if it uses a whole-document scan
and pattern matching. Remember that the evaluation techniques used
internally may be very different from the descriptions you find in
explanations of the semantics of the language.

The way you have phrased the question suggests that you might be
worrying about how exslt:node-set() affects the process. Simple answer -
it doesn't.

Michael Kay
Saxonica

Current Thread