Re: [xsl] Question on duplicate node elimination

Subject: Re: [xsl] Question on duplicate node elimination
From: Hermann Stamm-Wilbrandt <STAMMW@xxxxxxxxxx>
Date: Mon, 23 Aug 2010 10:23:44 +0200
> What makes you think it would be difficult?
It looks at least not obvious to me and I thought I better ask here
before reinventing the wheel.

> Of course, a processor needs some way to decide whether two nodes are
> identical/distinct. Given such a mechanism, it's not difficult to come
> up with algorithms that eliminate duplicate nodes.
The question is how this mechanism might look like in XPath 1.0 plus
exslt:node-set function, see below.

> In practice, when XPath 1.0 is used as part of XSLT 1.0, the XPath
> requirement to eliminate duplicates can always be combined with the XSLT
> requirement to deliver the node-set sorted in document order.

OK, in thread [1] it was asked how to dynamically evaluate XPath
expressions.
Dimitre's posting [2] of a nametest only single node result XSLT 1.0
solution and the enhancement of that to handle result sets in [3] made me
start my own XPath evaluator 10 days ago [4]. While it is complete on the
Location step side it has near zero support for predicates and cannot
handle duplicate node elimination (last sample).

In posting [5] I learned that XSLTForm project has its own built in (not
complete) XPath evaluator. Editing file xpath.xml allows to test some
XPath expressions and it turned out that this XPath evaluator is not able
to do duplicate node elimination as well.

Searching Google for "XPath Parser" did not help. But searching the list
archives of xsl-list I found the very interesting posting [6] of Jeni
Tenison on his XPath parser xpath.xsl generating a parse tree and
xpath-doc.xsl generating an explanation of a XPath statement.

His xpath.xsl seems to be pretty complete (as an indicator the call graph
for the named templates [7], generated by stylesheet [10] with dot/graphwiz
[9] seems to match the spec quite good).

So my idea was to just start with xpath-doc.xsl and instead of generating
an explanation make it do the actual XPath evaluation instead.
And this is the point where duplicate node elimination is needed.

> So the
> natural way to eliminate duplicates is as part of the sorting process.
Which sorting process -- a global one or the one for the incremental steps
in working on the XPath parse tree?

ANY help is really appreciated!


Take this XPath as a more complex example (the duplicate nodes may exist
on any level in the input ducument):
//c/ancestor::*

Jeni's xpath-doc.xsl generates this explanation:
  the root node's descendant <span class="name">c</span> elements's
ancestor
  elements

And his xpath.xsl generates this parse tree:
<path>
  <root/>
  <step axis="descendant-or-self">
    <node/>
  </step>
  <step axis="child">
    <element name="c" namespace="default"/>
  </step>
  <step axis="ancestor">
    <element namespace="any"/>
  </step>
</path>


Having a solution would enable the major browsers (with the technique of
David Carlisle [8]) to dynamically evaluate Xpath expressions without
having a dyn:evaluate() implementation.


[1]
http://stackoverflow.com/questions/3015942/retrieving-xml-node-from-a-path-specified-in-an-attribute-value-of-another-node
[2]
http://stackoverflow.com/questions/3015942/retrieving-xml-node-from-a-path-specified-in-an-attribute-value-of-another-node/3017752#3017752
[3]
http://stackoverflow.com/questions/3015942/retrieving-xml-node-from-a-path-specified-in-an-attribute-value-of-another-node/3485079#3485079
[4] http://stamm-wilbrandt.de/en/xsl-list/xpath1/v0.3/
[5]
http://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/201008/msg00203.html
[6]
http://www.biglist.com/lists/lists.mulberrytech.com/xsl-list/archives/200212/msg00315.html
[7] http://stamm-wilbrandt.de/en/xsl-list/xpath.dot.pdf
[8] http://dpcarlisle.blogspot.com/2007/05/exslt-node-set-function.html
[9] http://www.graphviz.org/
[10]
$ cat dotnc.xsl
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform";
>
  <xsl:output method="text"/>

  <xsl:template match="/">
digraph G { size="7.5,10"; rankdir=TB; concentrate=true;

    <xsl:for-each select="//xsl:template[@name]">
      "<xsl:value-of select="@name"/>"[shape=box];
      <xsl:if test=".//xsl:apply-templates">
        "<xsl:value-of select="@name"/>" -> "xsl:apply-templates";
      </xsl:if>
    </xsl:for-each>

    <xsl:for-each select="//xsl:call-template">
      "<xsl:value-of select="string(ancestor::xsl:template/@name)"/>"
      -> "<xsl:value-of select="@name"/>";
    </xsl:for-each>
}
  </xsl:template>

</xsl:stylesheet>
$
$ xsltproc dotnc.xsl xpath.xsl >xpath.dot
$ dot -Tps xpath.dot >xpath.dot.ps
$ ps2pdf xpath.dot.ps
$


Mit besten Gruessen / Best wishes,

Hermann Stamm-Wilbrandt
Developer, XML Compiler, L3
WebSphere DataPower SOA Appliances
----------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martin Jetter
Geschaeftsfuehrung: Dirk Wittkopp
Sitz der Gesellschaft: Boeblingen
Registergericht: Amtsgericht Stuttgart, HRB 243294



From:       Michael Kay <mike@xxxxxxxxxxxx>
To:         xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Date:       08/23/2010 12:36 AM
Subject:    Re: [xsl] Question on duplicate node elimination




> But how could the algorithm step of "duplicate elimination" be done?
> How can the duplicates be determined and removed, correctly?
>
>

What makes you think it would be difficult?

Of course, a processor needs some way to decide whether two nodes are
identical/distinct. Given such a mechanism, it's not difficult to come
up with algorithms that eliminate duplicate nodes.

In practice, when XPath 1.0 is used as part of XSLT 1.0, the XPath
requirement to eliminate duplicates can always be combined with the XSLT
requirement to deliver the node-set sorted in document order. So the
natural way to eliminate duplicates is as part of the sorting process.

For performance, the most important technique is static analysis to
identify those path expressions where the sort (and duplicate
elimination) are unnecessary. For example, this is the case for the
expression /a/b/c if it is evaluated either (a) using nested loops, or
(b) by scanning the entire source document looking for nodes that match
this pattern. For the expression //x//y, a sort is necessary if the
evaluation uses nested loops, but not if it uses a whole-document scan
and pattern matching. Remember that the evaluation techniques used
internally may be very different from the descriptions you find in
explanations of the semantics of the language.

The way you have phrased the question suggests that you might be
worrying about how exslt:node-set() affects the process. Simple answer -
it doesn't.

Michael Kay
Saxonica

Current Thread