Re: [xsl] Testing 2 XML documents for equality - a solution

Subject: Re: [xsl] Testing 2 XML documents for equality - a solution
From: David Carlisle <davidc@xxxxxxxxx>
Date: Wed, 30 Mar 2005 18:10:44 +0100
  (i.e. introducing an extra character between attribute
  name and value, which is unlikely to occur in the
  attribute value; for e.g. a newline character) 

How do you define unlikely? I can easily provide a counterexample.
(Although actually adding such a separator works even if
the separator is in the attribute value, as it uniquely terminates the
name in the string; you only need to use a character that is not a name
character.)

I mentioned attributes but you do the same for elements, so you need the
same fix there (with a different character), as you otherwise don't
distinguish element nodes from attribute nodes of the same name.
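
To illustrate the separator point, here is a minimal sketch, assuming XSLT 1.0. The choice of ">" after element names and "=" after attribute names is mine, purely for illustration (neither is a name character); it is not taken from the stylesheet under discussion.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- elements and attributes, visited in document order -->
    <xsl:for-each select="//* | //@*">
      <xsl:value-of select="name()"/>
      <xsl:choose>
        <!-- self::* only matches elements, never attributes -->
        <xsl:when test="self::*">
          <xsl:text>&gt;</xsl:text>
        </xsl:when>
        <xsl:otherwise>
          <xsl:text>=</xsl:text>
          <xsl:value-of select="."/>
        </xsl:otherwise>
      </xsl:choose>
      <!-- newline is for readability only: values may contain
           newlines, so it is not a safe record delimiter -->
      <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Without the "=", attributes ab="c" and a="bc" both contribute "abc"; with it they contribute "ab=c" and "a=bc". Likewise an element x now yields "x>" while an attribute x yields "x=" followed by its value.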

I also notice that you don't record which element an attribute is on, so
looking at your proposed fix

<xsl:for-each select="$doc1//@*">
  <xsl:value-of select="name()"
/><xsl:text>&#xa;</xsl:text><xsl:value-of select="."
/></xsl:for-each>

<x a="2">


 <b a="2"/>

would both generate the same attribute test string,
so would compare equal.

  These documents are reported not equal!

are you sure?

  I think here I am right!


   For this example, the $doc1//node() path
   expression returns 4 nodes (2 element nodes and 2
   "white space text nodes")


  The "white space text
  nodes" will be filtered by the predicate
  [not(normalize-space(self::text()) = '')] 

yes but also any element node will be filtered as self::text() on an
element node will return an empty node set (as it isn't a text node)
and normalize-space() on that will return ''

so the whole select expression on the for-each returns an empty node set.
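
A small sketch, assuming XSLT 1.0, that counts what each form of the predicate selects. The second predicate is one possible adjustment of my own, not something proposed in the thread: it tests for "is a white space text node" inside a nested predicate, so non-text nodes are no longer caught.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- on an element, self::text() is empty, normalize-space()
         of that is '', so the element is filtered out as well -->
    <xsl:text>original predicate: </xsl:text>
    <xsl:value-of
      select="count(//node()[not(normalize-space(self::text()) = '')])"/>

    <!-- only drops nodes that are text nodes AND are white space -->
    <xsl:text>&#xa;adjusted predicate: </xsl:text>
    <xsl:value-of
      select="count(//node()[not(self::text()[normalize-space() = ''])])"/>
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

Run against a document with 2 element nodes and 2 white space text nodes (and a parser that keeps the white space), the first count should be 0 and the second 2.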

  I agree that the XML parser is not expected to report
  attribute nodes in same order. But I guess we can
  reasonably assume that a "specific XML parser" would
  report attributes in same order.

more guesses.

  I have tested the same example with a single product
  multiple times, and always I am getting same result..

probably true, but you never really know. attributes are often put into
some kind of hashed data structure so the order they come out can depend
on all sorts of strange factors.

These things can be fixed (e.g. by sorting attribute nodes to be
alphabetical) but as Michael just indicated the process is always likely
to be very inefficient. You _always_ generate a really huge string for
each document, even if the top level nodes are
<foo version="1"> and <foo version="2">.
In that case you'd really like to stop there and not generate a text
string of the 100001 child nodes below foo.
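
To make the sorting idea concrete: a minimal sketch, assuming XSLT 1.0, that writes each element's attributes in name order so that the order the parser reports them in no longer matters. The "=" separator and the one-line-per-element layout are illustrative choices of mine, and the sketch does nothing about the efficiency problem.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <xsl:for-each select="//*">
      <xsl:value-of select="name()"/>
      <!-- attributes of this element only, sorted by name -->
      <xsl:for-each select="@*">
        <xsl:sort select="name()"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="name()"/>
        <xsl:text>=</xsl:text>
        <xsl:value-of select="."/>
      </xsl:for-each>
      <xsl:text>&#xa;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

Both <foo b="1" a="2"/> and <foo a="2" b="1"/> then produce the line "foo a=2 b=1". Note that name() includes any prefix, so this assumes both documents use the same prefixes.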

Given that you are walking over the trees anyway to generate the
strings, you should be able to walk over the two trees in parallel and
stop whenever you find a difference.
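
A rough sketch of such a parallel walk, assuming XSLT 1.0. The template name "walk", the parameter "other" and the "x" marker are all hypothetical names of mine, and this is not the stylesheet whose output is shown below. It compares node kind, name, text value and attributes (in any order), and only descends or moves on to the next sibling while no difference has been found. White space text nodes are compared as they stand.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <!-- hypothetical parameter: URI of the second document,
       resolved relative to the stylesheet -->
  <xsl:param name="other" select="'file2.xml'"/>

  <xsl:template match="/">
    <xsl:variable name="diff">
      <xsl:call-template name="walk">
        <xsl:with-param name="a" select="node()[1]"/>
        <xsl:with-param name="b" select="document($other)/node()[1]"/>
      </xsl:call-template>
    </xsl:variable>
    <xsl:choose>
      <xsl:when test="string($diff)">different</xsl:when>
      <xsl:otherwise>equal</xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Emits 'x' at the first difference found between node $a and
       node $b, their subtrees, or their following siblings;
       emits nothing if no difference is found. -->
  <xsl:template name="walk">
    <xsl:param name="a"/>
    <xsl:param name="b"/>
    <!-- differences visible on this pair of nodes alone -->
    <xsl:variable name="here">
      <xsl:if test="boolean($a/self::*) != boolean($b/self::*)
                    or boolean($a/self::text()) != boolean($b/self::text())
                    or local-name($a) != local-name($b)
                    or namespace-uri($a) != namespace-uri($b)
                    or (not($a/self::*) and string($a) != string($b))
                    or count($a/@*) != count($b/@*)">x</xsl:if>
      <!-- every attribute of $a needs a same-named, same-valued
           attribute on $b; attribute order is irrelevant -->
      <xsl:for-each select="$a/@*">
        <xsl:if test="not($b/@*[local-name() = local-name(current())
                      and namespace-uri() = namespace-uri(current())
                      and . = current()])">x</xsl:if>
      </xsl:for-each>
    </xsl:variable>
    <xsl:choose>
      <!-- both sibling lists exhausted together: no difference -->
      <xsl:when test="not($a) and not($b)"/>
      <!-- one list is longer than the other -->
      <xsl:when test="not($a) or not($b)">x</xsl:when>
      <!-- stop here: nothing below this node is visited -->
      <xsl:when test="string($here)">x</xsl:when>
      <xsl:otherwise>
        <xsl:variable name="below">
          <xsl:call-template name="walk">
            <xsl:with-param name="a" select="$a/node()[1]"/>
            <xsl:with-param name="b" select="$b/node()[1]"/>
          </xsl:call-template>
        </xsl:variable>
        <xsl:choose>
          <xsl:when test="string($below)">x</xsl:when>
          <xsl:otherwise>
            <!-- subtrees agree, so move on to the next siblings -->
            <xsl:call-template name="walk">
              <xsl:with-param name="a"
                select="$a/following-sibling::node()[1]"/>
              <xsl:with-param name="b"
                select="$b/following-sibling::node()[1]"/>
            </xsl:call-template>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>
```

With top level nodes <foo version="1"> and <foo version="2"> the attribute check fails on the first pair of nodes, so none of the children are looked at. The recursion over siblings can get deep for very long child lists, so whether this is practical depends on how the processor handles such calls.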


See what saxon says:

$ saxon eq.xsl eq.xsl  iws=y


$ cat file1.xml

$ cat file2.xml

so when ignoring white space text nodes the stylesheet reports 
as equal to
