Subject: Canonical XML in Databases (Re: [xsl] sort problem)
From: Peter Davis <pdavis152@xxxxxxxxx>
Date: Thu, 5 Sep 2002 00:16:43 -0700
On Thursday 05 September 2002 00:07, you wrote:
> [the output XML] will be treated as text at a later point. We need to match 
> xml
> from a database of some 100 xmls & find a match for the same. The problem
> is that to match 2 xmls, we will be using text comparison as
> 1. the database mite not support xml parsers
> 2. DOM matching wud be very time consuming...
> so the requirement is to store all files as text sorted as they will be
> treated as text only files by the database...
> in case there is a better way to find xml matches, do help me out on the
> same too...

(Hope you don't mind if I post this to the list -- I think it is an 
interesting question.)

Hmm, I have to say that isn't a very robust way to go about it.  There are 
several assumptions you have to make that can be broken by any piece of your 
system.

A lot of thought has been put into this problem, and the answer is even more 
complicated than just comparing two DOMs.  See this W3C recommendation for 
canonicalizing ("c14n") XML documents:

http://www.w3.org/TR/xml-c14n


The assumptions you have to make when you compare XML documents as text are 
(but aren't limited to):

* Attribute order: attribute order is not significant in XML, so even if you 
use <xsl:sort> when outputting attributes, there is no guarantee that your 
XSLT processor will serialize them in that order.

* Character sets: two documents can be written in different character sets and 
have different byte representations (your database might compare the text as 
a string of bytes, rather than a string of Unicode characters), yet have 
the same meaning.

* Character escaping / CDATA sections: exactly which characters are escaped by 
your processor is not guaranteed.  For example, '>' and '&gt;' have the same 
meaning, but obviously different text values.
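
To make the character-set point concrete, here is a small sketch in Python 
(chosen only for illustration; the sample element and variable names are 
made up).  It encodes the same document two ways and shows that the bytes 
differ while the parsed text does not:

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document containing one non-ASCII character (e with diaeresis).
body = '<name>Zo\u00eb</name>'

# The same document serialized in two different character sets.
as_utf8 = ('<?xml version="1.0" encoding="UTF-8"?>' + body).encode("utf-8")
as_latin1 = ('<?xml version="1.0" encoding="ISO-8859-1"?>' + body).encode("iso-8859-1")

# A byte-wise comparison (what a text-only database may do) sees two different values.
bytes_equal = as_utf8 == as_latin1

# An XML parser honors the encoding declaration and recovers the same characters.
text_equal = ET.fromstring(as_utf8).text == ET.fromstring(as_latin1).text

print(bytes_equal, text_equal)
```

A database that compares stored bytes would treat these as two unrelated 
documents, which is why fixing one encoding for everything you store matters.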

I'm sure there are many other considerations, which should all be addressed by 
the xml-c14n spec.
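
As a rough illustration of what canonicalization buys you, the sketch below 
uses Python's xml.etree.ElementTree.canonicalize (Python 3.8 or later).  Note 
that this function implements Canonical XML 2.0, a later revision than the 
recommendation linked above, and the two sample documents are invented for 
the example.  They differ in attribute order, quoting, and escaping, yet 
canonicalize to the same text:

```python
from xml.etree.ElementTree import canonicalize

# Two hypothetical documents with the same meaning but different text:
# attribute order, quote style, and '>' escaping vs. a CDATA section all differ.
doc_a = '<item b="2" a="1">x &gt; y</item>'
doc_b = "<item a='1'   b='2'><![CDATA[x > y]]></item>"

# canonicalize() returns the canonical form as a string.
canon_a = canonicalize(doc_a)
canon_b = canonicalize(doc_b)

print(doc_a == doc_b)      # raw text comparison
print(canon_a == canon_b)  # comparison of canonical forms
```

The raw strings do not match, but the canonical forms do, so a plain text 
comparison becomes meaningful once both sides have been canonicalized by the 
same tool.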


I'm not saying what you are trying to do won't work.  As long as you always 
use the same XML processor, stylesheet, character set (UTF-8?), and you don't 
add comments, CDATA sections, whitespace, etc., this will keep working.  What 
you should consider is what happens when you try to use a newer version of 
your processor that changes its rules (but still outputs equivalent XML, just 
not equivalent text), or if some new person comes to work for you who doesn't 
follow your rules.  Maintainability is always an issue when you are designing 
software systems.

So, just keep the issues in mind when you do this.  It will work, but if it 
stops working one day after you try to upgrade your system, you will know 
where to look.
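
If the database really can only compare text, one possible arrangement is to 
canonicalize each document before it goes in and match on a digest of the 
canonical form.  The sketch below is only that, a sketch: it assumes Python 
3.8+, uses an in-memory SQLite table as a stand-in for your database, and the 
table, column, and function names are all hypothetical.  It also assumes that 
whitespace-only differences should be ignored (strip_text=True), which is a 
policy choice and not something canonicalization requires.

```python
import hashlib
import sqlite3
from xml.etree.ElementTree import canonicalize


def canonical_key(xml_text):
    """Return a SHA-256 hex digest of the canonical form (hypothetical helper)."""
    # strip_text=True drops leading/trailing whitespace in text content;
    # this is an assumption about what should count as "the same" document.
    canonical = canonicalize(xml_text, strip_text=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# In-memory SQLite stands in for the real, text-only database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE docs (key TEXT PRIMARY KEY, body TEXT)")

stored = '<order id="7" status="open"><item>pen</item></order>'
db.execute("INSERT INTO docs VALUES (?, ?)", (canonical_key(stored), stored))

# Same content, different attribute order, quoting, and indentation.
incoming = "<order status='open' id='7'>\n  <item>pen</item>\n</order>"
match = db.execute(
    "SELECT body FROM docs WHERE key = ?", (canonical_key(incoming),)
).fetchone()

print(match is not None)
```

The lookup is still a plain string comparison on the database side, so no XML 
support is needed there.  The caveat from above still applies: if the 
canonicalizer or its options change, the stored keys have to be regenerated.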

-- 
Peter Davis

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

