Subject: Canonical XML in Databases (Re: [xsl] sort problem)
From: Peter Davis <pdavis152@xxxxxxxxx>
Date: Thu, 5 Sep 2002 00:16:43 -0700

On Thursday 05 September 2002 00:07, you wrote:
> [the output XML] will be treated as text at a later point. We need to
> match xml from a database of some 100 xmls & find a match for the
> same. The problem is that to match 2 xmls, we will be using text
> comparison as
> 1. the database mite not support xml parsers
> 2. DOM matching wud be very time consuming...
> so the requirement is to store all files as text sorted as they will
> be treated as text only files by the database...
> in case there is a better way to find xml matches, do help me out on
> the same too...

(Hope you don't mind if I post this to the list -- I think it is an
interesting question.)

Hmm, I have to say that isn't a very robust way to go about it. There
are several assumptions you have to make, and any piece of your system
can break them. A lot of thought has been put into this problem, and
the answer is even more complicated than just comparing two DOMs. See
this W3C recommendation for canonicalizing ("c14n") XML documents:

http://www.w3.org/TR/xml-c14n

The assumptions you have to make when you compare XML documents as text
include (but aren't limited to):

* Attribute order: even if you use <xsl:sort> when outputting
  attributes, there is no guarantee that your XSLT processor will honor
  that order when it serializes the result, because attribute order is
  not significant in XML.

* Character sets: two documents can be written in different character
  sets and have different byte representations (your database might
  compare the text as a string of bytes, rather than a string of
  Unicode characters), and yet have the same meaning.

* Character escaping / CDATA sections: exactly which characters your
  processor escapes is not guaranteed. For example, '>' and '&gt;' have
  the same meaning, but obviously different text values.

I'm sure there are many other considerations, which should all be
addressed by the xml-c14n spec.

I'm not saying what you are trying to do won't work.
As long as you always use the same XML processor, stylesheet, and
character set (UTF-8?), and you don't add comments, CDATA sections,
whitespace, etc., this will keep working. What you should consider is
what happens when you move to a newer version of your processor that
changes its rules (but still outputs equivalent XML, just not
equivalent text), or when some new person comes to work for you who
doesn't follow your rules. Maintainability is always an issue when you
are designing software systems.

So, just keep these issues in mind when you do this. It will work, but
if it stops working one day after you upgrade your system, you will
know where to look.

-- 
Peter Davis

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
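
To make the canonicalization suggestion in this message concrete, here is a minimal sketch in Python (it is not from the original thread). It assumes Python 3.8 or later, whose standard library function xml.etree.ElementTree.canonicalize implements C14N 2.0, a later revision of the xml-c14n recommendation linked above, so its output may differ in detail from a C14N 1.0 tool. The helper names canonical_text and match_key and the sample documents are hypothetical. The idea is to canonicalize each document before storing it, then let the database do its plain text comparison on the canonical form, or on a hash of it.

```python
import hashlib
from xml.etree.ElementTree import canonicalize  # available in Python 3.8+


def canonical_text(xml_data):
    """Return a canonical (C14N 2.0) serialization of xml_data (str or bytes)."""
    return canonicalize(xml_data)


def match_key(xml_data):
    """Hypothetical helper: fixed-length key suitable for a plain text column."""
    canonical = canonical_text(xml_data)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Same content, but different attribute order, quoting, and escaping style.
doc_a = '<item b="2" a="1">x &gt; y</item>'
doc_b = "<item a='1'   b='2'><![CDATA[x > y]]></item>"

# Same content, but stored in two different character encodings.
doc_latin1 = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><n>caf\u00e9</n>'
).encode("latin-1")
doc_utf8 = (
    '<?xml version="1.0" encoding="UTF-8"?><n>caf\u00e9</n>'
).encode("utf-8")

# Raw text comparison versus comparison of canonical keys.
print("raw text equal:      ", doc_a == doc_b)
print("canonical keys equal:", match_key(doc_a) == match_key(doc_b))
print("encodings equal:     ", match_key(doc_latin1) == match_key(doc_utf8))
```

Whitespace inside text content is still significant in this sketch, and comments are dropped by default, so two documents that differ only in indentation would still get different keys unless whitespace is normalized first (canonicalize has a strip_text option that may help with that).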