Re: [xsl] is there a way to hash an element?

Subject: Re: [xsl] is there a way to hash an element?
From: "Dimitre Novatchev dnovatchev@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 13 Jun 2016 06:54:22 -0000
Hi Graydon,


> Going through and decorating every element with its hash value
> (@hash="something") and then using for-each-group on the lot on the
> basis of the hash gives me 2n.  Even if it's a very naive hash
> implementation, I'd expect 2n to beat n^2 performance.

Maybe I'm misunderstanding something here, but I thought you needed to
hash every sub-tree -- not just every element?

Maybe we also need to take into account the average number of
attributes per element (fortunately, their order does not matter)?

Even if the accumulated hash from the ancestors is used, one still
needs to hash all attributes of the element, and the values of its
text-node children. Namespaces (namespace nodes) need to be reflected
too.

Also, combining the accumulated hash with the local one needs a more
complicated operation than plain addition (which is ambiguous, because
different inputs can produce the same sum); probably shifting
(multiplication) followed by addition.
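
As a rough illustration of the "shift, then add" idea, here is a minimal
sketch in Python rather than XSLT, using the standard
xml.etree.ElementTree module. The names MOD, BASE, combine, hash_text and
subtree_hash are all hypothetical, and the constants are arbitrary
choices. ElementTree expands prefixed names to "{uri}local", so the
namespaces of element and attribute names are reflected, but other
in-scope namespace nodes are not.

```python
import xml.etree.ElementTree as ET

MOD = (1 << 61) - 1   # large prime modulus (arbitrary illustrative choice)
BASE = 1_000_003      # multiplier that plays the role of the "shift"

def combine(acc, value):
    """Shift the accumulated hash (multiply), then add the next value."""
    return (acc * BASE + value) % MOD

def hash_text(s):
    """Order-sensitive hash of a string, built with combine()."""
    h = 0
    for ch in s:
        h = combine(h, ord(ch) + 1)
    return h

def subtree_hash(elem):
    """Hash a whole subtree: name, attributes, text nodes and children."""
    h = hash_text(elem.tag)
    # Attribute order is not significant, so sort before combining.
    h = combine(h, len(elem.attrib))
    for name, value in sorted(elem.attrib.items()):
        h = combine(h, hash_text(name))
        h = combine(h, hash_text(value))
    h = combine(h, hash_text(elem.text or ""))
    # Child order IS significant; plain addition would lose it.
    for child in elem:
        h = combine(h, subtree_hash(child))
        h = combine(h, hash_text(child.tail or ""))
    return h

def additive_hash(elem):
    """Deliberately naive: plain addition cannot see child order."""
    return hash_text(elem.tag) + sum(additive_hash(c) for c in elem)
```

With this sketch, <a><b/><c/></a> and <a><c/><b/></a> get the same
additive_hash but different subtree_hash values. Equal hashes still only
suggest equality, so a final check of each candidate group would be
prudent.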


Anyway, deep-equal() only returns a Boolean, so it would be difficult
to use it for grouping.
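
To make the difference concrete, here is a small sketch (Python, not
XSLT; the function names group_by_key and group_by_predicate are
hypothetical). Grouping by a key needs one pass, much like
xsl:for-each-group with group-by, while a Boolean "are these the same?"
test can only be applied item against item, or at best item against one
representative per group.

```python
from collections import defaultdict

def group_by_key(items, key):
    """One pass: compute a key per item and bucket items by that key."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return list(groups.values())

def group_by_predicate(items, same):
    """Only a Boolean test is available: compare each item against one
    representative of every group found so far."""
    groups = []
    for item in items:
        for group in groups:
            if same(group[0], item):
                group.append(item)
                break
        else:
            # No existing group matched, so start a new one.
            groups.append([item])
    return groups
```

The second function performs a number of comparisons proportional to
(items x groups), which approaches n^2 when most items are distinct.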

I still think that serializing the whole document and annotating
every element with its start offset in the serialized string may be
efficient and convenient.
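
One possible reading of this idea, sketched in Python under several
assumptions: the serializer below is a hypothetical, simplified one (no
character escaping, attributes sorted by name, empty elements written
with explicit end tags), and it records an end offset as well as the
start so that each element's slice of the string can be used directly as
a grouping key.

```python
import xml.etree.ElementTree as ET

def serialize_with_offsets(root):
    """Serialize the tree once; return (text, spans), where spans maps
    each element to its (start, end) offsets in text."""
    parts = []
    spans = {}
    pos = 0

    def emit(s):
        nonlocal pos
        parts.append(s)
        pos += len(s)

    def walk(elem):
        start = pos
        # Sorted attributes give a stable form regardless of source order.
        attrs = "".join(f' {k}="{v}"' for k, v in sorted(elem.attrib.items()))
        emit(f"<{elem.tag}{attrs}>")
        emit(elem.text or "")
        for child in elem:
            walk(child)
            emit(child.tail or "")
        emit(f"</{elem.tag}>")
        spans[elem] = (start, pos)

    walk(root)
    return "".join(parts), spans
```

Two elements whose slices text[start:end] are identical strings have the
same serialized subtree, so the slice can serve as the key for grouping.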


Cheers,
Dimitre


On Sun, Jun 12, 2016 at 6:17 PM, Graydon graydon@xxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> On Sat, Jun 11, 2016 at 05:21:09PM -0000, Dimitre Novatchev dnovatchev@xxxxxxxxx scripsit:
> Hi Dimitre --
>> Actually, I believe that calling deep-equal() can be more efficient
>> than comparing hashes.
>>
>> The reason is simple: deep-equal() most probably returns false at the
>> first possible moment -- for example, noticing that an element has
>> different attributes than its counterpart.
>>
>> On the other side, with hashing,  the hashes for the two whole
>> subtrees have to be calculated and only after that they can be
>> compared.
>>
>> To summarize, with the exception of the case when the two subtrees are
>> equal, deep-equal may perform faster than generating and comparing
>> hashes on the subtrees.
>
> I've got one input document with ~5000 trees that are mappable to XSD
> schema definitions; about half are complexTypes.  Many are structurally
> the same but have different names. (All ~5000 have unique names.)
>
> The idea is to group them by structural sameness; deep-equal, even very
> efficiently implemented deep-equal, gives me n^2 as I have to go through
> the whole tree for each element and ask "are you like me?" pairwise.
> Some of the equivalent structures will have a lot of matches -- hundreds
> -- where I can't expect deep-equal to fail quickly and thus efficiently.
>
> Going through and decorating every element with its hash value
> (@hash="something") and then using for-each-group on the lot on the
> basis of the hash gives me 2n.  Even if it's a very naive hash
> implementation, I'd expect 2n to beat n^2 performance.
>
> Am I missing something?
>
> (I'll certainly keep deep-equal in mind if the hash approach has
> unacceptable performance.)
>
> -- Graydon
> 