Subject: Re: [xsl] is there a way to hash an element?
From: "Graydon graydon@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 10 Jun 2016 09:04:37 -0000

That's an excellent point; thank you!

Because, yes, the serializations do end up being very long, which is
why I hadn't thought of going with serialize(); not being able to tell
by inspection which ones matched wasn't going to improve matters a
whole lot on the certainty front.

On Fri, Jun 10, 2016 at 07:59:46AM -0000, David Rudel fwqhgads@xxxxxxxxx scripsit:
> Note that if these serializations end up being very long and you want
> to reduce to a small signature (to match a typical hash), you can use
> the string-to-codepoints() function to generate a set of integers from
> any string that can be used to roll your own hashing function. Since you
> are just interested in checking that two descendant subtrees are
> identical---and are not concerned with security---a very simple
> compaction function would work fine. For example, you could create a
> user-defined function that takes any sequence of integers and returns
> the string X---Y, where X = length of sequence and Y is the remainder
> of $seq!(position() * .) upon division by a suitably large number (an
> extension of the typical UPC checksum algorithm).
>
> On Fri, Jun 10, 2016 at 12:51 AM, Dimitre Novatchev
> dnovatchev@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> > You may even not need a hash function.
> >
> > Just use the standard XPath 3.0 function:
> >
> > serialize()
> >
> >
> > http://www.w3.org/TR/xpath-functions-30/#func-serialize
> >
> >
> > Cheers,
> > Dimitre
> >
> > On Thu, Jun 9, 2016 at 3:08 PM, Graydon graydon@xxxxxxxxx
> > <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >> Hello all --
> >>
> >> So I've got about half a gigabyte of XML messages describing various
> >> health care actions. Many of these are structural duplicates of each
> >> other; the top elements differ by their attribute values, but the
> >> structure and values of the descendant elements is the same. The amount
> >> of duplication varies from none to thousands.
> >>
> >> I've got an apparently useful heuristic based on descendant attribute
> >> values, but would -- it is health care data -- really like to have a
> >> more robust way to group the elements into sets of equivalent top-level
> >> names by their structural sameness. (I can't hand-check the whole data
> >> set.)
> >>
> >> So I find myself wanting an equivalent of sha256sum for elements so I
> >> could generate a grouping key from the descendant elements and their
> >> associated attributes as a unit.
> >>
> >> Is there such a thing? Equivalent approaches?
> >>
> >> Thanks!
> >> Graydon
> >>
> >
> >
> >
> > --
> > Cheers,
> > Dimitre Novatchev
> > ---------------------------------------
> > Truly great madness cannot be achieved without significant intelligence.
> > ---------------------------------------
> > To invent, you need a good imagination and a pile of junk
> > -------------------------------------
> > Never fight an inanimate object
> > -------------------------------------
> > To avoid situations in which you might make mistakes may be the
> > biggest mistake of all
> > ------------------------------------
> > Quality means doing it right when no one is looking.
> > -------------------------------------
> > You've achieved success in your field when you don't know whether what
> > you're doing is work or play
> > -------------------------------------
> > To achieve the impossible dream, try going to sleep.
> > -------------------------------------
> > Facts do not cease to exist because they are ignored.
> > -------------------------------------
> > Typing monkeys will write all Shakespeare's works in 200yrs. Will they
> > write all patents, too? :)
> > -------------------------------------
> > Sanity is madness put to good use.
> > -------------------------------------
> > I finally figured out the only reason to be alive is to enjoy it.
> >
> >
>
> --
>
> "A false conclusion, once arrived at and widely accepted is not
> dislodged easily, and the less it is understood, the more tenaciously
> it is held." - Cantor's Law of Preservation of Ignorance.
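
For anyone reading this thread in the archive, here is a minimal sketch of the two suggestions combined, written in Python rather than XSLT so it can be run standalone. Several things here are assumptions and not from the thread: ET.canonicalize() stands in for XPath serialize(), the modulus 1000000007 is an arbitrary choice for the "suitably large number", Y is read as the remainder of the *sum* of the position-weighted codepoints, and the function names and the sample <messages> document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumption: any large modulus would do; this value is not from the thread.
MODULUS = 1_000_000_007


def signature(text, modulus=MODULUS):
    """Return "X---Y": X is the codepoint count, Y a position-weighted checksum."""
    # ord() per character plays the role of XPath string-to-codepoints().
    codepoints = [ord(ch) for ch in text]
    # Python equivalent of sum($seq ! (position() * .)), positions starting at 1.
    weighted = sum(pos * cp for pos, cp in enumerate(codepoints, start=1))
    return f"{len(codepoints)}---{weighted % modulus}"


def descendant_key(element):
    """Serialize an element's content while ignoring its own name and attributes."""
    # Re-parent the children under a neutral wrapper so that attribute
    # differences on the top element do not affect the key.
    wrapper = ET.Element("wrapper")
    wrapper.text = element.text
    wrapper.extend(element)
    # Canonical XML is used here as a stand-in for serialize(); strip_text
    # drops leading/trailing whitespace in text nodes.
    return ET.canonicalize(ET.tostring(wrapper, encoding="unicode"), strip_text=True)


def group_by_structure(root, id_attribute="id"):
    """Group the children of root by the signature of their descendant content."""
    groups = defaultdict(list)
    for message in root:
        groups[signature(descendant_key(message))].append(message.get(id_attribute))
    return dict(groups)


# Hypothetical sample data: "a" and "b" differ only in top-level attributes.
SAMPLE = """<messages>
  <msg id="a" when="mon"><act code="1"><dose>5</dose></act></msg>
  <msg id="b" when="tue"><act code="1"><dose>5</dose></act></msg>
  <msg id="c" when="wed"><act code="2"><dose>5</dose></act></msg>
</messages>"""

if __name__ == "__main__":
    for key, ids in group_by_structure(ET.fromstring(SAMPLE)).items():
        print(key, ids)
```

A checksum this simple can collide, so for data where certainty matters it seems prudent to treat the signature as a short, inspectable grouping key and then compare the full serializations within each group before declaring the members identical.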