Re: [xsl] is there a way to hash an element?

Subject: Re: [xsl] is there a way to hash an element?
From: "Graydon graydon@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 10 Jun 2016 09:04:37 -0000

That's an excellent point; thank you!

Because, yes, the serializations do end up being very long, which is why
I hadn't thought of going with serialize(); not being able to tell by
inspection which ones matched wasn't going to improve matters a whole
lot on the certainty front.
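
A minimal sketch of the compaction idea quoted below, written in Python rather than XSLT so it can be run standalone. The names `signature` and `group_by_descendants` and the choice of modulus are assumptions for illustration, not anything from the thread; `ord()` stands in for string-to-codepoints() and `ET.tostring()` for serialize().

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Assumed "suitably large number"; any large prime would serve.
MODULUS = 1_000_000_007


def signature(text):
    """Return 'X---Y': X is the codepoint count, Y is the
    position-weighted codepoint sum modulo MODULUS."""
    codepoints = [ord(ch) for ch in text]  # analogue of string-to-codepoints()
    weighted = sum(pos * cp for pos, cp in enumerate(codepoints, start=1))
    return f"{len(codepoints)}---{weighted % MODULUS}"


def group_by_descendants(messages):
    """Group elements by a signature of their serialized content.

    Only the content below each top element is serialized, so the
    top element's own attributes do not affect the grouping key.
    """
    groups = defaultdict(list)
    for msg in messages:
        body = (msg.text or "") + "".join(
            ET.tostring(child, encoding="unicode") for child in msg
        )
        groups[signature(body)].append(msg)
    return groups
```

The signature is not collision-proof, so comparing the full serializations within each group would still be needed to confirm a match. Differences in attribute order or whitespace may also produce different keys unless the input is normalized first.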

On Fri, Jun 10, 2016 at 07:59:46AM -0000, David Rudel fwqhgads@xxxxxxxxx wrote:
> Note that if these serializations end up being very long and you want
> to reduce them to a small signature (to match a typical hash), you can
> use the string-to-codepoints() function to generate a sequence of
> integers from any string, which can be used to roll your own hashing
> function. Since you are just interested in checking that two
> descendant subtrees are identical---and are not concerned with
> security---a very simple compaction function would work fine. For
> example, you could create a user-defined function that takes any
> sequence of integers and returns the string X---Y, where X = length of
> the sequence and Y is the remainder of sum($seq!(position() * .)) upon
> division by a suitably large number (an extension of the typical UPC
> checksum algorithm).
> 
> On Fri, Jun 10, 2016 at 12:51 AM, Dimitre Novatchev
> dnovatchev@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> > You may even not need a hash function.
> >
> > Just use the standard XPath 3.0 function:
> >
> >   serialize()
> >
> >
> > http://www.w3.org/TR/xpath-functions-30/#func-serialize
> >
> >
> > Cheers,
> > Dimitre
> >
> > On Thu, Jun 9, 2016 at 3:08 PM, Graydon graydon@xxxxxxxxx
> > <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> >> Hello all --
> >>
> >> So I've got about half a gigabyte of XML messages describing various
> >> health care actions.  Many of these are structural duplicates of each
> >> other; the top elements differ by their attribute values, but the
> >> structure and values of the descendant elements are the same.  The amount
> >> of duplication varies from none to thousands.
> >>
> >> I've got an apparently useful heuristic based on descendant attribute
> >> values, but would -- it is health care data -- really like to have a
> >> more robust way to group the elements into sets of equivalent top-level
> >> names by their structural sameness.  (I can't hand-check the whole data
> >> set.)
> >>
> >> So I find myself wanting an equivalent of sha256sum for elements so I
> >> could generate a grouping key from the descendant elements and their
> >> associated attributes as a unit.
> >>
> >> Is there such a thing?  Equivalent approaches?
> >>
> >> Thanks!
> >> Graydon
> >>