Re: [xsl] [Summary] Three ways to express in XPath that there are no duplicates in a list of items

Subject: Re: [xsl] [Summary] Three ways to express in XPath that there are no duplicates in a list of items
From: Dimitre Novatchev <dnovatchev@xxxxxxxxx>
Date: Fri, 2 Nov 2012 06:01:06 -0700
> count(Websites/*) = count(distinct-values(Websites/*))

A more efficient version (for an XPath processor with weak optimizer)
of this is:

not(Websites/*[count(distinct-values(Websites/*))+1])

So, we only count up to:
     count(distinct-values(Websites/*))+1

and don't need to count all children of Websites and then compare them
to the count of distinct values.

This is the same as Dirichlet's principle (aka "pigeonhole principle"):

If n items are put into m pigeonholes with n > m, then at least one
pigeonhole must contain more than one item.


Cheers,
Dimitre

On Fri, Nov 2, 2012 at 2:33 AM, Liam R E Quin <liam@xxxxxx> wrote:
> On Fri, 2012-11-02 at 09:15 +0000, Costello, Roger L. wrote:
> [...]
>> Here's how to implement it in XPath 1.0 and in XPath 2.0.
>>
>> XPath 1.0:
>>
>>       not(Websites/*[. = preceding-sibling::*])
>>
>> XPath 2.0:
>>
>>       empty(Websites/*[index-of(../*,.)[2]])
>>       count(Websites/*) = count(distinct-values(Websites/*))
>>
>> The preferred XPath is the last one because it has the best
>> performance. The first two take on the order of n-squared time (where
>> n is the number of websites in the list) whereas the last XPath
>> expression takes on the order of n log n time.
>
> Some brief comments on this...
> (1) the XPath 1 solution also works in later versions of XPath;
> (2) this only works to test simple text content
> (3) an XPath 2 (or later) processor can rewrite the first or second
> expression if it likes (and is smart enough) into the third
> (4) the last expression can be implemented in linear time, not n log n,
> because it's not not necessary to sort values to see if they are
> distinct (e.g. the implementation can use a hash table), although this
> does use more memory. But for simple content the hash table approach
> doesn't even need to use more memory, can just point into the document
> tree.
>
> So statements about performance need to be with respect to a particular
> version of a specific product, in a specific environment.
>
> And,
> (5) I'm sure these aren't the only ways to see if all items are
> distinct :-)
>
> You can start getting pretty fancy,
> some $w in Websites/Website satisfies count(Websites/Website[. eq $w])
> gt 1,
> or flwor expressions in XQuery,
> or use a key or (XPath in XSLT 3) a map, or.... :)
>
> Liam
>
>
> --
> Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
> Pictures from old books: http://fromoldbooks.org/
> Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
>



-- 
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
Never fight an inanimate object
-------------------------------------
To avoid situations in which you might make mistakes may be the
biggest mistake of all
------------------------------------
Quality means doing it right when no one is looking.
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play
-------------------------------------
Facts do not cease to exist because they are ignored.
-------------------------------------
Typing monkeys will write all Shakespeare's works in 200yrs.Will they
write all patents, too? :)
-------------------------------------
I finally figured out the only reason to be alive is to enjoy it.

Current Thread