RE: Re: [xsl] What is a better word for "de-duplication"?

Subject: RE: Re: [xsl] What is a better word for "de-duplication"?
From: cknell@xxxxxxxxxx
Date: Mon, 28 Aug 2006 19:26:02 -0400
All sorts of terms with ambiguous or impenetrable meanings don't help. They muddy the water. A tool need not be pretty to be useful. Is there any doubt about the meaning of "de-duplication"? Not from where I sit.
-- 
Charles Knell
cknell@xxxxxxxxxx - email



-----Original Message-----
From:     Andrew Franz <afranz0@xxxxxxxxxxxxxxxx>
Sent:     Tue, 29 Aug 2006 08:12:40 +1000
To:       xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject:  Re: [xsl] What is a better word for "de-duplication"?

Wendell Piez wrote:

> At 03:33 PM 8/28/2006, Andrew wrote:
>
>> Wendell Piez wrote:
>>
>>> Dear Dimitre,
>>>
>>> At 08:41 PM 8/27/2006, you wrote:
>>>
>>>> I want to use a single, short word to express the act of removing
>>>> duplicates from a node-set. I remember seeing the word "de-duplication"
>>>> used, however it sounds ugly.
>>>
>>>
>> Normalisation
>
>
> Normalization (or 'normalisation' for those who prefer British 
> orthography) would rather be the general process of transforming a set 
> of values into their normalized forms. So,
>
> <date value="2006">May Day 2006</date>
> <date value="2006-05-01"/>
> <date value="5-1-2006">May 1 2006</date>
>
> might be normalized as
>
> <date value="2006-05-01">May 1 2006</date>
> <date value="2006-05-01">May 1 2006</date>
> <date value="2006-05-01">May 1 2006</date>
>
> but this would not deduplicate them.
>
> These are very different problems, especially for XSLT. Generally 
> speaking, deduplicating requires normalization first since 
> deduplication works only over canonical forms (or comparing them to 
> see which are duplicates becomes very difficult).
>
> Cheers,
> Wendell
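
A minimal sketch of the distinction drawn in the quoted message, written in Python rather than XSLT: the values are first normalised to one canonical form, and only then de-duplicated. The helper names and the list of accepted input formats are assumptions made for illustration; they do not come from the thread.

```python
from datetime import datetime

# Hypothetical list of accepted input formats (an assumption for this sketch).
FORMATS = ("%Y-%m-%d", "%m-%d-%Y", "%B %d %Y")

def normalize_date(raw):
    """Return the ISO 8601 form of raw, trying each known format in turn."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError("unrecognised date: %r" % raw)

def deduplicate(values):
    """Drop repeated values, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for value in values:
        if value not in seen:
            seen.add(value)
            unique.append(value)
    return unique

raw_dates = ["2006-05-01", "5-1-2006", "May 1 2006"]

# Step 1: normalisation gives three identical canonical values.
normalized = [normalize_date(d) for d in raw_dates]

# Step 2: de-duplication collapses them into one.
unique_dates = deduplicate(normalized)
```

Calling deduplicate(raw_dates) directly leaves all three strings in place, because they differ as text until they have been normalised.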

Yes, this is one meaning of 'normalisation'. But 'normalisation' is 
richer and deeper than that. Think about relational database theory.

2NF = A relation is in 2NF if it is in 1NF and every non-key 
attribute is fully dependent on each candidate key of the relation.

In the above example:
    <date value="2006">May Day 2006</date>
    <date value="2006-05-01"/>
    <date value="5-1-2006">May 1 2006</date>
becomes:
    <standardDate id="x" year="2006" month="5" day="1" />
    plus:
    <date id="x" format="t yyyy">May Day</date>
    <date id="x" format="yyyy-mm-dd" />
    <date id="x" format="Mmm dd yyyy" />
I submit that these are *not* the same. In your example, you simply 
removed the 'inconvenient' differences.
In database normalisation, the commonalities are "normalised" or 
"factored" out as a basis for comparison.
In this process (applied to XSLT perhaps), <date> has been 
"de-duplicated" into <standardDate> but there is no loss of information.
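
A rough sketch of this factoring idea, in Python rather than XSLT: the shared date is stored once under an id, and each occurrence keeps only that id plus its own presentation format, so the original strings can be rebuilt. The dictionary layout, the template syntax and the render helper are assumptions for illustration, and the "May Day" occurrence is left out because it needs a holiday lookup.

```python
import calendar

# The canonical value, factored out once and keyed by a hypothetical id "x".
standard_dates = {"x": {"year": 2006, "month": 5, "day": 1}}

# Each occurrence keeps its id and a presentation template (assumed syntax).
occurrences = [
    {"id": "x", "format": "{y}-{m:02d}-{d:02d}"},
    {"id": "x", "format": "{m}-{d}-{y}"},
    {"id": "x", "format": "{mon} {d} {y}"},
]

def render(occurrence):
    """Rebuild the original text from the shared date and the local format."""
    std = standard_dates[occurrence["id"]]
    return occurrence["format"].format(
        y=std["year"],
        m=std["month"],
        d=std["day"],
        mon=calendar.month_name[std["month"]],
    )

rendered = [render(o) for o in occurrences]
```

Under these assumptions only one standard date is stored, yet every original spelling is still recoverable from its format.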

Why invent new terminology?
