Re: Removing duplicate elements a-priori?

Subject: Re: Removing duplicate elements a-priori?
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Sun, 18 Jun 2000 10:14:22 +0100

Gordon,

>	I'm trying to remove duplicate elements from an output document.  I've 
>seen the examples for doing this in the archives, but they all seem to 
>assume some knowledge of the structure of the input document.  For example:
>
>	<xsl:for-each select="item[not(.=preceding-sibling::item)]"> 	
>	  <xsl:sort/>
>
>	But with this you need to know that there is an element named "item" in 
>the input.

Just to clarify, the example that you quote:

	For example I want to take :

	<doc>
	<employee>Bill</employee>
	<employee>Andy</employee>
	<employee>John</employee>	
	</doc>

	And produce just :

	<doc>
	<employee>Bill</employee>
	</doc>

doesn't involve removing duplicate items - it involves identifying the first element of a certain type within a document.

I'm going to assume you *were* actually referring to removing duplicate elements and, to make the answer more general and more accurate, I'm also going to assume that you have a number of different elements within your content.  Finally, I'm going to assume that you do know that it is the content of the element that makes it a duplicate (rather than the value of an attribute, say).  So you want to take something like:

	<doc>
	<employee>Bill</employee>
	<employee>Andy</employee>
	<director>Amy</director>
	<employee>Bill</employee>
	<director>Louise</director>
	<director>Louise</director>
	<employee>Bill</employee>
	<employee>Andy</employee>
	<employee>John</employee>
	<director>Amy</director>
	<director>Louise</director>
	</doc>

To produce something like:

	<doc>
	<employee>Bill</employee>
	<employee>Andy</employee>
	<director>Amy</director>
	<director>Louise</director>
	<employee>John</employee>
	</doc>

Rather than using the preceding-sibling axis, I'm going to use the Muenchian technique to identify the first unique elements, because it's a lot easier to use in this case, as well as being more efficient generally.
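
For comparison, a preceding-sibling version that doesn't rely on knowing the element names might look roughly like the sketch below.  It is a minimal, untested illustration that assumes all the elements are children of doc, as in the sample.  The test has to sit inside a template rather than in a select expression, so that current() refers to the element being tested:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:template match="doc">
    <xsl:copy>
      <xsl:apply-templates select="*" />
    </xsl:copy>
  </xsl:template>

  <!-- copy a child only if no earlier sibling has the same name and content -->
  <xsl:template match="doc/*">
    <xsl:if test="not(preceding-sibling::*[name() = name(current())
                                           and . = current()])">
      <xsl:copy-of select="." />
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Every element is compared against all of its preceding siblings here, which is where the extra cost comes from on larger documents.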

First, define a key so that you can index on the unique features of the particular elements that you want.  In this case, there are two unique features: the name of the element, and the content of the element.  To make a key that includes both, I'm concatenating these two bits of information together (with a separator to guard against odd cases where different element/content combinations would otherwise generate the same key):

<xsl:key name="elements" match="*" use="concat(name(), '::', .)" />

So all the <employee>Bill</employee> elements are indexed under 'employee::Bill'.  The unique elements are those that appear first in the list of elements that are indexed by the same key.  Identifying those involves testing to see whether the node you're currently looking at is the same node as the first node in the list that is indexed by the key for the node.  So if the <employee>Bill</employee> node that we're looking at is the first one in the list that we get when we retrieve the 'employee::Bill' nodes from the 'elements' key, then we know it hasn't been processed before.

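<xsl:template match="doc">
  <!-- copy the <doc> wrapper itself so the output matches the sample above -->
  <xsl:copy>
    <!-- keep only the elements that come first in their name::content group -->
    <xsl:for-each select="*[generate-id(.) =
        generate-id(key('elements', concat(name(), '::', .))[1])]">
      <xsl:copy-of select="." />
    </xsl:for-each>
  </xsl:copy>
</xsl:template>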

This is tested and gives the desired output for the sample input in SAXON.
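
Put together, the key and the template could sit in a complete stylesheet along the lines of the sketch below.  The xsl:output setting is an assumption added for readability, and the xsl:copy around the loop is there so that the doc element itself appears in the result:

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- assumption: indented XML output, purely for readability -->
  <xsl:output method="xml" indent="yes" />

  <!-- index every element under "name::content" -->
  <xsl:key name="elements" match="*" use="concat(name(), '::', .)" />

  <xsl:template match="doc">
    <!-- xsl:copy recreates the <doc> wrapper in the result -->
    <xsl:copy>
      <!-- select only the children that come first in their key group -->
      <xsl:for-each select="*[generate-id(.) =
          generate-id(key('elements', concat(name(), '::', .))[1])]">
        <xsl:copy-of select="." />
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Run against the sample input with an XSLT 1.0 processor, this should give the five unique elements inside doc, in the order shown in the desired output above.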

I hope this helps,

Jeni

Jeni Tennison
http://friday.u-net.com/jeni/

