Re: [xsl] Identical entries in different input documents should appear in the output document only once

Subject: Re: [xsl] Identical entries in different input documents should appear in the output document only once
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Fri, 07 Sep 2007 21:54:59 +0200
Hi Roland,

At first, I thought it wasn't so trivial. But when I tried to implement it, I quickly found out that your request is actually quite easy. From what I understand, you want to process the 15 documents one by one. If you don't do that, and process them all at once instead, you have two viable options in XSLT 1.0:

1. Process twice: in the first pass, write all identifiers to one file, each with the name of the file it first appeared in. In the second pass, use normal Muenchian grouping on that identifier list to dedup it and, in the same run, use the file names to reopen the sources and select only the blocks with the matching identifier.

2. Process them all at once, using the node-set extension function from EXSLT (or, if you use a Microsoft processor, msxsl:node-set) together with Muenchian grouping.

If your files are really large, either option may pose a memory problem. The second option may also cost quite a bit of performance, because it has to run the node-set conversion over all documents. I suggest you test it on a small set first and then try a larger one. I tried option 2, because that seemed the easiest to implement. In the example below I use a parameter for the input files (set in the XSLT for ease of testing). You probably have your own preferred way to get all documents through the pipeline. Come to think of it, if all you need are copies of these blocks, it almost looks simpler than an attempt in XSLT 2.0 with for-each-group (that's the first time ever I say something like that, and perhaps the only and last time too ;)
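For comparison, here is a minimal sketch of what the XSLT 2.0 route with for-each-group might look like. It is untested and assumes an XSLT 2.0 processor (for example Saxon) and the same block/idTag structure as in the question. The parameter name "files" and the file names are placeholders chosen for this illustration.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="2.0">

  <xsl:output indent="yes" />

  <!-- Assumption: a sequence of URIs, resolved relative to the stylesheet -->
  <xsl:param name="files" select="(
      'muenchian-multipledocs1.xml',
      'muenchian-multipledocs2.xml',
      'muenchian-multipledocs3.xml',
      'muenchian-multipledocs4.xml')" />

  <xsl:template match="/" name="main">
    <root>
      <!-- Group every block of every document by the value of its idTag -->
      <xsl:for-each-group
          select="for $f in $files return doc($f)//block"
          group-by="idTag">
        <!-- Keep only the first block seen for each identifier -->
        <xsl:copy-of select="current-group()[1]" />
      </xsl:for-each-group>
    </root>
  </xsl:template>

</xsl:stylesheet>
```

No node-set extension and no key are needed here, since for-each-group works directly on a sequence of nodes taken from several documents.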

Unfortunately, XPath 1.0 does not allow comments inside an expression. For clarity, here is a short explanation of the "core" of the little XSLT stylesheet further down, annotated with XPath 2.0-style comments:

   (: node set of all documents :)
   exslt:node-set($all-input)

   (: all idTag nodes :)
   //idTag

   (: muenchian grouping :)
   [generate-id(.) = generate-id(key('idtag', .)[1])]

   (: get the parent block:)
   /..


The stylesheet:
---------------

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exslt="http://exslt.org/common"
    version="1.0">

  <xsl:key name="idtag" match="idTag" use="." />
  <xsl:output indent="yes" />

  <xsl:param name="input">
    <file href="muenchian-multipledocs1.xml" />
    <file href="muenchian-multipledocs2.xml" />
    <file href="muenchian-multipledocs3.xml" />
    <file href="muenchian-multipledocs4.xml" />
  </xsl:param>

  <xsl:template match="/" name="main">
    <xsl:variable name="all-input">
      <xsl:apply-templates select="exslt:node-set($input)/*" />
    </xsl:variable>
    <root>
      <xsl:copy-of select="
          exslt:node-set($all-input)
          //idTag
          [generate-id(.) = generate-id(key('idtag', .)[1])]
          /.." />
    </root>
  </xsl:template>

  <xsl:template match="file">
    <xsl:copy-of select="document(@href)" />
  </xsl:template>

</xsl:stylesheet>
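
If the same stylesheet has to run on both an EXSLT-aware processor and a Microsoft one, the following variant is a rough, untested sketch of how that might be done. It uses the standard function-available() test to choose between exslt:node-set and msxsl:node-set. The msxsl prefix binding, the exclude-result-prefixes attribute and the error message are my additions for this illustration, and the file list is shortened.

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exslt="http://exslt.org/common"
    xmlns:msxsl="urn:schemas-microsoft-com:xslt"
    exclude-result-prefixes="exslt msxsl"
    version="1.0">

  <xsl:key name="idtag" match="idTag" use="." />
  <xsl:output indent="yes" />

  <xsl:param name="input">
    <file href="muenchian-multipledocs1.xml" />
    <file href="muenchian-multipledocs2.xml" />
  </xsl:param>

  <xsl:template match="/" name="main">
    <!-- Load all documents into one result tree fragment -->
    <xsl:variable name="all-input">
      <xsl:choose>
        <xsl:when test="function-available('exslt:node-set')">
          <xsl:apply-templates select="exslt:node-set($input)/*" />
        </xsl:when>
        <xsl:when test="function-available('msxsl:node-set')">
          <xsl:apply-templates select="msxsl:node-set($input)/*" />
        </xsl:when>
        <xsl:otherwise>
          <xsl:message terminate="yes">No node-set() extension function available.</xsl:message>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <root>
      <!-- Same Muenchian expression as above, once per extension function -->
      <xsl:choose>
        <xsl:when test="function-available('exslt:node-set')">
          <xsl:copy-of select="exslt:node-set($all-input)//idTag
              [generate-id(.) = generate-id(key('idtag', .)[1])]/.." />
        </xsl:when>
        <xsl:otherwise>
          <xsl:copy-of select="msxsl:node-set($all-input)//idTag
              [generate-id(.) = generate-id(key('idtag', .)[1])]/.." />
        </xsl:otherwise>
      </xsl:choose>
    </root>
  </xsl:template>

  <xsl:template match="file">
    <xsl:copy-of select="document(@href)" />
  </xsl:template>

</xsl:stylesheet>
```

The branches that are not taken are never evaluated, so an XSLT 1.0 processor should not complain about the extension function it does not know.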


Have fun with it!


Cheers,
-- Abel Braaksma


Meyer, Roland 1. (NSN - DE/Germany - MiniMD) wrote:
Hi,

I have the following problem. I have a couple of XML documents to merge
to one output document. Each document has the same structure like this:


<root>
  <block>
    <oneTag>some value</oneTag>
    <anotherTag>another value</anotherTag>
     ...
    <idTag>setId-itemId</idTag>
  </block>
  <block>
     ...
  </block>
   ...
</root>

I have to interpret the value in the idTag (the setId-itemId) as an
identifier for the complete structure between the block tags.
Within one document this identifying value comes only once, but the same
identifying value can be found in different documents.

What I now need:
My output file should list each block only once. That is, although the same
identifying value is present in different input documents, it should
appear only once in the output document.

I can think about some heavy procedures by checking every found
identifier value in the already processed files (because then they are
already written to the output), but this will be very time consuming (I
have around 15 files with each up to 10000 blocks).


Is there any other and simpler way to - let's say - memorize the already
written blocks resp. identifiers?

Best Regards,
Roland
