Hi Roland,
At first I thought this wouldn't be so trivial, but when I tried to
implement it I quickly found that your request is actually quite easy.
From what I understand, you want to process the 15 documents one by
one. If you process them all at once instead, you have two viable
options in XSLT 1.0:
1. Process twice: first output all identifiers to one file, each with
the name of the file it appeared in. Then process that file of
identifiers, using normal Muenchian grouping to dedup them and, in the
same run, using the filenames to reopen the sources and select only the
blocks with the correct identifier.
2. Process them all at once using the node-set extension function from
EXSLT (or, if you use a Microsoft processor, msxsl:node-set) and use
Muenchian grouping.
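For completeness, here is a rough, untested sketch of what option 1 could look like. Everything in it that is not in the stylesheet further down is an assumption made for illustration: the file list as source document (<files> with <file href="..."/> children), the intermediate element names id-index and id, and the file attribute. The first pass writes every identifier together with the file it was found in:

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <xsl:output indent="yes" />
  <!-- assumed source document: <files><file href="..."/>...</files> -->
  <xsl:template match="/files">
    <id-index>
      <xsl:for-each select="file">
        <xsl:variable name="href" select="@href" />
        <!-- one <id> per idTag, remembering the file it came from -->
        <xsl:for-each select="document($href)//idTag">
          <id file="{$href}"><xsl:value-of select="." /></id>
        </xsl:for-each>
      </xsl:for-each>
    </id-index>
  </xsl:template>
</xsl:stylesheet>
```

The second pass takes that index as its source document, keeps the first id per value with plain Muenchian grouping (no extension function needed, since the index is a real document), and reopens the file recorded on it:

```xml
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <xsl:output indent="yes" />
  <xsl:key name="id" match="id" use="." />
  <!-- source document: the id-index written by the first pass -->
  <xsl:template match="/id-index">
    <root>
      <!-- first <id> in document order for each distinct value -->
      <xsl:for-each
          select="id[generate-id(.) = generate-id(key('id', .)[1])]">
        <xsl:variable name="value" select="string(.)" />
        <!-- reopen the recorded file and copy the matching block -->
        <xsl:copy-of select="document(@file)//block[idTag = $value]" />
      </xsl:for-each>
    </root>
  </xsl:template>
</xsl:stylesheet>
```

This sketch assumes the index is stored where the relative file names still resolve, because document(@file) resolves them against the location of the index document.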
If your files are really large, either option may pose a memory problem.
The second option may also take quite a performance hit, because it has
to do the node-set conversion on all documents. I suggest you test it
first on a small set and then try a larger one. I tried option 2,
because that seemed the easiest to implement. In the example below I use
a parameter for the input files (set in the XSLT for ease of testing).
You probably have your own preferred way to get all documents through
the pipeline. Come to think of it, if all you need are copies of these
blocks, it almost looks simpler than an attempt in XSLT 2.0 with
for-each-group (that's the first time ever I say something like that,
and perhaps the only and last time too ;)
Unfortunately, XPath 1.0 has no syntax for comments inside an
expression. For clarity, here is the "core" expression of the little
XSLT stylesheet further down, annotated with XPath 2.0-style comments.
The comments are for explanation only and must not appear in the real
expression.
(: node set of all documents :)
exslt:node-set($all-input)
(: all idTag nodes :)
//idTag
(: muenchian grouping :)
[generate-id(.) = generate-id(key('idtag', .)[1])]
(: get the parent block :)
/..
The stylesheet:
---------------
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exslt="http://exslt.org/common"
    version="1.0">
  <!-- index every idTag by its string value -->
  <xsl:key name="idtag" match="idTag" use="." />
  <xsl:output indent="yes" />
  <!-- list of input files, set here for ease of testing -->
  <xsl:param name="input">
    <file href="muenchian-multipledocs1.xml" />
    <file href="muenchian-multipledocs2.xml" />
    <file href="muenchian-multipledocs3.xml" />
    <file href="muenchian-multipledocs4.xml" />
  </xsl:param>
  <xsl:template match="/" name="main">
    <!-- load all input documents into one result tree fragment -->
    <xsl:variable name="all-input">
      <xsl:apply-templates select="exslt:node-set($input)/*" />
    </xsl:variable>
    <root>
      <!-- copy the parent block of the first idTag per distinct value -->
      <xsl:copy-of select="
          exslt:node-set($all-input)
          //idTag
          [generate-id(.) = generate-id(key('idtag', .)[1])]
          /.."/>
    </root>
  </xsl:template>
  <!-- replace each file entry by the document it points to -->
  <xsl:template match="file">
    <xsl:copy-of select="document(@href)" />
  </xsl:template>
</xsl:stylesheet>
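To see what the stylesheet does, here is a small made-up test case. The file contents and values are invented for illustration, and it assumes the input parameter is cut down to just these two files. Suppose muenchian-multipledocs1.xml and muenchian-multipledocs2.xml contain, respectively:

```xml
<root>
  <block><oneTag>a</oneTag><idTag>set1-item1</idTag></block>
  <block><oneTag>b</oneTag><idTag>set1-item2</idTag></block>
</root>
```

```xml
<root>
  <block><oneTag>c</oneTag><idTag>set1-item2</idTag></block>
  <block><oneTag>d</oneTag><idTag>set2-item1</idTag></block>
</root>
```

The identifier set1-item2 occurs in both files, so only its first block (the one from the first file) should be kept. Apart from whitespace and namespace declarations, which depend on the processor, the result should look like this:

```xml
<root>
  <block><oneTag>a</oneTag><idTag>set1-item1</idTag></block>
  <block><oneTag>b</oneTag><idTag>set1-item2</idTag></block>
  <block><oneTag>d</oneTag><idTag>set2-item1</idTag></block>
</root>
```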
Have fun with it!
Cheers,
-- Abel Braaksma
Meyer, Roland 1. (NSN - DE/Germany - MiniMD) wrote:
Hi,
I have the following problem. I have a couple of XML documents to merge
into one output document.
Each document has the same structure, like this:
<root>
<block>
<oneTag>some value</oneTag>
<anotherTag>another value</anotherTag>
...
<idTag>setId-itemId</idTag>
</block>
<block>
...
</block>
...
</root>
I have to interpret the value in the idTag (the setId-itemId) as an
identifier for the complete structure between the block tags.
Within one document this identifying value occurs only once, but the
same identifying value can be found in different documents.
What I now need:
My output file should list each block only once. That is, although the
same identifying value is present in different input documents, it
should appear only once in the output document.
I can think of some heavy procedures, such as checking every identifier
value found against the already processed files (because those are
already written to the output), but this will be very time consuming (I
have around 15 files, each with up to 10000 blocks).
Is there any other and simpler way to - let's say - memorize the already
written blocks or identifiers?
Best Regards,
Roland