Subject: RE: [xsl] deduplicating information in XML files
From: Robby Pelssers <Robby.Pelssers@xxxxxxx>
Date: Fri, 12 Oct 2012 16:35:15 +0200

Did I ever mention you guys rock!!! Thx Ken and Wendell for yet again
replying to a not-so-trivial problem.

I really need to get my Outlook settings fixed... my replies bounce back
too often, so my gratitude does not always arrive ;-(

Robby

-----Original Message-----
From: G. Ken Holman [mailto:g.ken.holman@xxxxxxxxx] On Behalf Of G. Ken Holman
Sent: Friday, October 12, 2012 4:21 PM
To: XSL List
Subject: Re: [xsl] deduplicating information in XML files

At 2012-10-12 14:02 +0200, Robby Pelssers wrote:
>Hi all,
>
>This time I have a rather challenging task at hand. Let me first
>describe the use case. We have lots of product information stored in
>XML. Some of that information describes
> - Technical applications
> - Features and benefits
> - Technical summary
>
>One of the problems is a lot of products had e.g. the same features and
>benefits as they are of the same product family or group. But as we
>stored that info per product it got duplicated. Now we want to
>deduplicate that info by generating DITA maps and topics (both are just
>XML). Now for simplicity let's assume we generate the following
>content for product1 and product2. The goal is to get from INPUT to
>OUTPUT by checking if the body of the linked topics are duplicates,
>next create 1 generic topic and rewrite the links in the map to point
>to that single topic. I have XSLT / XQuery
>(XMLDB) and Java at my disposal to get the job done. I'm not sure what
>will be the easiest way to get the job done. Keep also in mind that my
>INPUT will contain a few 1000 files (maps and linked topics) and I will
>need to deduplicate the whole set ;-)
>
>Thx upfront for any input,
>Robby
>
>INPUT
>
>Product1_map.xml
><map>
>   <features-benefits-ref href="features-benefits/Product1_FandB.xml"/>
></map>
>
>Product1_FandB.xml:
><content>
>   <meta>
>     <id>product1</id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching
>characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
></content>
>
>Product2_map.xml
><map>
>   <features-benefits-ref href="features-benefits/Product2_FandB.xml"/>
></map>
>
>Product2_FandB.xml:
><content>
>   <meta>
>     <id>product2</id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching
>characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
></content>
>
>Expected output:
>
>Product1_map.xml
><map>
>   <features-benefits-ref href="features-benefits/FandB_1.xml"/>
></map>
>
>Product2_map.xml
><map>
>   <features-benefits-ref href="features-benefits/FandB_1.xml"/>
></map>
>
>FandB_1.xml:
><content>
>   <meta>
>     <id><!-- can become empty --></id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching
>characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
></content>

I hope the complete solution below in XSLT is helpful. I see that
Wendell posted while I was working on this, and I like his idea of using
the collection() function rather than my hardwired map of maps. I'll
leave that with you as an exercise. You can also tweak the file name
generation as you need. Oh, and I also added some additional data.

I was really curious about this solution. In the classroom I teach the
three methods of grouping in XSLT 1: by axes, by keys and by variables.
When I talk about XSLT 2 I claim (or used to claim!) that these methods
were no longer needed. But ...
I had to use the variable method in XSLT 2 in order to solve your
requirement! So I'll have to change my classroom materials to reflect
this.

The reason I had to use the variable-based grouping method is that the
XSLT 2 <xsl:for-each-group>'s group-by= attribute is based on the value
calculated, not on the structure. I had to use deep-equal() in order to
determine if the structure was the same. So that ruled out
<xsl:for-each-group>.

So I instantly turned to the XSLT 1 variable-based method in order to
work across documents with an arbitrary calculation of equality, knowing
that the shape of the solution would give me what I wanted.

I think this is directly translatable to XQuery, and so I will post such
a solution to that list.

Good luck!

. . . . . . . . Ken

t:\ftemp\robby>type robby.xml
<?xml version="1.0" encoding="UTF-8"?>
<maps>
  <map href="Product1_map.xml"/>
  <map href="Product2_map.xml"/>
  <map href="Product3_map.xml"/>
  <map href="Product4_map.xml"/>
  <map href="Product5_map.xml"/>
</maps>

t:\ftemp\robby>type Product1_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product1_FandB.xml"/>
</map>

t:\ftemp\robby>type Product2_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product2_FandB.xml"/>
</map>

t:\ftemp\robby>type Product3_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product3_FandB.xml"/>
</map>

t:\ftemp\robby>type Product4_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product4_FandB.xml"/>
</map>

t:\ftemp\robby>type Product5_map.xml
<map>
  <features-benefits-ref href="features-benefits/Product5_FandB.xml"/>
</map>

t:\ftemp\robby>dir /s features-benefits
 Volume in drive T is VBOX_t
 Volume Serial Number is 0E00-0002

 Directory of t:\ftemp\robby\features-benefits

2012-10-12  08:37               235 Product1_FandB.xml
2012-10-12  08:37               235 Product2_FandB.xml
2012-10-12  08:38               286 Product3_FandB.xml
2012-10-12  08:38               285 Product4_FandB.xml
2012-10-12  08:38               285 Product5_FandB.xml
               5 File(s)          1,326 bytes

     Total Files Listed:
               5 File(s)          1,326 bytes
               0 Dir(s)  16,795,488,256 bytes free

t:\ftemp\robby>type features-benefits\Product1_FandB.xml
<content>
  <meta>
    <id>product1</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>

t:\ftemp\robby>type features-benefits\Product2_FandB.xml
<content>
  <meta>
    <id>product2</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
  </body>
</content>

t:\ftemp\robby>type features-benefits\Product3_FandB.xml
<content>
  <meta>
    <id>product3</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
    <p>With additional text that is different</p>
  </body>
</content>

t:\ftemp\robby>type features-benefits\Product4_FandB.xml
<content>
  <meta>
    <id>product4</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
    <p>With additional text that is the same</p>
  </body>
</content>

t:\ftemp\robby>type features-benefits\Product5_FandB.xml
<content>
  <meta>
    <id>product5</id>
  </meta>
  <body>
    <p>Suitable for high frequency applications due to fast switching characteristics</p>
    <p>Suitable for logic level gate drive sources</p>
    <p>With additional text that is the same</p>
  </body>
</content>

t:\ftemp\robby>call xslt2 robby.xml robby.xsl out\robbyout.xml

t:\ftemp\robby>dir \s out
 Volume in drive T is VBOX_t
 Volume Serial Number is 0E00-0002

 Directory of t:\

 Directory of t:\ftemp\robby\out

2012-10-12  10:02    <DIR>          features-benefits
2012-10-12  10:14                94 Product1_map.xml
2012-10-12  10:14                94 Product2_map.xml
2012-10-12  10:14                84 Product3_map.xml
2012-10-12  10:14                94 Product4_map.xml
2012-10-12  10:14                94 Product5_map.xml
2012-10-12  10:14               371 robbyout.xml
               6 File(s)          1,001 bytes
               1 Dir(s)  16,795,488,256 bytes free

t:\ftemp\robby>type out\robbyout.xml
<?xml version="1.0" encoding="UTF-8"?>
<maps><!--features-benefits/Product1_FandB.xml.group.xml-->
   <map href="Product1_map.xml"/>
   <map href="Product2_map.xml"/>
   <!--features-benefits/Product3_FandB.xml-->
   <map href="Product3_map.xml"/>
   <!--features-benefits/Product4_FandB.xml.group.xml-->
   <map href="Product4_map.xml"/>
   <map href="Product5_map.xml"/>
</maps>

t:\ftemp\robby>type out\Product1_map.xml
<map>
   <features-benefits-ref href="features-benefits/Product1_FandB.xml.group.xml"/>
</map>

t:\ftemp\robby>type out\Product2_map.xml
<map>
   <features-benefits-ref href="features-benefits/Product1_FandB.xml.group.xml"/>
</map>

t:\ftemp\robby>type out\Product3_map.xml
<map>
   <features-benefits-ref href="features-benefits/Product3_FandB.xml"/>
</map>

t:\ftemp\robby>type out\Product4_map.xml
<map>
   <features-benefits-ref href="features-benefits/Product4_FandB.xml.group.xml"/>
</map>

t:\ftemp\robby>type out\Product5_map.xml
<map>
   <features-benefits-ref href="features-benefits/Product4_FandB.xml.group.xml"/>
</map>

t:\ftemp\robby>type out\features-benefits\Product1_FandB.xml.group.xml
<content>
   <meta>
      <id>
<!-- - features-benefits/Product1_FandB.xml-->
<!-- - features-benefits/Product2_FandB.xml-->
</id>
   </meta>
   <body>
      <p>Suitable for high frequency applications due to fast switching characteristics</p>
      <p>Suitable for logic level gate drive sources</p>
   </body>
</content>

t:\ftemp\robby>type out\features-benefits\Product3_FandB.xml
<content>
   <meta>
      <id/>
   </meta>
   <body>
      <p>Suitable for high frequency applications due to fast switching characteristics</p>
      <p>Suitable for logic level gate drive sources</p>
      <p>With additional text that is different</p>
   </body>
</content>

t:\ftemp\robby>type out\features-benefits\Product4_FandB.xml.group.xml
<content>
   <meta>
      <id>
<!-- - features-benefits/Product4_FandB.xml-->
<!-- - features-benefits/Product5_FandB.xml-->
</id>
   </meta>
   <body>
      <p>Suitable for high frequency applications due to fast switching characteristics</p>
      <p>Suitable for logic level gate drive sources</p>
      <p>With additional text that is the same</p>
   </body>
</content>

t:\ftemp\robby>type robby.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

<xsl:output indent="yes"/>

<xsl:template match="maps">
  <xsl:variable name="maps" select="map"/>
  <!--walk across all maps, acting on the first one that has unique content-->
  <maps>
    <xsl:for-each select="$maps">
      <xsl:variable name="map-href" select="@href"/>
      <!--
      <xsl:message select="$map-href"/>
      <xsl:message
        select="generate-id(doc(doc(@href)/*/features-benefits-ref/@href))"/>
      <xsl:message select="count(
        $maps[deep-equal(doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
          doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body)])"/>
      <xsl:message select="
        $maps[deep-equal(doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
          doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body)]/generate-id(.)"/>
      -->
      <xsl:if test="generate-id(.)=generate-id(
        $maps[deep-equal(doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
          doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body)][1])">
        <!--found the first one of the group with this body content-->
        <xsl:variable name="current-group" select="$maps[
          deep-equal(doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
            doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body)]"/>
        <xsl:variable name="count-current-group"
                      select="count($current-group)"/>
        <xsl:variable name="new-file-href"
          select="concat(doc($map-href)/*/features-benefits-ref/@href,
                         if( $count-current-group=1 ) then ''
                                                      else '.group.xml' )"/>
        <!--just for information, note this in the result map of maps-->
        <xsl:comment select="$new-file-href"/><xsl:text>
</xsl:text>
        <xsl:for-each select="$current-group">
          <!--reference the map file-->
          <map href="{@href}"/>
          <!--recreate the map file-->
          <xsl:result-document href="{@href}" omit-xml-declaration="yes">
            <map>
              <features-benefits-ref href="{$new-file-href}"/>
            </map>
          </xsl:result-document>
        </xsl:for-each>
        <!--recreate the content file-->
        <xsl:result-document href="{$new-file-href}"
                             omit-xml-declaration="yes">
          <content>
            <meta>
              <id>
                <xsl:choose>
                  <xsl:when test="$count-current-group=1">
                    <xsl:copy-of select="node()"/>
                  </xsl:when>
                  <xsl:otherwise>
                    <xsl:for-each select="$current-group">
                      <xsl:text>
</xsl:text>
                      <xsl:comment select="string(.), '-',
                        doc(@href)/*/features-benefits-ref/@href"/>
                    </xsl:for-each>
                    <xsl:text>
</xsl:text>
                  </xsl:otherwise>
                </xsl:choose>
              </id>
            </meta>
            <xsl:copy-of
              select="doc(doc(@href)/*/features-benefits-ref/@href)/*/body"/>
          </content>
        </xsl:result-document>
      </xsl:if>
    </xsl:for-each>
  </maps>
</xsl:template>

</xsl:stylesheet>

--
Contact us for world-wide XML consulting and instructor-led training
Free 5-hour lecture: http://www.CraneSoftwrights.com/links/udemy.htm
Crane Softwrights Ltd.            http://www.CraneSoftwrights.com/s/
G. Ken Holman                   mailto:gkholman@xxxxxxxxxxxxxxxxxxxx
Google+ profile: https://plus.google.com/116832879756988317389/about
Legal business disclaimers:    http://www.CraneSoftwrights.com/legal
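
The variable-based grouping that Ken describes may be easier to follow in isolation. What follows is only a minimal single-document sketch, not Ken's stylesheet. It assumes a hypothetical input whose root element is `topics`, holding `content` children shaped like the FandB files above (`meta/id` plus `body`). The `topics`, `groups` and `group` names are invented for illustration. Each `content` element selects every member of the population whose `body` is deep-equal() to its own, and acts only when it is the first member of that set.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: the topics input wrapper and the groups/group output
     elements are hypothetical names chosen for illustration. -->
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output indent="yes"/>

  <!-- Assumption: differences in indentation alone should not make two
       bodies unequal, so whitespace-only text nodes are stripped. -->
  <xsl:strip-space elements="*"/>

  <xsl:template match="/topics">
    <!-- hold the whole population in a variable -->
    <xsl:variable name="all" select="content"/>
    <groups>
      <xsl:for-each select="$all">
        <!-- every member whose body has the same structure as this one -->
        <xsl:variable name="same"
                      select="$all[deep-equal(body, current()/body)]"/>
        <!-- act only on the first member of each group -->
        <xsl:if test=". is $same[1]">
          <group size="{count($same)}">
            <xsl:copy-of select="$same/meta/id"/>
            <xsl:copy-of select="body"/>
          </group>
        </xsl:if>
      </xsl:for-each>
    </groups>
  </xsl:template>

</xsl:stylesheet>
```

Two things to keep in mind with this shape of solution. deep-equal() also compares whitespace-only text nodes, so without the strip-space declaration two bodies that differ only in indentation would fall into separate groups. And every item is compared against every other item, so the cost grows quadratically with the number of topics, which may matter for a set of a few thousand files.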