RE: [xsl] deduplicating information in XML files

Subject: RE: [xsl] deduplicating information in XML files
From: Robby Pelssers <Robby.Pelssers@xxxxxxx>
Date: Fri, 12 Oct 2012 16:35:15 +0200
Did I ever mention you guys rock !!!

Thx Ken and Wendell for yet again replying to a not so trivial problem.

I really need to get my Outlook settings fixed... my replies bounce back too
often, so my gratitude does not always arrive ;-(

Robby



-----Original Message-----
From: G. Ken Holman [mailto:g.ken.holman@xxxxxxxxx] On Behalf Of G. Ken
Holman
Sent: Friday, October 12, 2012 4:21 PM
To: XSL List
Subject: Re: [xsl] deduplicating information in XML files

At 2012-10-12 14:02 +0200, Robby Pelssers wrote:
>Hi all,
>
>This time I have a rather challenging task at hand.  Let me first
>describe the use case.  We have lots of product information stored in
>XML.  Some of that information describes:
>- Technical applications
>- Features and benefits
>- Technical summary
>
>One of the problems is a lot of products had e.g. the same features and
>benefits as they are of the same product family or group.  But as we
>stored that info per product it got duplicated.  Now we want to
>deduplicate that info by generating DITA maps and topics (both are just
>XML).  Now for simplicity let's assume we generate the following
>content for product1 and product2.  The goal is to get from INPUT to
>OUTPUT by checking whether the bodies of the linked topics are duplicates,
>then creating one generic topic and rewriting the links in the maps to
>point to that single topic.  I have XSLT / XQuery (XMLDB) and Java at my
>disposal.  I'm not sure what will be the easiest way to get the job done.
>Also keep in mind that my INPUT will contain a few thousand files (maps and
>linked topics) and I will need to deduplicate the whole set ;-)
>
>Thx upfront for any input,
>Robby
>
>INPUT
>
>Product1_map.xml
><map>
>   <features-benefits-ref href="features-benefits/Product1_FandB.xml"/>
></map>
>
>Product1_FandB.xml:
><content>
>   <meta>
>     <id>product1</id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching
>characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
></content>
>
>Product2_map.xml
><map>
>   <features-benefits-ref href="features-benefits/Product2_FandB.xml"/>
></map>
>
>Product2_FandB.xml:
><content>
>   <meta>
>     <id>product2</id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching
>characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
></content>
>
>Expected output:
>
>Product1_map.xml
><map>
>   <features-benefits-ref href="features-benefits/FandB_1.xml"/>
></map>
>
>Product2_map.xml
><map>
>   <features-benefits-ref href="features-benefits/FandB_1.xml"/>
></map>
>
>FandB_1.xml:
><content>
>   <meta>
>     <id><!-- can become empty --></id>
>   </meta>
>   <body>
>     <p>Suitable for high frequency applications due to fast switching
>characteristics</p>
>     <p>Suitable for logic level gate drive sources</p>
>   </body>
></content>

I hope the complete solution below in XSLT is helpful.  I see that Wendell
posted while I was working on this, and I like his idea of using the
collection() function rather than my hardwired map of maps.  I'll leave that
with you as an exercise.  You can also tweak the file name generation as you
need.  Oh, and I also added some additional data.
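
One way the collection() idea could look is to generate the map of maps instead of writing it by hand. This is only a sketch under assumptions: the `?select=` query on a directory URI is Saxon-specific behaviour, the `map-dir` parameter and the `main` template name are hypothetical, and the directory is the one used in the transcript below. The order in which collection() returns documents is not guaranteed.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output indent="yes"/>

  <!--hypothetical parameter: directory holding the *_map.xml files-->
  <xsl:param name="map-dir" select="'file:///t:/ftemp/robby/'"/>

  <!--assumes the processor can start at a named template-->
  <xsl:template name="main">
    <!--assumption: Saxon-style directory collection with a ?select= filter-->
    <xsl:variable name="map-docs"
        select="collection(concat($map-dir, '?select=*_map.xml'))"/>
    <maps>
      <xsl:for-each select="$map-docs">
        <!--keep only the file name, as in the hand-written robby.xml-->
        <map href="{tokenize(base-uri(.), '/')[last()]}"/>
      </xsl:for-each>
    </maps>
  </xsl:template>

</xsl:stylesheet>
```

The result has the same shape as robby.xml, so it could be fed to the stylesheet below unchanged.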

I was really curious about this solution.  In the classroom I teach the three
methods of grouping in XSLT 1:  by axes, by keys and by variables.  When I
talk about XSLT 2 I claim (or used to claim!) that these methods were no
longer needed.  But ... I had to use the variable method in XSLT 2 in order to
solve your requirement!  So I'll have to change my classroom materials to
reflect this.

The reason I had to use the variable-based grouping method is that the XSLT 2
<xsl:for-each-group>'s group-by= attribute is based on the value calculated,
not on the structure.  I had to use deep-equal() in order to determine if the
structure was the same.  So that ruled out <xsl:for-each-group>.  So I
instantly turned to the XSLT 1 variable-based method in order to work across
documents with an arbitrary calculation of equality, knowing that the shape of
the solution would give me what I wanted.
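
To illustrate the difference between value-based and structure-based comparison, here is a minimal sketch with hypothetical sample data (the two `body` elements are invented for this example and are not from the product files). Both bodies have the same string value, so `group-by="."` should put them in a single group, while deep-equal() should report them as different.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="text"/>

  <!--hypothetical sample data: same string value, different structure-->
  <xsl:variable name="bodies">
    <body><p>fast switching</p><p>logic level</p></body>
    <body><p>fast switchinglogic level</p></body>
  </xsl:variable>

  <!--run against any input document-->
  <xsl:template match="/">
    <!--group-by="." atomizes each body to its string value-->
    <xsl:for-each-group select="$bodies/body" group-by=".">
      <xsl:text>members in value-based group: </xsl:text>
      <xsl:value-of select="count(current-group())"/>
      <xsl:text>&#xa;</xsl:text>
    </xsl:for-each-group>
    <!--deep-equal() compares the element structure as well-->
    <xsl:text>deep-equal: </xsl:text>
    <xsl:value-of select="deep-equal($bodies/body[1], $bodies/body[2])"/>
    <xsl:text>&#xa;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Grouping on the string value alone would therefore merge topics whose markup differs, which is why the comparison in the solution is done with deep-equal() on the `body` elements.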

I think this is directly translatable to XQuery, and so I will post such a
solution to that list.

Good luck!

. . . . . . . . Ken

t:\ftemp\robby>type robby.xml
<?xml version="1.0" encoding="UTF-8"?>
<maps>
   <map href="Product1_map.xml"/>
   <map href="Product2_map.xml"/>
   <map href="Product3_map.xml"/>
   <map href="Product4_map.xml"/>
   <map href="Product5_map.xml"/>
</maps>

t:\ftemp\robby>type Product1_map.xml
<map>
   <features-benefits-ref href="features-benefits/Product1_FandB.xml"/>
</map>

t:\ftemp\robby>type Product2_map.xml
<map>
   <features-benefits-ref href="features-benefits/Product2_FandB.xml"/>
</map>

t:\ftemp\robby>type Product3_map.xml
<map>
   <features-benefits-ref href="features-benefits/Product3_FandB.xml"/>
</map>

t:\ftemp\robby>type Product4_map.xml
<map>
   <features-benefits-ref href="features-benefits/Product4_FandB.xml"/>
</map>

t:\ftemp\robby>type Product5_map.xml
<map>
   <features-benefits-ref href="features-benefits/Product5_FandB.xml"/>
</map>

t:\ftemp\robby>dir /s features-benefits
  Volume in drive T is VBOX_t
  Volume Serial Number is 0E00-0002

  Directory of t:\ftemp\robby\features-benefits

2012-10-12  08:37               235 Product1_FandB.xml
2012-10-12  08:37               235 Product2_FandB.xml
2012-10-12  08:38               286 Product3_FandB.xml
2012-10-12  08:38               285 Product4_FandB.xml
2012-10-12  08:38               285 Product5_FandB.xml
                5 File(s)          1,326 bytes

      Total Files Listed:
                5 File(s)          1,326 bytes
                0 Dir(s)  16,795,488,256 bytes free

t:\ftemp\robby>type features-benefits\Product1_FandB.xml
<content>
   <meta>
     <id>product1</id>
   </meta>
   <body>
     <p>Suitable for high frequency applications due to fast switching
characteristics</p>
     <p>Suitable for logic level gate drive sources</p>
   </body>
</content>

t:\ftemp\robby>type features-benefits\Product2_FandB.xml
<content>
   <meta>
     <id>product2</id>
   </meta>
   <body>
     <p>Suitable for high frequency applications due to fast switching
characteristics</p>
     <p>Suitable for logic level gate drive sources</p>
   </body>
</content>

t:\ftemp\robby>type features-benefits\Product3_FandB.xml
<content>
   <meta>
     <id>product3</id>
   </meta>
   <body>
     <p>Suitable for high frequency applications due to fast switching
characteristics</p>
     <p>Suitable for logic level gate drive sources</p>
     <p>With additional text that is different</p>
   </body>
</content>

t:\ftemp\robby>type features-benefits\Product4_FandB.xml
<content>
   <meta>
     <id>product4</id>
   </meta>
   <body>
     <p>Suitable for high frequency applications due to fast switching
characteristics</p>
     <p>Suitable for logic level gate drive sources</p>
     <p>With additional text that is the same</p>
   </body>
</content>

t:\ftemp\robby>type features-benefits\Product5_FandB.xml
<content>
   <meta>
     <id>product5</id>
   </meta>
   <body>
     <p>Suitable for high frequency applications due to fast switching
characteristics</p>
     <p>Suitable for logic level gate drive sources</p>
     <p>With additional text that is the same</p>
   </body>
</content>

t:\ftemp\robby>call xslt2 robby.xml robby.xsl out\robbyout.xml

t:\ftemp\robby>dir \s out
  Volume in drive T is VBOX_t
  Volume Serial Number is 0E00-0002

  Directory of t:\


  Directory of t:\ftemp\robby\out

2012-10-12  10:02    <DIR>          features-benefits
2012-10-12  10:14                94 Product1_map.xml
2012-10-12  10:14                94 Product2_map.xml
2012-10-12  10:14                84 Product3_map.xml
2012-10-12  10:14                94 Product4_map.xml
2012-10-12  10:14                94 Product5_map.xml
2012-10-12  10:14               371 robbyout.xml
                6 File(s)          1,001 bytes
                1 Dir(s)  16,795,488,256 bytes free

t:\ftemp\robby>type out\robbyout.xml
<?xml version="1.0" encoding="UTF-8"?>
<maps><!--features-benefits/Product1_FandB.xml.group.xml-->
<map href="Product1_map.xml"/>
    <map href="Product2_map.xml"/>
    <!--features-benefits/Product3_FandB.xml-->
<map href="Product3_map.xml"/>
    <!--features-benefits/Product4_FandB.xml.group.xml-->
<map href="Product4_map.xml"/>
    <map href="Product5_map.xml"/>
</maps>

t:\ftemp\robby>type out\Product1_map.xml
<map>
    <features-benefits-ref href="features-benefits/Product1_FandB.xml.group.xml"/>
</map>

t:\ftemp\robby>type out\Product2_map.xml
<map>
    <features-benefits-ref href="features-benefits/Product1_FandB.xml.group.xml"/>
</map>

t:\ftemp\robby>type out\Product3_map.xml
<map>
    <features-benefits-ref href="features-benefits/Product3_FandB.xml"/>
</map>

t:\ftemp\robby>type out\Product4_map.xml
<map>
    <features-benefits-ref href="features-benefits/Product4_FandB.xml.group.xml"/>
</map>

t:\ftemp\robby>type out\Product5_map.xml
<map>
    <features-benefits-ref href="features-benefits/Product4_FandB.xml.group.xml"/>
</map>

t:\ftemp\robby>type out\features-benefits\Product1_FandB.xml.group.xml
<content>
    <meta>
       <id>
<!-- - features-benefits/Product1_FandB.xml-->
<!-- - features-benefits/Product2_FandB.xml-->
</id>
    </meta>
    <body>
       <p>Suitable for high frequency applications due to fast switching
characteristics</p>
       <p>Suitable for logic level gate drive sources</p>
   </body>
</content>
t:\ftemp\robby>type out\features-benefits\Product3_FandB.xml
<content>
    <meta>
       <id/>
    </meta>
    <body>
       <p>Suitable for high frequency applications due to fast switching
characteristics</p>
       <p>Suitable for logic level gate drive sources</p>
       <p>With additional text that is different</p>
   </body>
</content>
t:\ftemp\robby>type out\features-benefits\Product4_FandB.xml.group.xml
<content>
    <meta>
       <id>
<!-- - features-benefits/Product4_FandB.xml-->
<!-- - features-benefits/Product5_FandB.xml-->
</id>
    </meta>
    <body>
       <p>Suitable for high frequency applications due to fast switching
characteristics</p>
       <p>Suitable for logic level gate drive sources</p>
       <p>With additional text that is the same</p>
   </body>
</content>
t:\ftemp\robby>type robby.xsl
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
   version="2.0">

<xsl:output indent="yes"/>

<xsl:template match="maps">
   <xsl:variable name="maps" select="map"/>
   <!--walk across all maps, acting on the first one that has unique
content-->
   <maps>
     <xsl:for-each select="$maps">
       <xsl:variable name="map-href" select="@href"/>
<!--      <xsl:message select="$map-href"/>
       <xsl:message
         select="generate-id(doc(doc(@href)/*/features-benefits-ref/@href))"/>
       <xsl:message select="count(
         $maps[deep-equal(
           doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
           doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body)])"/>
       <xsl:message select="
         $maps[deep-equal(
           doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
           doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body)]
         /generate-id(.)"/>
-->
       <xsl:if test="generate-id(.) = generate-id(
           $maps[deep-equal(
               doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
               doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body
             )][1])">
         <!--found the first one of the group with this body content-->
         <xsl:variable name="current-group" select="$maps[deep-equal(
             doc(doc(@href)/*/features-benefits-ref/@href)/*/body,
             doc(doc(current()/@href)/*/features-benefits-ref/@href)/*/body)]"/>
         <xsl:variable name="count-current-group"
                       select="count($current-group)"/>
         <xsl:variable name="new-file-href"
             select="concat(doc($map-href)/*/features-benefits-ref/@href,
                            if( $count-current-group=1 )
                              then '' else '.group.xml' )"/>
         <!--just for information, note this in the result map of maps-->
         <xsl:comment select="$new-file-href"/><xsl:text>&#xa;</xsl:text>
         <xsl:for-each select="$current-group">
           <!--reference the map file-->
           <map href="{@href}"/>
           <!--recreate the map file-->
           <xsl:result-document href="{@href}" omit-xml-declaration="yes">
              <map>
                <features-benefits-ref href="{$new-file-href}"/>
              </map>
           </xsl:result-document>
         </xsl:for-each>
         <!--recreate the content file-->
         <xsl:result-document href="{$new-file-href}"
                              omit-xml-declaration="yes">
           <content>
             <meta>
               <id>
                 <xsl:choose>
                   <xsl:when test="$count-current-group=1">
                     <xsl:copy-of select="node()"/>
                   </xsl:when>
                   <xsl:otherwise>
                     <xsl:for-each select="$current-group">
                       <xsl:text>&#xa;</xsl:text>
                       <xsl:comment select="string(.), '-',
                                doc(@href)/*/features-benefits-ref/@href"/>
                     </xsl:for-each>
                     <xsl:text>&#xa;</xsl:text>
                   </xsl:otherwise>
                 </xsl:choose>
               </id>
             </meta>
             <xsl:copy-of
                select="doc(doc(@href)/*/features-benefits-ref/@href)/*/body"/>
           </content>
         </xsl:result-document>
       </xsl:if>
     </xsl:for-each>
   </maps>
</xsl:template>

</xsl:stylesheet>

--
Contact us for world-wide XML consulting and instructor-led training
Free 5-hour lecture: http://www.CraneSoftwrights.com/links/udemy.htm
Crane Softwrights Ltd.            http://www.CraneSoftwrights.com/s/
G. Ken Holman                   mailto:gkholman@xxxxxxxxxxxxxxxxxxxx
Google+ profile: https://plus.google.com/116832879756988317389/about
Legal business disclaimers:    http://www.CraneSoftwrights.com/legal
