Re: [xsl] combine xml files

Subject: Re: [xsl] combine xml files
From: "Thomas B. Passin" <tpassin@xxxxxxxxxxxx>
Date: Thu, 11 Apr 2002 15:28:32 -0400
[Ming]

> Hi, Tom,
>
> You are really good! All your assumptions are correct.
>

OK, then.  This will be a bit long, but it's not complicated.  I'm going to
leave the task of iterating over the individual files to you - you already
have received some suggestions - and just give you one solution to getting
titles and names from a single xml file according to their respective db
priorities.

I won't claim that this is the most efficient stylesheet.  I'm sure others
on the list, like Jeni or Mike Kay, can come up with a more efficient
approach.  I'm after simplicity to give you a starting point that you can
easily modify to suit your needs.

First, I made a few modifications to your xml file - I quoted the attributes
to make it well-formed, and I added "db2" to the titles for db2 so we can
tell if our preferences are being followed when we look at the output.  I
also changed the name of the root element from "xml" to "record". Here is
the resulting source file:

========= Source XML file =========
<record>
  <db1>
     <jauthor>
        <author db="db1"> Smith, J</author>
        <author db="db1"> Mou, S </author>
    </jauthor>
    <jtitle>
       <title db="db1"> Preliminary study on network (II)(db1) </title>
    </jtitle>
  </db1>

  <db2>
     <jauthor>
       <author db="db2"> Smith, JR </author>
       <author db="db2"> Mou, ST </author>
     </jauthor>
     <jtitle>
       <title db="db2"> Preliminary Study on Network (II)(db2) </title>
     </jtitle>
  </db2>
</record>
====================================

Next, I created an xml file for the db preferences.  The priorities are to
be applied in their document order:

========= File db_prefs.xml ===========
<dbprefs>
 <titles>
  <pref name='db2'/>
  <pref name='db1'/>
  <pref name='db3'/>
 </titles>
 <authors>
  <pref name='db1'/>
  <pref name='db2'/>
  <pref name='db3'/>
 </authors>
</dbprefs>
==================================
Notice that I used different priorities for titles and authors, and I
included a third db to demonstrate that we don't return results for a db for
which we have no data.

Here is the stylesheet, part by part with comments:

======== Stylesheet ===============
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- Variable so we can refer to the source document -->
<xsl:variable name='record' select='/record'/>

<!-- Variables for our priorities, gotten from the prefs file -->
<xsl:variable name='title-prefs'
 select='document("db_prefs.xml")/dbprefs/titles/pref'/>

<xsl:variable name='author-prefs'
 select='document("db_prefs.xml")/dbprefs/authors/pref'/>

<!--================================
    It's more readable to define the -prefs variables here than to write the
document() calls inline where we need them.
===================================-->

<!--=================================
    Get the title first, then the authors.  We do them separately so we can
apply the appropriate priorities to each.  This also makes it easy to deal
with the fact that there can be several authors for a work (since we don't
intend to pull one author from db1 and another from db2, for example).
=================================== -->
<xsl:template match="/record">
<results>
 <title>
  <xsl:call-template name='get-title'/>
 </title>

 <authors>
  <xsl:call-template name='get-authors'/>
 </authors>
</results>
</xsl:template>

<!--==================================
    The key point here is to get the titles in order of their db priority.
Then the first one will automatically have the highest priority.  How can we
separate it out from the other possible titles from other dbs?  My approach
is a bit of a hack, but simple.  I return a single string with all the
titles concatenated with \\\ between them.  I get the first one by using
substring-before().  Of course, using \\\ is arbitrary, any separator would
do that isn't going to show up in the titles (as I said, a bit of a hack but
it makes things nice and simple).

This is easier than creating machinery to continue iterating through all the
titles only if we have not already found one.  I doubt that the extra time
to iterate through them all is enough to be harmful, considering that we can
avoid testing on each iteration, but this point could be tested.
========================================-->
<xsl:template name='get-title'>
 <xsl:variable name='title-results'>

    <!-- ============================
        Here is where we apply the priorities.  We use xsl:for-each to go
through them in order
    ==============================-->
  <xsl:for-each select='$title-prefs/@name'>
   <xsl:variable name='db' select='.'/>

    <!--=================================
        Here we get all the titles, regardless of which db they are grouped
with.  If you didn't use the "db" attribute we'd have to change the approach
a bit to carry the db information along.  The way you have done it makes
this easier to do.
    ==================================== -->
   <xsl:variable name='title' select='$record/*/jtitle/title[@db=$db]'/>
   <xsl:if test='$title'>
    <xsl:value-of select='concat($title,"\\\")'/>
   </xsl:if>
  </xsl:for-each>
 </xsl:variable>

    <!--====================================
            The hack exposed!
    ======================================-->
 <xsl:value-of select='substring-before($title-results,"\\\")'/>
</xsl:template>

<!--===============================
    I treat the authors the same way, but it's a bit harder because there
may be more than one author and you may want to apply some formatting
between their names.  Here, I just insert two non-breaking spaces between
the names.  Otherwise, they are handled just like the titles.

In particular, the authors, all of them, are returned as a single string.
If you need to break them out into separate elements, you may have to
convert them to a node-set so you can return just the first one (e.g.,
authors[1]).  If so, you have to make sure to use an xslt processor that has
a convert-to-node-set extension.
======================================-->
<xsl:template name='get-authors'>
 <xsl:variable name='author-results'>
  <xsl:for-each select='$author-prefs/@name'>
   <xsl:variable name='db' select='.'/>
   <xsl:variable name='authors' select='$record/*/jauthor/author[@db=$db]'/>

   <xsl:if test='$authors'>
    <xsl:for-each select='$authors'>
     <xsl:value-of select='.'/><xsl:text>&#160;&#160;</xsl:text>
    </xsl:for-each><xsl:text>\\\</xsl:text>
   </xsl:if>

  </xsl:for-each>
 </xsl:variable>
 <xsl:value-of select='substring-before($author-results,"\\\")'/>
</xsl:template>

<!--==================================-->
</xsl:stylesheet>

===========================================

And here are the results, with some whitespace changed for visual
formatting:

<results>
<title> Preliminary Study on Network (II)(db2) </title>
<authors> Smith, J    Mou, S   </authors>
</results>

You see that we got the title from db2 and the authors from db1, as required
by the priorities in the db_prefs.xml file.
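
For comparison, here is a minimal sketch of the "machinery" alternative mentioned in the get-title comment: a recursive named template that stops at the first preferred db that has a title, so no separator string is needed.  The template name first-title and its prefs parameter are made up for this illustration; everything else assumes the same source file and db_prefs.xml shown above.  Only the title is handled here.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:variable name="record" select="/record"/>
<xsl:variable name="title-prefs"
    select='document("db_prefs.xml")/dbprefs/titles/pref'/>

<xsl:template match="/record">
 <results>
  <title>
   <xsl:call-template name="first-title"/>
  </title>
 </results>
</xsl:template>

<!-- Hypothetical template: try the first remaining preference; recurse
     on the rest only if that db has no title in this record. -->
<xsl:template name="first-title">
 <xsl:param name="prefs" select="$title-prefs"/>
 <xsl:if test="$prefs">
  <xsl:variable name="db" select="$prefs[1]/@name"/>
  <xsl:variable name="title" select="$record/*/jtitle/title[@db=$db]"/>
  <xsl:choose>
   <xsl:when test="$title">
    <xsl:value-of select="$title"/>
   </xsl:when>
   <xsl:otherwise>
    <xsl:call-template name="first-title">
     <xsl:with-param name="prefs" select="$prefs[position() &gt; 1]"/>
    </xsl:call-template>
   </xsl:otherwise>
  </xsl:choose>
 </xsl:if>
</xsl:template>

</xsl:stylesheet>
```

With the sample files, db2 is first in the title preferences and has a title, so the recursion ends on the first call.  This version is longer than the separator approach, but it does not depend on picking a separator that never occurs in the data.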

In practice, you will either want to list all the files in a driver file and
run them through the stylesheet in one invocation, or you will want to
compile the stylesheet and keep it in memory so that it does not have to be
rebuilt for each xml file.  You don't want to invoke the stylesheet
separately for each file since that would take a long time, considering that
you may have a lot of files to process.
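
As a rough sketch of the driver-file option, the stylesheet below takes a list of file names as its source document and pulls each record in with document().  The driver file layout (files, file, href), the file names, and the all-results wrapper are all assumptions made for this example.  Because the source document is now the driver file, the record can no longer be a global variable, so it is passed to get-title as a parameter.  Authors are left out to keep it short; they would be handled the same way.

```xml
<!-- Hypothetical driver file, files.xml, used as the source document:
       <files>
         <file href="record1.xml"/>
         <file href="record2.xml"/>
       </files>
-->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:variable name="title-prefs"
    select='document("db_prefs.xml")/dbprefs/titles/pref'/>

<!-- One <results> element per listed file.  document(@href) resolves
     each name relative to the driver file. -->
<xsl:template match="/files">
 <all-results>
  <xsl:for-each select="file">
   <xsl:apply-templates select="document(@href)/record"/>
  </xsl:for-each>
 </all-results>
</xsl:template>

<xsl:template match="record">
 <results>
  <title>
   <xsl:call-template name="get-title">
    <xsl:with-param name="record" select="."/>
   </xsl:call-template>
  </title>
 </results>
</xsl:template>

<!-- Same separator approach as above, but with the record as a parameter -->
<xsl:template name="get-title">
 <xsl:param name="record"/>
 <xsl:variable name="title-results">
  <xsl:for-each select="$title-prefs/@name">
   <xsl:variable name="db" select="."/>
   <xsl:variable name="title" select="$record/*/jtitle/title[@db=$db]"/>
   <xsl:if test="$title">
    <xsl:value-of select='concat($title,"\\\")'/>
   </xsl:if>
  </xsl:for-each>
 </xsl:variable>
 <xsl:value-of select='substring-before($title-results,"\\\")'/>
</xsl:template>

</xsl:stylesheet>
```

The stylesheet and the prefs file are then loaded once for the whole run, however many files the driver lists.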

This method will not work if you concatenate all the separate xml files,
since it relies on getting a single result from a single file.  But it may
give you ideas for handling a concatenated file if you end up wanting to try
that out.  I don't think you will need to.

I suggest that you use a very simple output format, like this one, while you
are developing the method for processing all the files.  Once everything
works, and is fast enough, you can tune up the stylesheet to produce HTML or
whatever you want.  Keep it as simple as possible for as long as possible.
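
If a later version needs the authors as separate elements rather than one string, the node-set route mentioned in the get-authors comment could look roughly like the sketch below.  It assumes a processor that supports the EXSLT exsl:node-set() function (other processors have their own equivalents under different namespaces), and the group wrapper element is invented for this example.  It also assumes the same source file and db_prefs.xml as above.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:exsl="http://exslt.org/common"
    exclude-result-prefixes="exsl">

<xsl:variable name="record" select="/record"/>
<xsl:variable name="author-prefs"
    select='document("db_prefs.xml")/dbprefs/authors/pref'/>

<xsl:template match="/record">
 <!-- Result tree fragment: one <group> per db that has authors,
      in priority order.  <group> is a made-up wrapper element. -->
 <xsl:variable name="groups">
  <xsl:for-each select="$author-prefs/@name">
   <xsl:variable name="db" select="."/>
   <xsl:variable name="authors"
       select="$record/*/jauthor/author[@db=$db]"/>
   <xsl:if test="$authors">
    <group><xsl:copy-of select="$authors"/></group>
   </xsl:if>
  </xsl:for-each>
 </xsl:variable>

 <!-- Convert the fragment to a node-set and keep only the first group,
      that is, the authors from the highest-priority db that has any. -->
 <results>
  <authors>
   <xsl:copy-of select="exsl:node-set($groups)/group[1]/author"/>
  </authors>
 </results>
</xsl:template>

</xsl:stylesheet>
```

Here the first group takes the place of substring-before(): the priority ordering is the same, but the author elements survive intact and can be formatted individually.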

Cheers,

Tom P


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

