Subject: [xsl] Extraction of data using key() and matches()
From: Jakob Fix <jakob.fix@xxxxxxxxx>
Date: Sat, 5 Jun 2010 21:02:20 +0200

I have a large number of XML data files (previously Excel files), each
of which contains a table made up of rows and data cells.

I want to find out whether any of the table's data cells contains a
given country name, and to record in another file all country names
that appear in the data file. The country name
may be the only content of the data cell (<col>United Kingdom</col>),
or it may be surrounded by other text (<col>Data has been provided for
United Kingdom only.</col>). It can also be that more than one country
name appears in a table cell. There won't be other elements in the
cell, just character data.

My current approach is to have an exhaustive lookup file with *all*
country names that are potentially used. For each XML data file, I
loop over all country names and check whether any cell in the data
file matches the current country name.

The following works but is rather slow:


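Country lookup file (excerpt; each <country> has an <en> child with the
English name, not shown here):

  <country code="ABW">
  <country code="AFG">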


Data file (excerpt):

    <name><![CDATA[Figure 1.1 (I)]]></name>
    <row number="0">
      <col number="0"><![CDATA[United Kingdom]]></col>
    </row>
    <row number="1">
      <col number="0"><![CDATA[Part I. ]]></col>
      <col number="1"><![CDATA[These data apply to France, Germany and
a couple of other countries.]]></col>
    </row>


Stylesheet (excerpt):

<xsl:for-each select="document($country-file)/countries/country/en">
  <xsl:variable name="current-node" select="."/>
  <xsl:if test="$data-doc//col[matches(., $current-node/text())]">
    <country><xsl:value-of select="$current-node/../@code"/></country>
  </xsl:if>
</xsl:for-each>

In order to speed up the process I was thinking about indexing all
data cells using xsl:key. However, I cannot see how the key() and the
matches() function can be combined to use the former's speed with the
latter's regex power.

I was hoping to do something along these lines, but would need some
help as this doesn't currently work:

<xsl:key name="cell" match="col" use="text()"/><!-- create an index of
the cells' contents -->

<xsl:for-each select="document($country-file)/countries/country/en">
  <xsl:variable name="current-node" select="."/><!-- don't lose the
current node -->
  <xsl:for-each select="document($data-file)"><!-- change context to
data document -->
    <!-- key returns a nodeset, so count the number of nodes in the nodeset.
          this doesn't work if the country name is not the only content -->
    <xsl:if test="count(key('cell', $current-node)) > 0">
      <country><xsl:value-of select="$current-node/../@code"/></country>
    </xsl:if>
  </xsl:for-each>
</xsl:for-each>
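
One direction that might get around the exact-match limitation, as a
rough XSLT 2.0 sketch rather than a tested solution: instead of testing
every country against every cell, join all names into a single regular
expression, scan each cell once with xsl:analyze-string, and use key()
on the lookup file to turn each matched name into its code. The key
name "country-by-name", the output element <countries-found>, the
escaping expression and the default file name countries.xml are
assumptions made for illustration. The sketch also assumes that the
lookup file is not empty and that the <en> text matches the cell text
exactly (same case, same whitespace).

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Assumption: hypothetical default location of the lookup file -->
  <xsl:param name="country-file" select="'countries.xml'"/>
  <xsl:variable name="countries" select="document($country-file)"/>

  <!-- Index the lookup file (not the data cells) by English name -->
  <xsl:key name="country-by-name" match="country" use="en"/>

  <!-- All names as one alternation, longest first, with regex
       metacharacters escaped so that names are matched literally -->
  <xsl:variable name="names-regex" as="xs:string">
    <xsl:variable name="escaped" as="xs:string*">
      <xsl:for-each select="$countries/countries/country/en">
        <xsl:sort select="string-length(.)" order="descending"/>
        <xsl:sequence
            select="replace(., '[\.\\\?\*\+\{\}\(\)\[\]\|\^\$\-]', '\\$0')"/>
      </xsl:for-each>
    </xsl:variable>
    <xsl:sequence select="string-join($escaped, '|')"/>
  </xsl:variable>

  <!-- Applied to one data file as the source document -->
  <xsl:template match="/">
    <!-- Every country name found in any cell, one scan per cell -->
    <xsl:variable name="hits" as="xs:string*">
      <xsl:for-each select="//col">
        <xsl:analyze-string select="string(.)" regex="{$names-regex}">
          <xsl:matching-substring>
            <xsl:sequence select="."/>
          </xsl:matching-substring>
        </xsl:analyze-string>
      </xsl:for-each>
    </xsl:variable>

    <countries-found>
      <!-- Map each distinct matched name back to its code via the key -->
      <xsl:for-each select="distinct-values($hits)">
        <country>
          <xsl:value-of
              select="key('country-by-name', ., $countries)/@code"/>
        </country>
      </xsl:for-each>
    </countries-found>
  </xsl:template>

</xsl:stylesheet>
```

Unlike the matches() loop, a single pass reports only one name where two
names overlap in the text (for example when one name is a prefix of
another), which is why the alternation is sorted longest first.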

Maybe there's another solution that is more elegant and more efficient
than what I've shown above. I'd love to know about it.  Thank you in
advance for your help.


