Re: [xsl] Extraction of data using key() and matches()

Subject: Re: [xsl] Extraction of data using key() and matches()
From: Michael Kay <mike@xxxxxxxxxxxx>
Date: Sat, 05 Jun 2010 22:42:34 +0100

On 05/06/2010 20:02, Jakob Fix wrote:

> I have a large number of XML data files, each containing a table of
> rows and data cells (the files were previously Excel files).
>
> I'm interested in finding out whether a given country name occurs in
> the table's data cells. If so, I want to record in another file all
> country names that appear in the data file. The country name may be
> the only content of the data cell (<col>United Kingdom</col>), or it
> may be surrounded by other text (<col>Data has been provided for
> United Kingdom only.</col>). More than one country name may also
> appear in a table cell. There won't be other elements in the cell,
> just character data.
>
> My current approach is to have an exhaustive lookup file with *all*
> country names that are potentially used. For each XML data file, I
> loop over all country names and test whether the contents of the
> data file match the current country name.

You could create an index on all the "words" in the text using

<xsl:key name="words" match="col" use="tokenize(., '\P{L}+')"/>

where a word is defined as a maximal sequence of "letter" characters.
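
To illustrate what this key indexes for a single cell, here is a minimal XSLT 2.0 sketch that tokenizes the sample sentence from the question. The stylesheet is only an illustration and can be run against any input document. Because the sentence ends with a separator (the full stop), tokenize() also returns a trailing zero-length string, so the key will contain an entry for the empty string as well.

```xslt
<!-- Illustrative sketch only: shows the tokens produced for one sample cell. -->
<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- Split on runs of non-letter characters; print each token in brackets -->
    <xsl:for-each select="tokenize('Data has been provided for United Kingdom only.', '\P{L}+')">
      <xsl:value-of select="concat('[', ., ']')"/>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
```

This prints [Data][has][been][provided][for][United][Kingdom][only][], where the final empty pair of brackets is the zero-length token.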

Then to see whether a given country is present you could start by testing whether the first word of the country name is present:

key('words', tokenize($country, '\P{L}+')[1])

and then apply a more precise test (for example, a check for the full country name) to the cells returned by this first filter.
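
Put together, the two stages might look like the sketch below. It rests on several assumptions that are not in the original question: the lookup file is called countries.xml and has the shape <countries><country>United Kingdom</country>...</countries>, the output element names found and country are made up, and the second-stage test is a plain contains(), which is only one possible choice of stricter test.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output method="xml" indent="yes"/>

  <!-- Index every col element under each of the words it contains -->
  <xsl:key name="words" match="col" use="tokenize(., '\P{L}+')"/>

  <!-- Assumption: hypothetical lookup file and element names -->
  <xsl:variable name="countries" as="xs:string*"
      select="doc('countries.xml')/countries/country/string()"/>

  <xsl:template match="/">
    <!-- Remember the data document: inside the for-each below the
         context item is a string, so key() needs an explicit third argument -->
    <xsl:variable name="data" select="."/>
    <found>
      <xsl:for-each select="$countries">
        <xsl:variable name="country" select="."/>
        <!-- Stage 1: key lookup on the first word of the country name -->
        <xsl:variable name="candidates"
            select="key('words', tokenize($country, '\P{L}+')[1], $data)"/>
        <!-- Stage 2: check the full name, but only on the candidate cells -->
        <xsl:if test="$candidates[contains(., $country)]">
          <country><xsl:value-of select="$country"/></country>
        </xsl:if>
      </xsl:for-each>
    </found>
  </xsl:template>
</xsl:stylesheet>
```

Two caveats on this sketch: the key lookup compares words case-sensitively, and contains() is a substring test, so a cell that passes the first stage could still match a shorter name embedded in a longer word. A stricter second-stage test can be substituted without changing the first stage.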

Michael Kay
