[xsl] Combining lists without duplication

Subject: [xsl] Combining lists without duplication
From: Roger Sperberg <rsperberg@xxxxxxxxx>
Date: Fri, 28 Sep 2007 13:10:57 -0700 (PDT)
I've assembled a list of country subdivisions, and I want to combine
two separate sources of names with this list without duplicating the
names. I'm confused as to how best to go about it.

The list I've got is an amalgamation from several sources and does
contain some subdivisions not included in the listings from ISO or BGN
(the U.S. Board on Geographic Names). I've concluded, however, that
names from these sources should be used whenever possible.

I've combined the main list and the ISO list so that each entry
contains a section along the following lines. There may or may not be
a second basename element, with one or more iso-names:

<subdiv fips="AF13">
  <basename>
    <name1>Kabul</name1>
    <name2>Kaboul</name2>
    <name3>K`bul</name3>
    <name4>Kabol</name4>
  </basename>
  <basename>
    <iso-name>K`bul</iso-name>
    <iso-name2>K`bol</iso-name2>
  </basename>
</subdiv>

An entry in the separate BGN-names file includes information
indicating whether it is the preferred name (nt="N") or a variant
(nt="V"). Each entry has a unique id for the name (uni) and a unique
id for the subdivision (ufi) that's shared among the variant names for
that subdivision. Preferred names often include a short form. A form
of the name is also included that removes all accents and diacritics
(bgn-name-nd).


Here are the four entries in that file for the subdivision cited above:

<subdiv ufi="-3378436" uni="-4801481" fips="AF13" nt="N"
        short-name="K`bol" bgn-name="Vel`yat-e K`bol"
        bgn-name-nd="Velayat-e Kabol" />
<subdiv ufi="-3378436" uni="-4801502" fips="AF13" nt="V"
        bgn-name="Vel`yat-e K`bul" bgn-name-nd="Velayat-e Kabul" />
<subdiv ufi="-3378436" uni="-4801510" fips="AF13" nt="V"
        bgn-name="Kabul Province" bgn-name-nd="Kabul Province" />
<subdiv ufi="-3378436" uni="523049" fips="AF13" nt="V"
        bgn-name="K`bol" bgn-name-nd="Kabol" />

The result I'd like would
- use the BGN preferred name's short form, if there is one, as the
  subdivision name
- if not, use the bgn-name
- include the bgn-name and the accent-and-diacritic-free form

All the other names (BGN variants, ISO names and/or variants, and
names collected from general sources) should be collected in an
other-names element, with duplicates excluded.


In many instances, BGN includes a variant that matches the short form
of the BGN standard name. I'd like to exclude that.

I'd like to exclude any ISO or generally collected name that matches
the accent-and-diacritic-free form of the preferred name.

And, obviously, I'd like to exclude any ISO name that duplicates the
BGN preferred name or any BGN variant, and exclude any generally
collected name that duplicates a BGN or ISO name.

The result for K`bol would be:

<subdiv fips="AF13">
  <basename>
    <name>K`bol</name>
    <long-form>Vel`yat-e K`bol</long-form>
    <long-form-nd>Velayat-e Kabol</long-form-nd>
  </basename>
  <other-names>
    <bgn-variant>Vel`yat-e K`bul</bgn-variant>
    <bgn-variant>Kabul Province</bgn-variant>
    <iso-name>K`bul</iso-name>
    <alt-name>Kabul</alt-name>
    <alt-name>Kaboul</alt-name>
    <alt-name>Kabol</alt-name>
  </other-names>
</subdiv>

Whenever no BGN entry exists, I want to use the first ISO entry for
the name, with all other unique names put into the other-names wrapper.


             *        *         *
When I started working out the XSLT, I began by testing to see if a
BGN name existed. If so, I would use the short form if available, and
then add the variants, testing to see if any of them were the same as
@bgn-name-nd. This would handle 75 to 90 percent of the subdivisions.
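
A minimal sketch of that first step, for illustration only. It assumes
the BGN entries sit in a separate file (the name bgn-names.xml is
hypothetical) and that entries are matched on the shared fips
attribute. The sequence expression picks short-name when it exists and
falls back to bgn-name otherwise:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: hypothetical file name for the BGN entries. -->
  <xsl:variable name="bgn-doc" select="document('bgn-names.xml')"/>

  <xsl:template match="subdiv">
    <!-- The preferred (nt="N") BGN entry for this subdivision, if any. -->
    <xsl:variable name="preferred"
        select="$bgn-doc//subdiv[@fips = current()/@fips][@nt = 'N'][1]"/>
    <subdiv fips="{@fips}">
      <xsl:if test="$preferred">
        <basename>
          <!-- First item of the sequence: short-name if present, else bgn-name. -->
          <name>
            <xsl:value-of select="($preferred/@short-name, $preferred/@bgn-name)[1]"/>
          </name>
          <long-form><xsl:value-of select="$preferred/@bgn-name"/></long-form>
          <long-form-nd><xsl:value-of select="$preferred/@bgn-name-nd"/></long-form-nd>
        </basename>
      </xsl:if>
    </subdiv>
  </xsl:template>

</xsl:stylesheet>
```

The case where no BGN entry exists is left out of this sketch.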

Shortly after that point, my understanding of the correct approach
began to crumble.

If an ISO name exists also, I can easily check it against the BGN
standard name and bgn-name-nd, but I'm not sure what the test looks
like against the BGN variants, if there are any. I don't see any way
to use for-each to test against each variant. Nor can I figure out how
to rely on choose/when/otherwise without knowing how many variants
there are.


I guess there is a node-set that consists of all the subdiv elements
that have nt="V" and a ufi attribute whose value is equal to the BGN
standard name's ufi. But I don't know how to compare the iso-name
against the whole group of them (as opposed to individually using
for-each).
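
One way this comparison might be written, offered as a sketch rather
than a definitive answer: in XPath, the = operator applied to a
sequence is true if any one item in the sequence matches, so
not(. = $names) tests a value against the whole group in one step. The
file name bgn-names.xml and the matching on fips are assumptions:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: hypothetical file name for the BGN entries. -->
  <xsl:variable name="bgn-doc" select="document('bgn-names.xml')"/>

  <xsl:template match="subdiv">
    <xsl:variable name="bgn" select="$bgn-doc//subdiv[@fips = current()/@fips]"/>
    <xsl:variable name="preferred" select="$bgn[@nt = 'N'][1]"/>
    <xsl:variable name="variants" select="$bgn[@nt = 'V']"/>
    <!-- Every BGN string that an ISO name must not repeat. -->
    <xsl:variable name="bgn-names"
        select="$preferred/(@short-name, @bgn-name, @bgn-name-nd),
                $variants/@bgn-name"/>
    <other-names>
      <!-- Keep an ISO name only if it equals none of the BGN strings. -->
      <xsl:for-each
          select="basename/*[starts-with(name(), 'iso-name')][not(. = $bgn-names)]">
        <iso-name><xsl:value-of select="."/></iso-name>
      </xsl:for-each>
    </other-names>
  </xsl:template>

</xsl:stylesheet>
```

The comparison is an exact string match, so names that differ only in
whitespace or case would still count as different.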


And then when I have added iso-names, how do I compare each generally
collected name against the BGN and ISO names? It must be the same
process, but now I'm getting a pretty complicated set.
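
A possible way to keep the set manageable, sketched under the same
assumptions (hypothetical file name bgn-names.xml, matching on fips):
put all candidate names into one sequence in priority order, drop
those already used in basename, and let xsl:for-each-group keep the
first occurrence of each distinct string. The fallback for
subdivisions with no BGN entry is not handled here:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: hypothetical file name for the BGN entries. -->
  <xsl:variable name="bgn-doc" select="document('bgn-names.xml')"/>

  <xsl:template match="subdiv">
    <xsl:variable name="bgn" select="$bgn-doc//subdiv[@fips = current()/@fips]"/>
    <xsl:variable name="preferred" select="$bgn[@nt = 'N'][1]"/>
    <!-- Names already used in basename; other-names must not repeat them. -->
    <xsl:variable name="taken"
        select="$preferred/(@short-name, @bgn-name, @bgn-name-nd)"/>
    <!-- Candidates in priority order: BGN variants, ISO names, collected names. -->
    <xsl:variable name="candidates"
        select="$bgn[@nt = 'V']/@bgn-name,
                basename/*[starts-with(name(), 'iso-name')],
                basename/*[starts-with(name(), 'name')]"/>
    <other-names>
      <xsl:for-each-group select="$candidates[not(. = $taken)]"
                          group-by="string(.)">
        <!-- The first member of each group decides the element name. -->
        <xsl:variable name="first" select="current-group()[1]"/>
        <xsl:element name="{if ($first instance of attribute()) then 'bgn-variant'
                            else if (starts-with(name($first), 'iso-name')) then 'iso-name'
                            else 'alt-name'}">
          <xsl:value-of select="current-grouping-key()"/>
        </xsl:element>
      </xsl:for-each-group>
    </other-names>
  </xsl:template>

</xsl:stylesheet>
```

Because groups are processed in order of first appearance, a name
found in both the ISO and the collected lists is emitted once, as an
iso-name.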

Guidance, please?


I tried searching the list archives, but (a) I'm not sure how to term
what I'm looking for and (b) I wasn't sure that what I found actually
applied. Just pointing me to the right section in a reference would be
very welcome.


I'm transforming the file using Saxon B 8.9 and XSLT 2.0, so I can use
the third parameter with key().
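
For reference, a sketch of what that lookup could look like. The key
name bgn-by-fips and the file name bgn-names.xml are hypothetical; the
third argument restricts the search to the BGN document even though
the key's match pattern also fits subdiv elements in the main list:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Index subdiv elements by their fips code. -->
  <xsl:key name="bgn-by-fips" match="subdiv" use="@fips"/>

  <!-- Assumption: hypothetical file name for the BGN entries. -->
  <xsl:variable name="bgn-doc" select="document('bgn-names.xml')"/>

  <xsl:template match="subdiv">
    <!-- All BGN entries for this subdivision, looked up in the BGN file only. -->
    <xsl:variable name="bgn" select="key('bgn-by-fips', @fips, $bgn-doc)"/>
    <subdiv fips="{@fips}">
      <xsl:for-each select="$bgn[@nt = 'V']">
        <bgn-variant><xsl:value-of select="@bgn-name"/></bgn-variant>
      </xsl:for-each>
    </subdiv>
  </xsl:template>

</xsl:stylesheet>
```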

Thanks.

Roger Sperberg
A not-too-frequent XSLT-er
Montclair, NJ
--
Cambodian Language Exercises -- cambodian.tiddlyspot.com
Beginning Cambodian Reader -- cambodian-reader.tiddlyspot.com
