[xsl] marking up text when term from other file is found

Subject: [xsl] marking up text when term from other file is found
From: Hoskins & Gretton <hoskgret@xxxxxxxxxxxxxxxx>
Date: Wed, 21 Apr 2010 23:08:03 -0400
Hi, I need help finding resources (examples and/or XSL) for this situation; I haven't found quite the right recipe in my searches of the list archives.

Given an XML file containing a list of terms and another file containing a mix of elements with text (narrative content, some inline markup for emphasis and footnotes), I was asked if I could find occurrences of each term wherever it appears in the narrative content, and wrap each occurrence in a tag. My first thought is to load each document into a variable. But then I don't know what the most effective method of string comparison would be, given that the narrative document might have the term's words with different capitalization. If anyone can point me in the right direction, I'd appreciate it.

Also, I would like to know if there is a practical limit to how large a narrative file I can run with about 150 terms to find in the text. And if a different approach would work better, such as writing Java to do a brute-force search and replace, please tell me so. (I work with a Java programmer. Everything looks like a Java problem to her and an XSL problem to me.)
-- Dorothy
Note: Using Saxon B 9.1.0.7. I just made up a set of terms and a bad sentence as an example.
Example of terms (indexTerms.xml):
<?xml version="1.0" encoding="UTF-8"?>
<terms>
<term index1="anxiety">Anxiety</term>
<term index1="children">Children</term>
<term index1="children" index2="illness">Children, illness</term>
<term index1="children" index2="nightmare">Children, nightmare</term>
<term index1="cure" index2="fever">Cure fever</term>
<term index1="cure" index2="illness">Cure illness</term>
<term index1="anxiety" index2="nightmare">Nightmare</term>
<term index1="children" index2="illness">Sick children</term>
<term index1="anxiety" index2="phobia">Worries, phobias and anxiety</term>
</terms>


Example of narrative (sampleTopic.xml):
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
<topic id="sampleTopic">
<title>sampleTopic</title>
<body>
<p>markup for sample terms testing a set of phrases to match to the content of index terms:</p>
<p>Texttexttext text some of the terms are already in &lt;ph&gt; i.e. <ph id="cure_fever">curing fever</ph>, <ph id="children_illness">sick children</ph> and sometime the same terms occur, <i>but different case</i>, not in a ph: Curing fever and <b>Sick children</b>. I need to get all the occurrences of each of the term element strings marked up with &lt;ph&gt; </p>
</body>
</topic>


Desired result:
<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN" "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
<topic id="sampleTopic">
<title>sampleTopic</title>
<body>
<p>markup for sample terms testing a set of phrases to match to the content of index terms:</p>
<p>Texttexttext text some of the terms are already in &lt;ph&gt; i.e. <ph id="cure_fever">curing fever</ph>, <ph id="children_illness">sick children</ph> and sometime the same terms occur, <i>but different case</i>, not in a ph: <ph id="cure_fever">Curing fever</ph> and <b><ph id="children_illness">Sick children</ph></b>. I need to get all the occurrences of each of the term element strings marked up with &lt;ph&gt; </p>
</body>
</topic>


XSL:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
<xsl:param name="indexFile">indexTerms.xml</xsl:param>
<xsl:param name="textFile">sampleTopic.xml</xsl:param>
<xsl:variable name="termsDocument" select="document($indexFile)"></xsl:variable>
<xsl:variable name="textDocument" select="document($textFile)"></xsl:variable>
<xsl:template match="*" name="test1"><xsl:result-document href="matchText-test.xml" method="xml">
<!-- proof that I can get the terms -->
<xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>first term is </xsl:text><xsl:value-of select="$termsDocument/terms/term[1]"/></xsl:comment>
<xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>second term is </xsl:text><xsl:value-of select="$termsDocument/terms/term[2]"/></xsl:comment>
<xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>third term is </xsl:text><xsl:value-of select="$termsDocument/terms/term[3]"/></xsl:comment>
<!-- now how do I find them in the $textDocument elements and mark them up? -->
</xsl:result-document>
</xsl:template>
</xsl:stylesheet>
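
One possible direction for the question in the comment above, as a minimal XSLT 2.0 sketch and not a definitive solution: join all the term strings into a single regular expression and run xsl:analyze-string with flags="i" (case-insensitive) over the text nodes of the narrative. The sketch makes several assumptions that are not stated in the post: the narrative file is supplied as the source document rather than loaded with document(); the ph id is built as index1_index2, which is only inferred from the sample ids; and text that is already inside a ph is left alone. The variable names (terms, escapedTerms, termRegex, hit, term) are made up for illustration.

```xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs" version="2.0">

  <xsl:param name="indexFile">indexTerms.xml</xsl:param>
  <xsl:variable name="terms" select="document($indexFile)/terms/term"/>

  <!-- Term strings, longest first, with regex metacharacters escaped.
       Longest first so "Children, illness" is tried before "Children". -->
  <xsl:variable name="escapedTerms" as="xs:string*">
    <xsl:for-each select="$terms">
      <xsl:sort select="string-length(normalize-space(.))" order="descending"/>
      <xsl:sequence select="replace(normalize-space(.),
          '(\.|\[|\]|\\|\||\-|\^|\$|\?|\*|\+|\{|\}|\(|\))', '\\$1')"/>
    </xsl:for-each>
  </xsl:variable>

  <!-- One alternation of all terms; assumes the terms file is not empty -->
  <xsl:variable name="termRegex" select="string-join($escapedTerms, '|')"/>

  <!-- Identity copy for everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumption: text already wrapped in a ph is copied unchanged -->
  <xsl:template match="ph//text()" priority="2">
    <xsl:copy/>
  </xsl:template>

  <!-- Wrap each case-insensitive match in the body text in a ph -->
  <xsl:template match="body//text()">
    <xsl:analyze-string select="." regex="{$termRegex}" flags="i">
      <xsl:matching-substring>
        <xsl:variable name="hit" select="lower-case(.)"/>
        <!-- Look up the term whose text equals the match, ignoring case -->
        <xsl:variable name="term"
            select="$terms[lower-case(normalize-space(.)) = $hit][1]"/>
        <!-- Assumption: id is index1, plus _index2 when index2 exists -->
        <ph id="{string-join(($term/@index1, $term/@index2), '_')}">
          <xsl:value-of select="."/>
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Some limits of this sketch: it only matches the literal term text, so with the sample files I would expect "Sick children" inside the b element to be wrapped, but not "Curing fever", because the term list has "Cure fever". The XPath 2.0 regex dialect has no word-boundary test, so "Children" would also match inside a longer word. A term that is split across inline markup or a line break is not found, because each text node is searched on its own.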

