Re: [xsl] marking up text when term from other file is found

From: Mukul Gandhi <gandhi.mukul@xxxxxxxxx>
Date: Thu, 22 Apr 2010 11:51:10 +0530
I would try to solve this as follows:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="xml" indent="yes" />

  <xsl:variable name="index-terms" select="document('indexTerms.xml')" />

  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
  </xsl:template>

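  <!-- Text already inside a <ph> is skipped so existing markup is not nested. -->
  <xsl:template match="text()[not(ancestor::ph)]" priority="10">
    <!-- One alternation built from all terms; flags="i" ignores case. -->
    <xsl:analyze-string select="."
                        regex="{string-join(for $term in $index-terms/terms/term
                                            return concat('(', $term, ')'), '|')}"
                        flags="i">
      <xsl:matching-substring>
        <!-- Join the index* attributes of the matched term, e.g. children_illness. -->
        <xsl:variable name="idVal"
                      select="string-join(for $attrVal in
                                $index-terms/terms/term[lower-case(.) = lower-case(regex-group(0))]
                                  /@*[starts-with(name(), 'index')]
                              return $attrVal, '_')" />
        <ph id="{$idVal}">
          <xsl:value-of select="." />
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="." />
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>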

</xsl:stylesheet>

You may adapt this to suit your requirements if needed.
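
One possible refinement, offered only as a sketch: terms taken straight
from a file can contain characters that are special in a regular
expression, and a short term such as "Children" can win over a longer one
such as "Children, illness" if it comes first in the alternation. The
variant below builds the regex in a global variable, escapes
metacharacters, and sorts the terms longest first. The variable names
sorted-terms, term-regex, hit and term are made up for this illustration,
and the sketch assumes indexTerms.xml contains at least one non-empty term.

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:variable name="index-terms" select="document('indexTerms.xml')" />

  <!-- Assumed helper: terms sorted longest first, so that
       "Children, illness" is tried before "Children". -->
  <xsl:variable name="sorted-terms" as="element(term)*">
    <xsl:perform-sort select="$index-terms/terms/term">
      <xsl:sort select="string-length(.)" order="descending" />
    </xsl:perform-sort>
  </xsl:variable>

  <!-- Assumed helper: backslash-escape regex metacharacters in each term,
       then join the terms into one alternation. -->
  <xsl:variable name="term-regex"
                select="string-join(
                          for $t in $sorted-terms
                          return replace($t, '([\.\\\|\?\*\+\(\)\{\}\[\]\^\$])', '\\$1'),
                          '|')" />

  <!-- Identity template: copy everything that is not handled below. -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
  </xsl:template>

  <!-- Text that is already inside a <ph> is left alone. -->
  <xsl:template match="text()[not(ancestor::ph)]" priority="10">
    <xsl:analyze-string select="." regex="{$term-regex}" flags="i">
      <xsl:matching-substring>
        <!-- Look the match up again, ignoring case, to find its attributes. -->
        <xsl:variable name="hit" select="lower-case(.)" />
        <xsl:variable name="term"
                      select="$sorted-terms[lower-case(.) = $hit][1]" />
        <ph id="{string-join($term/@*[starts-with(name(), 'index')], '_')}">
          <xsl:value-of select="." />
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="." />
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Keeping the regex in a variable also avoids having to think about curly
braces inside an attribute value template. Note that this only finds the
literal term strings: "Curing fever" in the sample text will not match the
term "Cure fever", so inflected forms would need extra entries in the
terms file or a looser pattern.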

On Thu, Apr 22, 2010 at 8:38 AM, Hoskins & Gretton
<hoskgret@xxxxxxxxxxxxxxxx> wrote:
>
> Hi, I need help finding resources (examples and/or XSL) for this situation,
> for which I haven't found quite the right recipe in my searches of the list
> archives.
> Given an XML file containing a list of terms and another file containing a
> mix of elements containing text (narrative content, some inline markup for
> emphasis and footnotes), I was asked if I could find occurrences of each
> term wherever it appeared in the narrative content, and wrap each occurrence
> with a tag. So my first thought is to load up each document into a variable.
> But then I don't know what the most effective method of string comparison
> would be, given that the narrative document might have the term's words with
> different capitalization. If anyone can point me in the right direction, I'd
> appreciate it. Also I would like to know if there is a practical limit to
> how large a narrative file I can run with about 150 terms to find in the
> text. And if a different approach would work better, such as writing Java
> to do brute force search and replace, please tell me so. (I work with a
> Java programmer. Everything looks like a Java problem to her and an XSL
> problem to me.)
> -- Dorothy
> Note: Using Saxon B 9.1.0.7. I just made up a set of terms and a bad
> sentence as an example.
> Example of terms (indexTerms.xml):
> <?xml version="1.0" encoding="UTF-8"?>
> <terms>
>   <term index1="anxiety">Anxiety</term>
>   <term index1="children">Children</term>
>   <term index1="children" index2="illness">Children, illness</term>
>   <term index1="children" index2="nightmare">Children, nightmare</term>
>   <term index1="cure" index2="fever">Cure fever</term>
>   <term index1="cure" index2="illness">Cure illness</term>
>   <term index1="anxiety" index2="nightmare">Nightmare</term>
>   <term index1="children" index2="illness">Sick children</term>
>   <term index1="anxiety" index2="phobia">Worries, phobias and anxiety</term>
> </terms>
>
> Example of narrative (sampleTopic.xml):
> <?xml version='1.0' encoding='UTF-8'?>
> <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
> "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
> <topic id="sampleTopic">
>   <title>sampleTopic</title>
>   <body>
>     <p>markup for sample terms testing a set of phrases to match to the
> content of index terms:</p>
>     <p>Texttexttext text some of the terms are already in &lt;ph&gt; i.e. <ph
> id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
> children</ph> and sometime the same terms occur, <i>but different case</i>,
> not in a ph: Curing fever and <b>Sick children</b>. I need to get all the
> occurrences of each of the term element strings marked up with &lt;ph&gt;
> </p>
>   </body>
> </topic>
>
> Desired result:
> <?xml version='1.0' encoding='UTF-8'?>
> <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
> "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
> <topic id="sampleTopic">
>   <title>sampleTopic</title>
>   <body>
>     <p>markup for sample terms testing a set of phrases to match to the
> content of index terms:</p>
>     <p>Texttexttext text some of the terms are already in &lt;ph&gt; i.e. <ph
> id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
> children</ph> and sometime the same terms occur, <i>but different case</i>,
> not in a ph: <ph id="cure_fever">Curing fever</ph> and <b><ph
> id="children_illness">Sick children</ph></b>. I need to get all the
> occurrences of each of the term element strings marked up with &lt;ph&gt;
> </p>
>   </body>
> </topic>
>
> XSL:
> <?xml version="1.0" encoding="UTF-8"?>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                 version="2.0">
>   <xsl:param name="indexFile">indexTerms.xml</xsl:param>
>   <xsl:param name="textFile">sampleTopic.xml</xsl:param>
>   <xsl:variable name="termsDocument"
>                 select="document($indexFile)"></xsl:variable>
>   <xsl:variable name="textDocument"
>                 select="document($textFile)"></xsl:variable>
>   <xsl:template match="*" name="test1">
>     <xsl:result-document href="matchText-test.xml" method="xml">
>       <!-- proof that I can get the terms -->
>       <xsl:text>&#10;</xsl:text>
>       <xsl:comment><xsl:text>first term is </xsl:text><xsl:value-of
>         select="$termsDocument/terms/term[1]"/></xsl:comment>
>       <xsl:text>&#10;</xsl:text>
>       <xsl:comment><xsl:text>second term is </xsl:text><xsl:value-of
>         select="$termsDocument/terms/term[2]"/></xsl:comment>
>       <xsl:text>&#10;</xsl:text>
>       <xsl:comment><xsl:text>third term is </xsl:text><xsl:value-of
>         select="$termsDocument/terms/term[3]"/></xsl:comment>
>       <!-- now how do I find them in the $textDocument elements and mark
>            them up? -->
>     </xsl:result-document>
>   </xsl:template>
> </xsl:stylesheet>
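
One detail about the desired result quoted above: it keeps the DITA
DOCTYPE declaration. An identity transform does not copy the DOCTYPE,
because it is not part of the tree the stylesheet sees. Assuming the
public and system identifiers from the sample topic are the ones wanted in
the output, a minimal sketch that writes the declaration back via
xsl:output could look like this:

```xslt
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <!-- Identifiers copied from the sample topic; adjust them if the
       output should reference a different DTD. -->
  <xsl:output method="xml"
              doctype-public="-//OASIS//DTD DITA Topic//EN"
              doctype-system="http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd" />

  <!-- Identity template: copy the input unchanged. -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*" />
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

The same two attributes can be added to the xsl:output element of any
stylesheet that produces the marked-up topic.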



--
Regards,
Mukul Gandhi
