Re: [xsl] marking up text when term from other file is found

Subject: Re: [xsl] marking up text when term from other file is found
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Thu, 22 Apr 2010 13:54:42 +0200
Two comments and two questions.

C1:  The pattern containing all terms can be constructed once and not
repeatedly within the template doing the analyze-string.

C2:  The flags attribute of analyze-string should be used to do a
case-insensitive match: flags='i'
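
A minimal sketch of C1 and C2 together, adapted from the stylesheet quoted
below. The variable names term-pattern and matched are made up for
illustration, and the sketch assumes that no term contains regex
metacharacters:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="xml" indent="yes"/>

  <xsl:variable name="index-terms" select="document('indexTerms.xml')"/>

  <!-- C1: the alternation of all terms is a global variable, so it is
       built once rather than for every text node. -->
  <xsl:variable name="term-pattern"
                select="string-join($index-terms/terms/term/concat('(', ., ')'), '|')"/>

  <!-- Identity template: copy everything not handled elsewhere. -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" priority="10">
    <!-- C2: flags="i" makes the match case-insensitive. -->
    <xsl:analyze-string select="." regex="{$term-pattern}" flags="i">
      <xsl:matching-substring>
        <!-- Compare in lowercase, because the matched text may differ in
             case from the content of the term element. -->
        <xsl:variable name="matched" select="lower-case(.)"/>
        <ph id="{string-join($index-terms/terms/term[lower-case(.) = $matched]/@*[starts-with(name(), 'index')], '_')}">
          <xsl:value-of select="."/>
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Once flags='i' is in place, the lookup of the matching <term/> has to
ignore case as well, hence the lower-case() calls on both sides of the
comparison.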


Q1:  The regular expression dialect used by XSLT 2.0 doesn't provide the
zero-length assertion \b to match a word boundary, so a term may also
match inside a longer word. With analyze-string it is not possible to
apply the usual trick of adding an extra character before and after the
string. So how can an exact match be done here?
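
One partial workaround, sketched here and not a full answer: capture the
delimiters explicitly and write them back. The variable name
bounded-pattern is made up for illustration, and the sketch again assumes
terms without regex metacharacters:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:variable name="index-terms" select="document('indexTerms.xml')"/>

  <!-- Group 1: start of string or one non-word character.
       Group 2: any one of the terms.
       Group 3: one non-word character or end of string. -->
  <xsl:variable name="bounded-pattern"
                select="concat('(^|\W)(',
                               string-join($index-terms/terms/term, '|'),
                               ')(\W|$)')"/>

  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" priority="10">
    <xsl:analyze-string select="." regex="{$bounded-pattern}" flags="i">
      <xsl:matching-substring>
        <!-- Write the captured delimiters back around the marked-up term. -->
        <xsl:value-of select="regex-group(1)"/>
        <ph><xsl:value-of select="regex-group(2)"/></ph>
        <xsl:value-of select="regex-group(3)"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

The limitation is that the trailing delimiter is consumed by the match, so
of two terms separated by a single character only the first is found. The
check is also per text node, so a term at the very start or end of a text
node counts as bounded whatever the neighbouring markup contains.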

Q2: If the index or document is big, it might be faster to define an
xsl:key on the index terms. Is it possible to construct such a key with
the matching string being the original <term/> content *in lowercase*?
Can it be done by constructing a temporary tree and applying xsl:key
to that?
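
A sketch of how this might look in XSLT 2.0, where the use attribute of
xsl:key is an expression and key() accepts a third argument naming the
tree to search, so no temporary tree should be needed. The key name
term-by-lc and the variable names are made up for illustration:

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:variable name="index-terms" select="document('indexTerms.xml')"/>

  <!-- Index every term element by its content in lowercase. -->
  <xsl:key name="term-by-lc" match="term" use="lower-case(.)"/>

  <xsl:variable name="term-pattern"
                select="string-join($index-terms/terms/term, '|')"/>

  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" priority="10">
    <xsl:analyze-string select="." regex="{$term-pattern}" flags="i">
      <xsl:matching-substring>
        <!-- The third argument makes key() search the terms document;
             inside analyze-string the context item is a string, not a node. -->
        <xsl:variable name="hit"
                      select="key('term-by-lc', lower-case(.), $index-terms)"/>
        <ph id="{string-join($hit[1]/@*[starts-with(name(), 'index')], '_')}">
          <xsl:value-of select="."/>
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Whether the key is actually faster than the predicate for about 150 terms
would have to be measured.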

-W


On Thu, Apr 22, 2010 at 8:21 AM, Mukul Gandhi <gandhi.mukul@xxxxxxxxx> wrote:
>
> I would try to solve this as, following:
>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                       version="2.0">
>
>  <xsl:output method="xml" indent="yes" />
>
>  <xsl:variable name="index-terms" select="document('indexTerms.xml')" />
>
>  <xsl:template match="node() | @*">
>    <xsl:copy>
>          <xsl:apply-templates select="node() | @*" />
>        </xsl:copy>
>  </xsl:template>
>
>  <xsl:template match="text()" priority="10">
>         <xsl:analyze-string select="."
>                             regex="{string-join(for $term in $index-terms/terms/term return concat('(', $term, ')'), '|')}">
>            <xsl:matching-substring>
>                 <xsl:variable name="idVal"
>                               select="string-join(for $attrVal in $index-terms/terms/term[. = regex-group(0)]/@*[starts-with(name(),'index')] return $attrVal, '_')" />
>                 <ph id="{$idVal}">
>                     <xsl:value-of select="." />
>                 </ph>
>           </xsl:matching-substring>
>           <xsl:non-matching-substring>
>               <xsl:value-of select="." />
>           </xsl:non-matching-substring>
>         </xsl:analyze-string>
>  </xsl:template>
>
> </xsl:stylesheet>
>
> You may adapt this, to suit your requirements if needed.
>
> On Thu, Apr 22, 2010 at 8:38 AM, Hoskins & Gretton
> <hoskgret@xxxxxxxxxxxxxxxx> wrote:
> >
> > Hi, I need help finding resources (examples and/or XSL) for this
> > situation, for which I haven't found quite the right recipe in my
> > searches of the list archives.
> > Given an XML file containing a list of terms and another file
> > containing a mix of elements containing text (narrative content, some
> > inline markup for emphasis and footnotes), I was asked if I could find
> > occurrences of each term wherever it appeared in the narrative
> > content, and wrap each occurrence with a tag. So my first thought is
> > to load up each document into a variable. But then I don't know what
> > the most effective method of string comparison would be, given that
> > the narrative document might have the term's words with different
> > capitalization. If anyone can point me in the right direction, I'd
> > appreciate it. Also I would like to know if there is a practical limit
> > to how large a narrative file I can run with about 150 terms to find
> > in the text. And if a different approach would work better, such as
> > writing Java to do brute force search and replace, please tell me so.
> > (I work with a Java programmer. Everything looks like a Java problem
> > to her and an XSL problem to me.)
> > -- Dorothy
> > Note: Using Saxon B 9.1.0.7. I just made up a set of terms and a bad
> > sentence as an example.
> > Example of terms (indexTerms.xml):
> > <?xml version="1.0" encoding="UTF-8"?>
> > <terms>
> >   <term index1="anxiety">Anxiety</term>
> >   <term index1="children">Children</term>
> >   <term index1="children" index2="illness">Children, illness</term>
> >   <term index1="children" index2="nightmare">Children, nightmare</term>
> >   <term index1="cure" index2="fever">Cure fever</term>
> >   <term index1="cure" index2="illness">Cure illness</term>
> >   <term index1="anxiety" index2="nightmare">Nightmare</term>
> >   <term index1="children" index2="illness">Sick children</term>
> >   <term index1="anxiety" index2="phobia">Worries, phobias and anxiety</term>
> > </terms>
> >
> > Example of narrative (sampleTopic.xml):
> > <?xml version='1.0' encoding='UTF-8'?>
> > <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
> > "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
> > <topic id="sampleTopic">
> >  <title>sampleTopic</title>
> >  <body>
> >    <p>markup for sample terms testing a set of phrases to match to the
> > content of index terms:</p>
> >    <p>Texttexttext text some of the terms are already in &lt;ph&gt; i.e.
> > <ph id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
> > children</ph> and sometime the same terms occur, <i>but different
> > case</i>, not in a ph: Curing fever and <b>Sick children</b>. I need to
> > get all the occurrences of each of the term element strings marked up
> > with &lt;ph&gt;
> > </p>
> >  </body>
> > </topic>
> >
> > Desired result:
> > <?xml version='1.0' encoding='UTF-8'?>
> > <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
> > "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
> > <topic id="sampleTopic">
> >  <title>sampleTopic</title>
> >  <body>
> >    <p>markup for sample terms testing a set of phrases to match to the
> > content of index terms:</p>
> >    <p>Texttexttext text some of the terms are already in &lt;ph&gt; i.e.
> > <ph id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
> > children</ph> and sometime the same terms occur, <i>but different
> > case</i>, not in a ph: <ph id="cure_fever">Curing fever</ph> and <b><ph
> > id="children_illness">Sick children</ph></b>. I need to get all the
> > occurrences of each of the term element strings marked up with &lt;ph&gt;
> > </p>
> >  </body>
> > </topic>
> >
> > XSL:
> > <?xml version="1.0" encoding="UTF-8"?>
> > <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> > version="2.0">
> > <xsl:param name="indexFile">indexTerms.xml</xsl:param>
> > <xsl:param name="textFile">sampleTopic.xml</xsl:param>
> > <xsl:variable name="termsDocument"
> > select="document($indexFile)"></xsl:variable>
> > <xsl:variable name="textDocument"
> > select="document($textFile)"></xsl:variable>
> > <xsl:template match="*" name="test1"><xsl:result-document
> > href="matchText-test.xml" method="xml">
> > <!-- proof that I can get the terms -->
> > <xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>first term is
> > </xsl:text><xsl:value-of
> > select="$termsDocument/terms/term[1]"/></xsl:comment>
> > <xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>second term is
> > </xsl:text><xsl:value-of
> > select="$termsDocument/terms/term[2]"/></xsl:comment>
> > <xsl:text>&#10;</xsl:text><xsl:comment><xsl:text>third term is
> > </xsl:text><xsl:value-of
> > select="$termsDocument/terms/term[3]"/></xsl:comment>
> > <!-- now how do I find them in the $textDocument elements and mark them up? -->
> > </xsl:result-document>
> > </xsl:template>
> > </xsl:stylesheet>
>
>
>
> --
> Regards,
> Mukul Gandhi
