Subject: Re: [xsl] marking up text when term from other file is found
From: Wolfgang Laun <wolfgang.laun@xxxxxxxxx>
Date: Thu, 22 Apr 2010 13:54:42 +0200

Two comments and two questions.

C1: The pattern containing all terms can be constructed once, in a
global variable, rather than rebuilt every time the template that runs
analyze-string is invoked.

C2: The flags attribute of analyze-string should be used to do a
case-insensitive match: flags='i'

Q1: XSLT 2.0 regular expressions don't have the zero-length assertion \b
available to match a word boundary. This may result in unexpected
matches (a term matching inside a longer word). With analyze-string it
is not possible to apply the usual trick of adding an extra character
before and after the string. So how can an exact match be done here?

Q2: If the index or document is big, it might be faster to have an
xsl:key on the indexTerms. Is it possible to construct such a key with
the matching string being the original <term/> content *in lowercase*?
Can it be done by constructing a temporary tree and applying xsl:key to
that?

-W

On Thu, Apr 22, 2010 at 8:21 AM, Mukul Gandhi <gandhi.mukul@xxxxxxxxx> wrote:
>
> I would try to solve this as follows:
>
> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                 version="2.0">
>
>   <xsl:output method="xml" indent="yes" />
>
>   <xsl:variable name="index-terms" select="document('indexTerms.xml')" />
>
>   <xsl:template match="node() | @*">
>     <xsl:copy>
>       <xsl:apply-templates select="node() | @*" />
>     </xsl:copy>
>   </xsl:template>
>
>   <xsl:template match="text()" priority="10">
>     <xsl:analyze-string select="."
>         regex="{string-join(for $term in $index-terms/terms/term
>                 return concat('(', $term, ')'), '|')}">
>       <xsl:matching-substring>
>         <xsl:variable name="idVal"
>             select="string-join(for $attrVal in
>                     $index-terms/terms/term[. = regex-group(0)]/@*[starts-with(name(),'index')]
>                     return $attrVal, '_')" />
>         <ph id="{$idVal}">
>           <xsl:value-of select="." />
>         </ph>
>       </xsl:matching-substring>
>       <xsl:non-matching-substring>
>         <xsl:value-of select="." />
>       </xsl:non-matching-substring>
>     </xsl:analyze-string>
>   </xsl:template>
>
> </xsl:stylesheet>
>
> You may adapt this to suit your requirements if needed.
>
> On Thu, Apr 22, 2010 at 8:38 AM, Hoskins & Gretton
> <hoskgret@xxxxxxxxxxxxxxxx> wrote:
> >
> > Hi, I need help finding resources (examples and/or XSL) for this situation,
> > for which I haven't found quite the right recipe in my searches of the list
> > archives.
> > Given an XML file containing a list of terms and another file containing a
> > mix of elements containing text (narrative content, some inline markup for
> > emphasis and footnotes), I was asked if I could find occurrences of each
> > term wherever it appeared in the narrative content, and wrap each occurrence
> > with a tag. So my first thought is to load up each document into a variable.
> > But then I don't know what the most effective method of string comparison
> > would be, given that the narrative document might have the term's words with
> > different capitalization. If anyone can point me in the right direction, I'd
> > appreciate it. Also I would like to know if there is a practical limit to
> > how large a narrative file I can run with about 150 terms to find in the
> > text. And if a different approach would work better, such as writing Java
> > to do brute force search and replace, please tell me so. (I work with a
> > Java programmer. Everything looks like a Java problem to her and an XSL
> > problem to me.)
> > -- Dorothy
> > Note: Using Saxon B 9.1.0.7. I just made up a set of terms and a bad
> > sentence as an example.
> >
> > Example of terms (indexTerms.xml):
> > <?xml version="1.0" encoding="UTF-8"?>
> > <terms>
> >   <term index1="anxiety">Anxiety</term>
> >   <term index1="children">Children</term>
> >   <term index1="children" index2="illness">Children, illness</term>
> >   <term index1="children" index2="nightmare">Children, nightmare</term>
> >   <term index1="cure" index2="fever">Cure fever</term>
> >   <term index1="cure" index2="illness">Cure illness</term>
> >   <term index1="anxiety" index2="nightmare">Nightmare</term>
> >   <term index1="children" index2="illness">Sick children</term>
> >   <term index1="anxiety" index2="phobia">Worries, phobias and anxiety</term>
> > </terms>
> >
> > Example of narrative (sampleTopic.xml):
> > <?xml version='1.0' encoding='UTF-8'?>
> > <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
> > "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
> > <topic id="sampleTopic">
> > <title>sampleTopic</title>
> > <body>
> > <p>markup for sample terms testing a set of phrases to match to the
> > content of index terms:</p>
> > <p>Texttexttext text some of the terms are already in <ph> i.e. <ph
> > id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
> > children</ph> and sometime the same terms occur, <i>but different case</i>,
> > not in a ph: Curing fever and <b>Sick children</b>. I need to get all the
> > occurrences of each of the term element strings marked up with <ph>
> > </p>
> > </body>
> > </topic>
> >
> > Desired result:
> > <?xml version='1.0' encoding='UTF-8'?>
> > <!DOCTYPE topic PUBLIC "-//OASIS//DTD DITA Topic//EN"
> > "http://docs.oasis-open.org/dita/v1.1/OS/dtd/topic.dtd">
> > <topic id="sampleTopic">
> > <title>sampleTopic</title>
> > <body>
> > <p>markup for sample terms testing a set of phrases to match to the
> > content of index terms:</p>
> > <p>Texttexttext text some of the terms are already in <ph> i.e. <ph
> > id="cure_fever">curing fever</ph>, <ph id="children_illness">sick
> > children</ph> and sometime the same terms occur, <i>but different case</i>,
> > not in a ph: <ph id="cure_fever">Curing fever</ph> and <b><ph
> > id="children_illness">Sick children</ph></b>. I need to get all the
> > occurrences of each of the term element strings marked up with <ph>
> > </p>
> > </body>
> > </topic>
> >
> > XSL:
> > <?xml version="1.0" encoding="UTF-8"?>
> > <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
> > version="2.0">
> > <xsl:param name="indexFile">indexTerms.xml</xsl:param>
> > <xsl:param name="textFile">sampleTopic.xml</xsl:param>
> > <xsl:variable name="termsDocument"
> > select="document($indexFile)"></xsl:variable>
> > <xsl:variable name="textDocument"
> > select="document($textFile)"></xsl:variable>
> > <xsl:template match="*" name="test1"><xsl:result-document
> > href="matchText-test.xml" method="xml">
> > <!-- proof that I can get the terms -->
> > <xsl:text> </xsl:text><xsl:comment><xsl:text>first term is
> > </xsl:text><xsl:value-of
> > select="$termsDocument/terms/term[1]"/></xsl:comment>
> > <xsl:text> </xsl:text><xsl:comment><xsl:text>second term is
> > </xsl:text><xsl:value-of
> > select="$termsDocument/terms/term[2]"/></xsl:comment>
> > <xsl:text> </xsl:text><xsl:comment><xsl:text>third term is
> > </xsl:text><xsl:value-of
> > select="$termsDocument/terms/term[3]"/></xsl:comment>
> > <!-- now how do I find them in the $textDocument elements and mark them up?
> > -->
> > </xsl:result-document>
> > </xsl:template>
> > </xsl:stylesheet>
> >
>
> --
> Regards,
> Mukul Gandhi
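
Comments C1 and C2 and question Q2 in the reply at the top can be sketched roughly as follows. This is a minimal, untested sketch, assuming an XSLT 2.0 processor and the indexTerms.xml layout from the original post. The names term-regex and term-by-lc, and the step that escapes regex metacharacters in each term, are illustrative assumptions and not part of the thread. It does not address the word-boundary problem raised in Q1.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs"
                version="2.0">

  <xsl:output method="xml"/>

  <xsl:variable name="index-terms" select="document('indexTerms.xml')"/>

  <!-- Q2 (assumed approach): key the terms on their lower-cased content -->
  <xsl:key name="term-by-lc" match="term" use="lower-case(.)"/>

  <!-- C1: build the alternation once, as a global variable.
       Assumption: metacharacters in a term are escaped so it matches literally. -->
  <xsl:variable name="term-regex" as="xs:string"
      select="string-join(
                for $t in $index-terms/terms/term
                return concat('(',
                              replace($t, '([.\\?*+{}()\[\]|^$\-])', '\\$1'),
                              ')'),
                '|')"/>

  <!-- identity template, as in the quoted stylesheet -->
  <xsl:template match="node() | @*">
    <xsl:copy>
      <xsl:apply-templates select="node() | @*"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" priority="10">
    <!-- C2: flags="i" makes the match case-insensitive -->
    <xsl:analyze-string select="." regex="{$term-regex}" flags="i">
      <xsl:matching-substring>
        <!-- look up the term whose lower-cased content equals the match;
             the third argument tells key() to search the terms document -->
        <xsl:variable name="term"
            select="key('term-by-lc', lower-case(.), $index-terms)[1]"/>
        <ph id="{string-join($term/@*[starts-with(name(), 'index')], '_')}">
          <xsl:value-of select="."/>
        </ph>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With flags="i" the matched text can differ in case from the stored term, so the sketch compares lower-cased strings instead of testing the term against regex-group(0) directly. The order of the index attributes joined into the id is not guaranteed by the processor, as in the quoted stylesheet.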