[xsl] Washington method

Subject: [xsl] Washington method
From: Dave Pawson <davep@xxxxxxxxxxxxx>
Date: Sun, 10 Apr 2011 09:45:44 +0100
Was: Processing two documents. Which order?

Finally got my tiny mind round this one, and I believe it is worth
spending some time explaining it.

Problem: Some text, preferably in XML, in which certain words are
required to be marked up as XML elements in the output.

The approach:
An external file contains the word list, as XML.
The main input file contains the text needing marking up.

The 'word list' looks something like

<x>
<word>target:word</word>
...
</x>

The xslt contains the following

<xsl:key name="words" match="word" use="."/>

Options:
I wanted simply to do the markup, with no more processing, hence
the stylesheet had

<xsl:template match="node()">
 <xsl:copy>
  <xsl:copy-of select="@*"/>
  <xsl:apply-templates/>
 </xsl:copy>
</xsl:template>

If you want other processing then add templates as needed.

The work is done in this template

<xsl:template match="text()[not(parent::a or
                     parent::b or
                     parent::c)]" priority="2">
 <!-- each regex hit is only a candidate word -->
 <xsl:analyze-string select="." regex="[a-z][a-z\-:.]+">
  <xsl:matching-substring>
   <xsl:choose>
    <!-- mark up only candidates found in the external word list;
         the key name must agree with the xsl:key declaration -->
    <xsl:when test="key('words', ., doc('../props.xml'))">
     <tag>
      <xsl:value-of select="."/>
     </tag>
    </xsl:when>
    <xsl:otherwise>
     <xsl:value-of select="."/>
    </xsl:otherwise>
   </xsl:choose>
  </xsl:matching-substring>
  <xsl:non-matching-substring>
   <xsl:value-of select="."/>
  </xsl:non-matching-substring>
 </xsl:analyze-string>
</xsl:template>

1. The regex should match on any character group that *may* contain
one of the wanted words. I had to include - : and . since
the text contained those characters.

2. The 'tag' element is used to mark up matches.
A candidate match occurs when the regex makes a hit, in the
matching-substring element.
A further selection is then made by looking the candidate up in the
key (from the external document). Only then does markup happen.

3. I required not to mark up text in some elements, hence the filtering
not(parent::a or
    parent::b or
    parent::c)
which excludes the text that is a direct child of any of these elements.
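
To see the pieces working together, here is a minimal sketch of a
complete stylesheet. It assumes an XSLT 2.0 processor and a word list
at ../props.xml relative to the stylesheet; a, b, c and tag are just
the placeholder element names used in this post.

<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

 <!-- index the word list entries by their string value -->
 <xsl:key name="words" match="word" use="."/>

 <!-- copy everything else through unchanged -->
 <xsl:template match="node()">
  <xsl:copy>
   <xsl:copy-of select="@*"/>
   <xsl:apply-templates/>
  </xsl:copy>
 </xsl:template>

 <!-- mark up listed words, except directly inside a, b or c -->
 <xsl:template match="text()[not(parent::a or parent::b or parent::c)]"
               priority="2">
  <xsl:analyze-string select="." regex="[a-z][a-z\-:.]+">
   <xsl:matching-substring>
    <xsl:choose>
     <xsl:when test="key('words', ., doc('../props.xml'))">
      <tag><xsl:value-of select="."/></tag>
     </xsl:when>
     <xsl:otherwise>
      <xsl:value-of select="."/>
     </xsl:otherwise>
    </xsl:choose>
   </xsl:matching-substring>
   <xsl:non-matching-substring>
    <xsl:value-of select="."/>
   </xsl:non-matching-substring>
  </xsl:analyze-string>
 </xsl:template>

</xsl:stylesheet>

With a word list containing <word>target:word</word>, a made-up input
such as

<doc>
 <p>Use target:word in the template.</p>
 <a>target:word stays plain here</a>
</doc>

should come out (XML declaration aside) as

<doc>
 <p>Use <tag>target:word</tag> in the template.</p>
 <a>target:word stays plain here</a>
</doc>

Note that because the regex allows '.', a word followed by a full stop
(here 'template.') is looked up with the dot attached, so it would not
match a plain 'template' entry in the word list.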

In hindsight, the method does not use character class subtraction at
all. It was just the escaping of the hyphen (needed since I had to
match on word-nextword) that confused me.
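
For anyone tripped up by the same thing, a small sketch of the two
notations as I understand the XPath 2.0 regex syntax. The test strings
are made up for illustration, and the instructions can go inside any
template.

<!-- Escaped hyphen: '-' is a literal character allowed inside a word. -->
<xsl:value-of select="matches('word-nextword', '^[a-z][a-z\-:.]+$')"/>

<!-- Character class subtraction: lower-case letters minus the vowels. -->
<xsl:value-of select="matches('xyz', '^[a-z-[aeiou]]+$')"/>

Both should give true. The second pattern would reject 'abc' because
of the vowel.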

For the case I had, repeated matching against a parameter took 15 minutes.
Using this method, 4 seconds.


In retrospect, it is a valuable addition to any toolkit IMHO.

Washington method? From David Carlisle of course :-)
Thanks David.







-- 

regards 

-- 
Dave Pawson
XSLT XSL-FO FAQ.
http://www.dpawson.co.uk
