Re: [xsl] Doubly recursive find/replace from list problem

From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Sun, 11 Jul 2004 11:14:03 +0100

Hi,

> I have an XML journal source, and an XML list of acronyms. I wish to
> automatically replace any occurrence of an acronym within the XML
> source, with the appropriate <acronym title="blah">acronym</acronym>
> tag. It's easy to replace one acronym, using simple XSL recursive
> find/replace, but when I try to do more than one, I hit multiple
> difficulties.

OK. First off, you need to design the parameters for the template. The
things that matter are the list of acronyms that you want to be
replaced and the text in which you want to replace them:

<xsl:template name="replace-acronyms">
  <xsl:param name="acronyms"
    select="document('../xml/acronyms.xml')/acronyms/acronym" />
  <xsl:param name="text" />
  ...
</xsl:template>

The first tests to make are the stopping conditions: if $text is
empty, then the template shouldn't generate anything; if $acronyms is
empty, then the template should just return $text:

  <xsl:choose>
    <xsl:when test="not($acronyms)">
      <xsl:value-of select="$text" />
    </xsl:when>
    <xsl:when test="not(string($text))" />
    <xsl:otherwise>
      ...
    </xsl:otherwise>
  </xsl:choose>

Now that we've confirmed that we actually have some text to process
and some acronyms to replace within it, we'll set about our first
task: to replace the first occurrence in $text of the first acronym in
$acronyms. We'll store that first acronym (the string we want to find)
in a variable called $acronym:

  <xsl:variable name="acronym" select="$acronyms[1]/@acronym" />

Note that I'm assuming the <acronym> elements are of the form:

  <acronym acronym="XML">Extensible Markup Language</acronym>
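
As a sketch of what the whole file might look like under that assumption: the root element name comes from the document() path used above, and only the XML entry is taken from this message; the other entries are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative acronyms.xml; entries other than XML are made up for this sketch -->
<acronyms>
  <acronym acronym="XML">Extensible Markup Language</acronym>
  <acronym acronym="HTML">HyperText Markup Language</acronym>
  <acronym acronym="XSLT">Extensible Stylesheet Language Transformations</acronym>
</acronyms>
```

Because the template looks for the acronyms in list order using plain substring matching, an acronym that contains another one (say XSLT and XSL) should probably be listed before the shorter one, otherwise the shorter one is matched inside the longer one.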

What we do then depends on whether $acronym appears in $text or not:

  <xsl:choose>
    <xsl:when test="contains($text, $acronym)">
      ...
    </xsl:when>
    <xsl:otherwise>
      ...
    </xsl:otherwise>
  </xsl:choose>

If $acronym *doesn't* appear in $text, then we want to call the
template again on the unadjusted text, with $acronyms this time set to
the *rest* of the acronyms (all but the first):

  <xsl:otherwise>
    <xsl:call-template name="replace-acronyms">
      <xsl:with-param name="text" select="$text" />
      <xsl:with-param name="acronyms"
                      select="$acronyms[position() > 1]" />
    </xsl:call-template>
  </xsl:otherwise>

If $acronym *does* appear in $text, then we need to break the text into
two parts: the part before $acronym and the part after $acronym:

  <xsl:variable name="before"
                select="substring-before($text, $acronym)" />
  <xsl:variable name="after"
                select="substring-after($text, $acronym)" />
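
To see that the split happens around the *first* occurrence only, here is a minimal stand-alone stylesheet. The sample string is made up for illustration, and it can be run against any input document.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text" />

  <xsl:template match="/">
    <!-- Hypothetical sample values -->
    <xsl:variable name="text" select="'Use XML and more XML'" />
    <xsl:variable name="acronym" select="'XML'" />

    <!-- Everything before the first 'XML' -->
    <xsl:text>[</xsl:text>
    <xsl:value-of select="substring-before($text, $acronym)" />
    <xsl:text>][</xsl:text>
    <!-- Everything after the first 'XML', including the second 'XML' -->
    <xsl:value-of select="substring-after($text, $acronym)" />
    <xsl:text>]</xsl:text>
    <!-- Output: [Use ][ and more XML] -->
  </xsl:template>
</xsl:stylesheet>
```

The second occurrence stays inside the "after" part, which is why that part has to be processed again with the same acronym.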

Now, we know that $acronym doesn't appear in $before (because $before
is, by definition, the text before the first occurrence of $acronym),
but $before might contain other acronyms. So we need to call the
template on $before with the 'rest' of the acronyms:

  <xsl:call-template name="replace-acronyms">
    <xsl:with-param name="text" select="$before" />
    <xsl:with-param name="acronyms"
                    select="$acronyms[position() > 1]" />
  </xsl:call-template>

Then we need to generate the <acronym> element. The title attribute
needs to hold the string value of the first <acronym> element in
$acronyms (the expansion), and the content of the generated <acronym>
element is the acronym $acronym itself:

  <acronym title="{$acronyms[1]}">
    <xsl:value-of select="$acronym" />
  </acronym>

Then we need to do something with $after. Now, $after could contain
$acronym again, so the recursive call needs to pass *all* the
$acronyms through to the next call:

  <xsl:call-template name="replace-acronyms">
    <xsl:with-param name="text" select="$after" />
    <xsl:with-param name="acronyms" select="$acronyms" />
  </xsl:call-template>

And there we have it. The complete template looks like:

<xsl:template name="replace-acronyms">
  <xsl:param name="acronyms"
    select="document('../xml/acronyms.xml')/acronyms/acronym" />
  <xsl:param name="text" />
  <xsl:choose>
    <xsl:when test="not($acronyms)">
      <xsl:value-of select="$text" />
    </xsl:when>
    <xsl:when test="not(string($text))" />
    <xsl:otherwise>
      <xsl:variable name="acronym" select="$acronyms[1]/@acronym" />
      <xsl:choose>
        <xsl:when test="contains($text, $acronym)">
          <xsl:variable name="before"
                        select="substring-before($text, $acronym)" />
          <xsl:variable name="after"
                        select="substring-after($text, $acronym)" />
          <xsl:call-template name="replace-acronyms">
            <xsl:with-param name="text" select="$before" />
            <xsl:with-param name="acronyms"
                            select="$acronyms[position() > 1]" />
          </xsl:call-template>
          <acronym title="{$acronyms[1]}">
            <xsl:value-of select="$acronym" />
          </acronym>
          <xsl:call-template name="replace-acronyms">
            <xsl:with-param name="text" select="$after" />
            <xsl:with-param name="acronyms" select="$acronyms" />
          </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
          <xsl:call-template name="replace-acronyms">
            <xsl:with-param name="text" select="$text" />
            <xsl:with-param name="acronyms"
                            select="$acronyms[position() > 1]" />
          </xsl:call-template>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>
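
One way to apply the template to a whole document is to combine it with an identity transform and call it for every text node. The following is only a sketch: it assumes the template above has been saved in a file called replace-acronyms.xsl (a hypothetical name), and that every text node in the source should be processed.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical file holding the replace-acronyms template -->
  <xsl:include href="replace-acronyms.xsl" />

  <!-- Identity template: copy elements, attributes, comments etc. as they are -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

  <!-- Text nodes go through replace-acronyms instead; the explicit
       priority avoids a conflict with the identity template -->
  <xsl:template match="text()" priority="1">
    <xsl:call-template name="replace-acronyms">
      <xsl:with-param name="text" select="." />
    </xsl:call-template>
  </xsl:template>

  <!-- With a list that holds XML ("Extensible Markup Language") followed
       by HTML ("HyperText Markup Language"), the input
         <p>Convert XML to HTML, then validate the XML.</p>
       should come out as
         <p>Convert <acronym title="Extensible Markup Language">XML</acronym>
         to <acronym title="HyperText Markup Language">HTML</acronym>, then
         validate the <acronym title="Extensible Markup Language">XML</acronym>.</p>
       (line breaks added here for readability) -->
</xsl:stylesheet>
```

Note that the relative path in document() is resolved against the stylesheet module that contains the call, so ../xml/acronyms.xml is looked up relative to the included file.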

So, the key points here are:

 1. work through the acronyms using recursion rather than iteration
 2. recurse on the before and after portions of the text
 3. treat generated XML as XML rather than as a string (so don't use
    disable-output-escaping to create it)
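
To illustrate the third point, this sketch puts the two approaches side by side. The variable values are hypothetical sample data, and the stylesheet can be run against any input document.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical sample values for this sketch -->
  <xsl:variable name="acronym" select="'XML'" />
  <xsl:variable name="title" select="'Extensible Markup Language'" />

  <xsl:template match="/">
    <result>
      <!-- Fragile: the tag is assembled as a string and only turns into
           markup if the serializer honours disable-output-escaping -->
      <xsl:value-of disable-output-escaping="yes"
        select="concat('&lt;acronym title=&quot;', $title, '&quot;&gt;',
                       $acronym, '&lt;/acronym&gt;')" />

      <!-- Robust: a literal result element creates a real element node -->
      <acronym title="{$title}">
        <xsl:value-of select="$acronym" />
      </acronym>
    </result>
  </xsl:template>
</xsl:stylesheet>
```

Both may look the same once serialized, but the first is still just text in the result tree: it breaks if $title contains a quote or an ampersand, and it is not an element if the result is processed further rather than written straight to a file.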

To complete this email, I'll just mention that in XSLT 2.0, you can
use <xsl:analyze-string> to do this. Something along the lines of:

  <xsl:variable name="acronyms" as="element(acronym)+"
    select="document('../xml/acronyms.xml')/acronyms/acronym" />
  <xsl:variable name="acronym-regex" as="xs:string"
    select="string-join($acronyms/@acronym, '|')" />
  <xsl:analyze-string select="$text" regex="{$acronym-regex}">
    <xsl:matching-substring>
      <xsl:variable name="acronym" as="xs:string" select="." />
      <acronym title="{$acronyms[@acronym = $acronym]}">
        <xsl:value-of select="$acronym" />
      </acronym>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
      <xsl:value-of select="." />
    </xsl:non-matching-substring>
  </xsl:analyze-string>
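
As a fuller sketch of the same idea, assuming an XSLT 2.0 processor: acronyms that contain regex metacharacters need escaping before they are joined into a pattern, and since the first matching alternative should win, it seems safer to put longer acronyms first. The $escaped variable and the escaping pattern are additions for this sketch, not part of the snippet above.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                exclude-result-prefixes="xs">

  <xsl:variable name="acronyms" as="element(acronym)+"
    select="document('../xml/acronyms.xml')/acronyms/acronym" />

  <!-- Escape regex metacharacters in each acronym and sort longer
       acronyms first, so that e.g. XSLT is tried before XSL -->
  <xsl:variable name="escaped" as="xs:string+">
    <xsl:for-each select="$acronyms/@acronym">
      <xsl:sort select="string-length(.)" data-type="number"
                order="descending" />
      <xsl:sequence
        select="replace(., '([\.\[\]\\\|\-\^\$\?\*\+\{\}\(\)])', '\\$1')" />
    </xsl:for-each>
  </xsl:variable>
  <xsl:variable name="acronym-regex" as="xs:string"
    select="string-join($escaped, '|')" />

  <!-- Identity template -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" />
    </xsl:copy>
  </xsl:template>

  <!-- Wrap every match in a text node; copy the rest of the text through -->
  <xsl:template match="text()" priority="1">
    <xsl:analyze-string select="." regex="{$acronym-regex}">
      <xsl:matching-substring>
        <xsl:variable name="acronym" as="xs:string" select="." />
        <acronym title="{$acronyms[@acronym = $acronym]}">
          <xsl:value-of select="$acronym" />
        </acronym>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="." />
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>
</xsl:stylesheet>
```

The as="element(acronym)+" declaration also guards against an empty list, which would otherwise produce an empty regex.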

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/
