[xsl] Grouping by character runs (and keeping element structure)

Subject: [xsl] Grouping by character runs (and keeping element structure)
From: "Christian Roth" <roth@xxxxxxxxxxxxxx>
Date: Thu, 27 Jul 2006 12:34:33 +0200
Continuing my grouping issues: 

XSLT 2.0 handles grouping at the node level quite conveniently. However,
adding structure to legacy, rather flat content (i.e., character runs)
still poses challenges for grouping. The following applies mainly to
document-centric (as opposed to data-centric) XML.

__ EXAMPLE 1 __

<p>Note #4: Don't tumble dry your pet.</p>

TASK:
Group the leading paragraph text "Note #4:" using <marker> so that the
result looks like (indented for readability):

<p><marker>Note #4:</marker>
   Don't tumble dry your pet.</p> 

SOLUTION:
The solution is easy, as we can just work on the text without having to
worry about markup:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0">

  <xsl:template match="p">
    <xsl:copy>
      <xsl:analyze-string select="." regex="^Note\s#\d+:">
        <xsl:matching-substring>
          <marker>
            <xsl:value-of select="." />
          </marker>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <xsl:value-of select="normalize-space(.)" />
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>


However, in "real" documents, you will likely have something like this:


__ EXAMPLE 2 __

<p><b>Note</b> <i>#4</i>: Don't tumble dry your pet.</p>

TASK:
Group the leading paragraph text "Note #4:" including any contained
markup using <marker> so that the result looks like:

<p><marker><b>Note</b> <i>#4</i>:</marker>
   Don't tumble dry your pet.</p> 

SOLUTION:
Here it starts to get really complicated. The text now contains markup
that we need to retain, but the character run must still be considered
from the <p> level (so that we can test for "starts with pattern" using
'^'). <xsl:analyze-string/> works on the string value of <p> only, so
the <b> and <i> elements are lost in its output, and it does not seem to
do the trick in this case.

A worst-case scenario of course would be:


__ EXAMPLE 3 __

<p><ul><b>Note</b> <i>#4</i>: Don't tumble dry your pet</ul>.</p>

TASK:
Group the leading paragraph text "Note #4:" including any contained
markup using <marker> to a child of <p> so that the result looks like:

<p><marker><ul><b>Note</b> <i>#4</i>:</ul></marker>
   <ul>Don't tumble dry your pet</ul>.</p> 

SOLUTION: 
Same problems as in EXAMPLE 2, but additionally note that the <ul>
element must be split/duplicated so that <marker> can be a child of <p>,
yet retain the full formatting info in the form of the contained element
structure.
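
A POSSIBLE DIRECTION (rough sketch only, not an established idiom):
One way to approach EXAMPLES 2 and 3 might be to measure the match on
the string value of <p>, then process the content of <p> twice: once
keeping only the characters before that offset (the head, wrapped in
<marker>), and once keeping only the rest (the tail). Elements that
straddle the boundary, such as <ul> in EXAMPLE 3, are thereby
duplicated. The namespace URI bound to f:, the function f:start, the
mode name "split" and the tunnel parameter names are all made up for
this illustration. Known limitations of the sketch: the offset lookup
is quadratic in the number of text nodes, and comments, processing
instructions and elements that are empty in the source are dropped.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:xs="http://www.w3.org/2001/XMLSchema"
                xmlns:f="urn:example:split"
                exclude-result-prefixes="xs f"
                version="2.0">

  <!-- Hypothetical helper: number of characters inside $root that
       precede the text node $t in document order. -->
  <xsl:function name="f:start" as="xs:integer">
    <xsl:param name="t" as="text()"/>
    <xsl:param name="root" as="element()"/>
    <xsl:sequence
      select="string-length(string-join($root//text()[. &lt;&lt; $t], ''))"/>
  </xsl:function>

  <xsl:template match="p">
    <!-- Match against the string value of <p>, so '^' still works. -->
    <xsl:variable name="prefix" as="xs:string*">
      <xsl:analyze-string select="string(.)" regex="^Note\s#\d+:">
        <xsl:matching-substring>
          <xsl:sequence select="."/>
        </xsl:matching-substring>
      </xsl:analyze-string>
    </xsl:variable>
    <!-- Length of the leading run; 0 if the pattern did not match. -->
    <xsl:variable name="len" as="xs:integer"
                  select="string-length($prefix[1])"/>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:choose>
        <xsl:when test="$len gt 0">
          <!-- First pass: everything before the offset. -->
          <marker>
            <xsl:apply-templates mode="split">
              <xsl:with-param name="root" select="." tunnel="yes"/>
              <xsl:with-param name="len" select="$len" tunnel="yes"/>
              <xsl:with-param name="part" select="'head'" tunnel="yes"/>
            </xsl:apply-templates>
          </marker>
          <!-- Second pass: everything from the offset onwards. -->
          <xsl:apply-templates mode="split">
            <xsl:with-param name="root" select="." tunnel="yes"/>
            <xsl:with-param name="len" select="$len" tunnel="yes"/>
            <xsl:with-param name="part" select="'tail'" tunnel="yes"/>
          </xsl:apply-templates>
        </xsl:when>
        <xsl:otherwise>
          <!-- No leading run found: leave the content untouched. -->
          <xsl:copy-of select="node()"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:copy>
  </xsl:template>

  <!-- Copy an element only if it has content in the current part. -->
  <xsl:template match="*" mode="split">
    <xsl:variable name="content">
      <xsl:apply-templates mode="split"/>
    </xsl:variable>
    <xsl:if test="$content/node()">
      <xsl:copy>
        <xsl:copy-of select="@*, $content/node()"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- Emit the slice of a text node that falls into the current part. -->
  <xsl:template match="text()" mode="split">
    <xsl:param name="root" as="element()" tunnel="yes"/>
    <xsl:param name="len" as="xs:integer" tunnel="yes"/>
    <xsl:param name="part" as="xs:string" tunnel="yes"/>
    <xsl:variable name="start" select="f:start(., $root)"/>
    <!-- How many characters of this node belong to the head. -->
    <xsl:variable name="n"
      select="min((max(($len - $start, 0)), string-length(.)))"/>
    <xsl:value-of select="if ($part eq 'head')
                          then substring(., 1, $n)
                          else substring(., $n + 1)"/>
  </xsl:template>
</xsl:stylesheet>

For EXAMPLE 3 this should give the nesting asked for in the TASK, except
that the space after the colon is kept at the start of the second <ul>
rather than being normalized away.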


Is there an established pattern for tackling these kinds of problems in
XSLT, or is the language just not the tool of choice for this kind of
transformation?


-Christian
