Re: [xsl] Splitting text nodes - xsl:iterate?

Subject: Re: [xsl] Splitting text nodes - xsl:iterate?
From: "Martin Honnen martin.honnen@xxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 12 Nov 2014 18:18:38 -0000
Tom Cleghorn tcleghorn@xxxxxxxxxxxxx wrote:

Given an input document looking something like this:
<doc>
   <head><foo/><bar/><baz/></head>
   <body>
     <sec>
       <para>Lorem ipsum dolor sit amet, consectetur adipiscing
elit.<box outline="maybe"><para quack="y">Proin id <?foo bar?>bibendum
urna, <baz>ut ornare</baz> mi.</para></box></para>
       <para>Aenean dui risus, <qux>sodales quis leo sit amet, ornare
consequat</qux> metus. Ut vel massa congue, egestas nibh et, rutrum
odio.</para>
     </sec>
   </body>
</doc>

(i.e. document markup consisting of arbitrary text and element nodes
nested to some unknown depth)

and the requirement for two separate outputs looking like these:
<doc>
   <head><foo/><bar/><baz/></head>
   <body>
     <sec>
       <para><new:start/>Lorem ipsum dolor sit amet, consectetur
adipiscing elit.<box outline="maybe"><para quack="y">Proin id <?foo
bar?>bibendum urna, <baz>ut ornare</baz> mi.</para></box></para>
       <para>Aenean dui risus, <qux>sodales quis <new:end/>leo sit amet,
ornare consequat</qux> metus. Ut vel massa congue, egestas nibh et,
rutrum odio.</para>
     </sec>
   </body>
</doc>

<sec>
   <para>Lorem ipsum dolor sit amet, consectetur adipiscing elit.<box
outline="maybe"><para quack="y">Proin id <?foo bar?>bibendum urna,
<baz>ut ornare</baz> mi.</para></box></para>
   <para>Aenean dui risus, <qux>sodales quis [...]</qux></para>
</sec>

(i.e. a copy of the input, with new:start and new:end elements marking
the first 20 words of the document; and separately a copy of those first
twenty words, preserving all markup within them and adding ellipses at
the end)

I tried the following with Saxon 9.6 PE:


<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xf="http://www.w3.org/2005/xpath-functions"
xmlns:new="http://example.com/new"
exclude-result-prefixes="xs xf">


<xsl:param name="size" as="xs:integer" select="20"/>

<xsl:variable name="regex" as="xs:string" select="concat('^(\w+[\s\p{P}]+){', $size, '}')"/>

<xsl:param name="file-name" as="xs:string" select="'test2014111202Text.xml'"/>

<xsl:variable name="start-node" as="text()?" select="descendant::text()[normalize-space()][1]"/>

<xsl:variable name="end-node" as="text()?"
select="descendant::text()[normalize-space() and matches(string-join((preceding::text()[normalize-space()], .), ''), $regex)][1]"/>


<xsl:variable name="end-words" as="xs:string?"
select="replace(string-join(($end-node/preceding::text()[normalize-space()], $end-node), ''), $regex, '')"/>

<xsl:template match="/">

  <xsl:variable name="d1">
    <xsl:apply-templates/>
  </xsl:variable>

  <xsl:copy-of select="$d1"/>

  <xsl:result-document href="{$file-name}">
    <xsl:variable name="split" select="$d1//new:end"/>
    <xsl:variable name="copy" select="$split/(ancestor-or-self::node() | preceding::node())"/>
    <xsl:apply-templates select="($copy//sec)[1]" mode="sep">
      <xsl:with-param name="nodes" select="$copy" tunnel="yes"/>
    </xsl:apply-templates>
  </xsl:result-document>


</xsl:template>

<xsl:template match="node()" mode="sep">
  <xsl:param name="nodes" tunnel="yes"/>
  <xsl:if test=". intersect $nodes">
    <xsl:copy>
      <xsl:apply-templates select="@* , node()" mode="sep"/>
    </xsl:copy>
  </xsl:if>
</xsl:template>

<xsl:template match="new:start" mode="sep"/>

<xsl:template match="new:end" mode="sep">
  <xsl:text>[...]</xsl:text>
</xsl:template>

<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* , node()"/>
  </xsl:copy>
</xsl:template>

<xsl:template match="$start-node" priority="5">
<new:start/>
<!-- would like to use
<xsl:next-match/>
here, to apply either the identity transformation template if $start-node and $end-node are different
or the template below if they are the same,
but ran into a problem with Saxon 9.6 PE
-->
<xsl:value-of select="."/>
</xsl:template>


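<xsl:template match="$end-node">
  <!-- $end-words is a suffix of this text node, so take the prefix by length:
       substring-before() returns '' when $end-words is empty and stops at the
       first occurrence when the same characters also appear earlier in the node -->
  <xsl:value-of select="substring(., 1, string-length(.) - string-length($end-words))"/>
  <new:end/>
  <xsl:value-of select="$end-words"/>
</xsl:template>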

</xsl:stylesheet>


I think it produces the output you want for the input you posted, but I have not tried it on other samples. Obviously, part of the approach is writing a regular expression that identifies the "words"; I used


<xsl:variable name="regex" as="xs:string" select="concat('^(\w+[\s\p{P}]+){', $size, '}')"/>

which works on your sample but would fail, for instance, if the first text node containing words starts with white space or punctuation characters.
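
One possible way around that, as an untested sketch rather than a checked fix: allow an optional run of white space or punctuation before the first word. The small stand-alone stylesheet below only compares the two patterns on a made-up sample string; the names lenient-regex and main, the sample text and the size of 3 are assumptions for illustration and not part of the stylesheet above.

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="3.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
xmlns:xs="http://www.w3.org/2001/XMLSchema"
exclude-result-prefixes="xs">

<!-- small size so the made-up sample below is long enough -->
<xsl:param name="size" as="xs:integer" select="3"/>

<!-- pattern as used in the stylesheet above -->
<xsl:variable name="regex" as="xs:string" select="concat('^(\w+[\s\p{P}]+){', $size, '}')"/>

<!-- hypothetical variant: first skip any leading white space or punctuation -->
<xsl:variable name="lenient-regex" as="xs:string" select="concat('^[\s\p{P}]*(\w+[\s\p{P}]+){', $size, '}')"/>

<xsl:template name="main">
  <!-- made-up sample that starts with white space and punctuation -->
  <xsl:variable name="sample" as="xs:string" select="'  (Lorem ipsum, dolor sit amet.'"/>
  <result>
    <original-matches><xsl:value-of select="matches($sample, $regex)"/></original-matches>
    <variant-matches><xsl:value-of select="matches($sample, $lenient-regex)"/></variant-matches>
    <!-- what is left after removing the first $size words -->
    <remainder><xsl:value-of select="replace($sample, $lenient-regex, '')"/></remainder>
  </result>
</xsl:template>

</xsl:stylesheet>

Started at the named template main (with Saxon, for example, through the -it:main option), this should report false for the original pattern and true for the variant, with "sit amet." left over as the remainder. The variant still needs white space or punctuation after the last counted word, so a text that ends directly after that word would not match either.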
