Re: [xsl] Splitting a paragraph into sentences and keep markup

Subject: Re: [xsl] Splitting a paragraph into sentences and keep markup
From: "Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Sun, 24 Nov 2019 17:15:17 -0000
There's a package for splitting at arbitrarily deeply nested nodes. It is part of a paper that I presented at XML Prague this year: https://archive.xmlprague.cz/2019/files/xmlprague-2019-proceedings.pdf#page=347

The package itself is at https://subversion.le-tex.de/common/presentations/2019-02-09_xmlprague_xslt-upward-projection/lib/split.xsl

Using this package, Martin's p-matching template becomes:

<xsl:template match="p[node()]">
  <xsl:variable name="p-with-markers" as="element(p)">
    <xsl:apply-templates select="." mode="insert-marker"/>
  </xsl:variable><!-- this hasn't changed -->
  <xsl:variable name="chunks" as="document-node(element(split:chunks))">
    <!-- mode provided by lib/split.xsl -->
    <xsl:apply-templates select="$p-with-markers" mode="split:split-entrypoint">
      <!-- Will be evaluated as an XPath expression for each node in a
           for-each-group[@group-starting-with] population. If a population
           node satisfies the expression, it will start a group. -->
      <xsl:with-param name="group-start-exp" as="xs:string" select="'self::eos'"/>
      <!-- remove <eos/> after splitting -->
      <xsl:with-param name="keep-splitting-node" as="xs:boolean" select="false()"/>
    </xsl:apply-templates>
  </xsl:variable>
  <xsl:copy-of select="$chunks/split:chunks/split:chunk/p[node()]" copy-namespaces="no"/>
</xsl:template>
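
The insert-marker mode used above is not shown in this message. Purely as an illustration, and not as Martin's actual code, a minimal sketch of such a mode might look like the following. It assumes that a sentence ends at a "." or "?" followed by whitespace, and that the marker element is called eos (the name is taken from the group-start-exp parameter above); the regex is an assumption as well.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed sketch: identity copy in the marker mode, so that
       all elements and attributes of the paragraph are kept. -->
  <xsl:template match="@* | node()" mode="insert-marker">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" mode="insert-marker"/>
    </xsl:copy>
  </xsl:template>

  <!-- Assumed sketch: in every text node, emit an <eos/> marker
       after each "." or "?" that is followed by whitespace. The
       marker is placed after the trailing whitespace. -->
  <xsl:template match="text()" mode="insert-marker">
    <xsl:analyze-string select="." regex="[.?]\s+">
      <xsl:matching-substring>
        <xsl:value-of select="."/>
        <eos/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

With nested input such as David's example, the markers produced this way end up inside the span element, which is why a splitting step that works at arbitrary depth is needed afterwards.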


The complete stylesheet is at https://gist.github.com/gimsieke/529dab000386a45d6136e850a80ac726

Applying it to your input, David, will yield:

<?xml version="1.0" encoding="UTF-8"?><root>
<p>This has one <span class="zzz">sentence? </span></p><p><span class="zzz">Actually, it has
<emphasis>two</emphasis>. </span></p><p><span class="zzz">No,</span> it has three.</p>
</root>


Gerrit


On 24.11.2019 15:32, David Carlisle d.p.carlisle@xxxxxxxxx wrote:
can we assume the easy case (as in your example) where all the
sentences end at the top level?
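
For that easy case, where every sentence boundary is a direct child of the paragraph, plain grouping would probably be enough. The following is only a sketch under assumptions: it presumes that an earlier pass has already inserted empty eos marker elements as children of p, and the mode name split-flat is made up for the illustration.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed sketch: each top-level <eos/> child starts a new group;
       every group becomes its own <p>, with the marker itself dropped. -->
  <xsl:template match="p" mode="split-flat">
    <xsl:for-each-group select="node()" group-starting-with="eos">
      <p>
        <xsl:copy-of select="current-group()[not(self::eos)]"/>
      </p>
    </xsl:for-each-group>
  </xsl:template>

</xsl:stylesheet>
```

This does not help once a marker sits inside a child element, because grouping only sees the children of p.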

a more challenging example is

<root>
     <p>This has one <span class="zzz">sentence? Actually, it has
<emphasis>two</emphasis>.  No,</span> it has three.</p>
</root>

as then you need to force-close any open elements at the sentence end
and re-open them in the new sentence so something like

   <p>This has one <span class="zzz">sentence?</span></p>
   <p><span class="zzz">Actually, it has <emphasis>two</emphasis>.</span></p>
  <p><span class="zzz">No,</span> it has three.</p>

David

On Sun, 24 Nov 2019 at 13:34, Rick Quatro rick@xxxxxxxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

Hi All,




I have a situation where I want to split a short paragraph into sentences and use them in different parts of my output. I am using <xsl:analyze-string> because I want to account for a sentence ending with a . or ?. This works unless the paragraph has child elements, like the <emphasis> in the second sentence. Can I split a paragraph into sentences and still keep the markup?



Here is my input document:



<?xml version="1.0" encoding="UTF-8"?>

<root>

<p>This has one sentence? Actually, it has <emphasis>two</emphasis>. No, it has three.</p>

</root>



My stylesheet:



<?xml version="1.0" encoding="UTF-8"?>

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

xmlns:xs="http://www.w3.org/2001/XMLSchema"

xmlns:rq="http://www.frameexpert.com"

exclude-result-prefixes="xs rq"

version="2.0">



<xsl:output indent="yes"/>

<xsl:strip-space elements="root"/>



<xsl:template match="/root">

<xsl:copy>

<xsl:apply-templates/>

</xsl:copy>

</xsl:template>



<xsl:template match="p">

<xsl:variable name="sentences" select="rq:splitParagraphIntoSentences(.)"/>

<p><xsl:value-of select="$sentences[1]"/></p>

<note>Something in between.</note>

<p><xsl:value-of select="$sentences[position()&gt;1]"/></p>

</xsl:template>



<xsl:function name="rq:splitParagraphIntoSentences">

<xsl:param name="paragraph"/>

<xsl:analyze-string select="$paragraph" regex=".+?[\.\?](\s+|$)">

<xsl:matching-substring>

<sentence><xsl:value-of select="replace(.,'\s+$','')"/></sentence>

</xsl:matching-substring>

</xsl:analyze-string>

</xsl:function>

</xsl:stylesheet>



My output:



<?xml version="1.0" encoding="UTF-8"?>

<root>

<p>This has one sentence?</p>

<note>Something in between.</note>

<p>Actually, it has two. No, it has three.</p>

</root>



What I want is this:



<?xml version="1.0" encoding="UTF-8"?>

<root>

<p>This has one sentence? </p>

<note>Something in between.</note>

<p>Actually, it has <emphasis>two</emphasis>. No, it has three. </p>

</root>



Any suggestions will be appreciated.



Rick
