Subject: Re: [xsl] detect sentence surrounding a tag
From: "Flynn, Peter pflynn@xxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 27 Jul 2016 08:58:31 -0000

On 26/07/16 21:21, Dorothy Hoskins dorothy.hoskins@xxxxxxxxx wrote:
> HI, in the case of the element A containing multiple sentences (assuming
> "." as end of sentence punctuation), is there a reliable way to find the
> sentence that surrounds the child element B wherever it occurs in A?
>
> I think that the solution (regex?) will have to look backwards from the
> start tag of B and past the end tag of A to the nearest "."
>
> I recognize that if there is some abbreviation or decimal number in the
> sentence that will be interpreted as the end of sentence. That's OK as a
> limitation.

Very crudely, yes (I have taken the liberty of adding a dot after the
question mark and the quoted dot in your example to make them fit the
pattern of "sentence ends with dot"):

========================== test.xml =================================
<A>HI, in the case of the element A containing multiple sentences
  (assuming "." as end of sentence punctuation), is there a reliable
  way to find the sentence that surrounds <B>the child element B</B>
  wherever it occurs in A?. I think that the solution (regex?) will
  have to look backwards from the start tag of <B>B and past the end
    tag of A</B> to the nearest ".". I recognize that if there is some
  abbreviation or decimal number in the sentence that will be
  interpreted as the end of sentence. That's OK as a limitation.</A>
========================== test.xsl ==================================
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="xml"/>

  <xsl:template match="/">
    <text>
      <xsl:apply-templates/>
    </text>
  </xsl:template>

  <xsl:template match="A">
    <xsl:for-each select="B">
      <sentence>
        <xsl:value-of
          select="tokenize(preceding-sibling::text()[1],'\. ')
                  [position()=last()]"/>
        <xsl:value-of select="."/>
        <xsl:variable name="posttext"
          select="following-sibling::text()[1]"/>
        <xsl:value-of
          select="tokenize($posttext,'\. ')[1]"/>
        <xsl:text>.</xsl:text>
      </sentence>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
============================ output =================================
<?xml version="1.0" encoding="UTF-8"?><text><sentence>HI, in the case of
the element A containing multiple sentences
  (assuming "." as end of sentence punctuation), is there a reliable
  way to find the sentence that surrounds the child element B
  wherever it occurs in A?.</sentence><sentence>I think that the
solution (regex?) will
  have to look backwards from the start tag of B and past the end
    tag of A to the nearest ".".</sentence></text>
=====================================================================

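This will fail on a probably significant number of test cases. It only
looks at the nearest text node on either side of each B, so a sentence
that contains another element (or a second B) will come out truncated.
Making it work with sentences ending in question marks, exclamation
marks, quoted dots, etc. is left as an exercise... :-)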
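
For anyone wanting a starting point on that exercise, here is a rough
sketch, not a tested solution. It swaps tokenize() for replace() so the
terminating punctuation is kept rather than thrown away, and it treats
".", "?" and "!" as terminators when followed by whitespace (or by the
end of the text node). The "pretext" variable name and both regular
expressions are my own assumptions; abbreviations, decimal numbers and
sentences spanning other elements will still trip it up.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0">

  <xsl:output method="xml"/>

  <xsl:template match="/">
    <text>
      <xsl:apply-templates/>
    </text>
  </xsl:template>

  <xsl:template match="A">
    <xsl:for-each select="B">
      <!-- Nearest text node on each side of this B (either may be
           absent, in which case replace() yields an empty string). -->
      <xsl:variable name="pretext"
        select="preceding-sibling::text()[1]"/>
      <xsl:variable name="posttext"
        select="following-sibling::text()[1]"/>
      <sentence>
        <!-- Greedy match: strip everything up to and including the
             LAST terminator that is followed by whitespace. The 's'
             flag lets '.' match newlines in wrapped text. -->
        <xsl:value-of
          select="replace($pretext, '^.*[.?!]\s+', '', 's')"/>
        <xsl:value-of select="."/>
        <!-- Reluctant match: keep everything up to and including the
             FIRST terminator that is followed by whitespace or by the
             end of the text node, and drop the rest. -->
        <xsl:value-of
          select="replace($posttext, '^(.*?[.?!])(\s.*)?$', '$1', 's')"/>
      </sentence>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Because the terminator is kept from the source text, there is no need
to append a literal "." as in the version above. If a text node has no
terminator at all, replace() finds no match and returns the node
unchanged, which is roughly what the tokenize() version did too.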

///Peter
--
Peter Flynn | Academic & Collaborative Technologies | University College
Cork IT Services | +353 21 490 2609 | pflynn@xxxxxx | www.ucc.ie
