Re: [xsl] regex, shortest match

Subject: Re: [xsl] regex, shortest match
From: Dave Pawson <davep@xxxxxxxxxxxxx>
Date: Fri, 01 Aug 2008 10:19:05 +0100
David Carlisle wrote:
I'm looking to parse sentences out of paras.

To be more exact, you are trying to parse a sentence with a regular expression, which would cause you to fail a logic course, as natural language must be the canonical example of a non-regular language :-)
Highly likely.


You need to define a sentence.

I tried with the worst examples in the source text.



So perhaps a sentence is terminated by "." followed by end of string or
whitespace:

([^.]|\.[^ \n\r\t])*\.(\s+|$)





but this would of course still fail if the sentence were to contain
". " coming from "D. P. Carlisle" or "dr. " or ...

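To see that failure concretely, here is a minimal sketch that runs the
pattern quoted above over two test strings. It uses Python's re module
purely as a stand-in for the XSLT 2.0 regex engine (an assumption: the two
flavours should agree on the constructs this pattern uses). The function
name split_sentences and both sample strings are made up for illustration.

```python
import re

# The pattern from the message, unchanged. Python's `re` is only a stand-in
# for the XSLT 2.0 regex engine here (assumption, see the text above).
SENTENCE = re.compile(r"([^.]|\.[^ \n\r\t])*\.(\s+|$)")


def split_sentences(text):
    """Return every match of SENTENCE in order, with outer whitespace trimmed."""
    return [m.group(0).strip() for m in SENTENCE.finditer(text)]


# A full stop followed by a non-space character does not end the sentence.
print(split_sentences("It costs 3.5 pounds. See www.example.org for details."))
# ['It costs 3.5 pounds.', 'See www.example.org for details.']

# A full stop followed by a space always does, abbreviation or not.
print(split_sentences("Written by Dr. Michael Kay and D.P. Carlisle."))
# ['Written by Dr.', 'Michael Kay and D.P.', 'Carlisle.']
```

The second result is cut after "Dr." and after "D.P.", which is the kind
of breakage described above.
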
If you try to parse natural language with a single regular expression,
it will _always_ fail. But you can cover more or less arbitrarily
complicated subsets of the language by making the regexp
correspondingly more complicated (and slow).
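
One way to picture "more complicated" is to add known abbreviations to the
pattern as extra alternatives that are tried before the original two. This
is only a sketch under assumptions: it is written against Python's re
module, it relies on that engine trying alternatives left to right (which
may not carry over unchanged to an XSLT processor), and the ABBREVIATIONS
list is hypothetical and far too short for real text.

```python
import re

# Hypothetical, deliberately tiny list; real text would need many more entries.
ABBREVIATIONS = ["Dr.", "Mr.", "D.P."]

# Each abbreviation plus its trailing space becomes an alternative that is
# tried before the two alternatives of the original pattern.
PROTECTED = "|".join(re.escape(a) + " " for a in ABBREVIATIONS)
SENTENCE = re.compile(r"(" + PROTECTED + r"|[^.]|\.[^ \n\r\t])*\.(\s+|$)")


def split_sentences(text):
    """Return every match of the extended pattern, outer whitespace trimmed."""
    return [m.group(0).strip() for m in SENTENCE.finditer(text)]


print(split_sentences("Written by Dr. Michael Kay and D.P. Carlisle. It works."))
# ['Written by Dr. Michael Kay and D.P. Carlisle.', 'It works.']
```

Every abbreviation added is one more alternative to try at each character,
and anything missing from the list still splits the sentence, so this only
moves the failure around rather than removing it.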



<para>Sentence containing Dr. Michael Kay and D.P. Carlisle</para>


<grin/> I'd expect that to break most regexen :-)



  <xsl:template match="para">
    <para>
      <xsl:analyze-string select="." regex="([^.]|\.[^ \n\r\t])*\.(\s+|$)">
        <xsl:matching-substring>
          <s> <xsl:value-of select="normalize-space(.)"/></s>
        </xsl:matching-substring>
        <xsl:non-matching-substring>
          <error> <xsl:value-of select="normalize-space(.)"/> </error>
        </xsl:non-matching-substring>
      </xsl:analyze-string>
    </para>
  </xsl:template>

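For readers without an XSLT 2.0 processor to hand, the matching and
non-matching branches of the template can be sketched roughly as follows.
This is an approximation in Python, not the stylesheet itself: the function
name analyze is made up, Python's re stands in for the XSLT regex engine,
and " ".join(x.split()) is used as a rough equivalent of normalize-space().

```python
import re

SENTENCE = re.compile(r"([^.]|\.[^ \n\r\t])*\.(\s+|$)")


def normalize_space(text):
    """Rough stand-in for XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(text.split())


def analyze(text):
    """Yield ("s", chunk) for each match and ("error", chunk) for text between matches."""
    pos = 0
    for m in SENTENCE.finditer(text):
        if m.start() > pos:
            # Text skipped before this match: the non-matching branch.
            yield ("error", normalize_space(text[pos:m.start()]))
        yield ("s", normalize_space(m.group(0)))
        pos = m.end()
    if pos < len(text):
        # Text left over after the last match: also the non-matching branch.
        yield ("error", normalize_space(text[pos:]))


for kind, chunk in analyze("One. Two.\n  Three is unfinished"):
    print(kind, "->", chunk)
# s -> One.
# s -> Two.
# error -> Three is unfinished
```

With this pattern a match can start at any character and runs to the next
full stop that is followed by whitespace or the end of the string, so as
far as I can tell the error branch only fires for text trailing after the
last such full stop.
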
Thanks David. That's better than my improvement.
No 'error' elements in 12000 lines.

Much appreciated.

regards

--
Dave Pawson
XSLT XSL-FO FAQ.
http://www.dpawson.co.uk
