Re: [xsl] regex, shortest match

Subject: Re: [xsl] regex, shortest match
From: David Carlisle <davidc@xxxxxxxxx>
Date: Fri, 1 Aug 2008 09:42:22 +0100

> I'm looking to parse sentences out of paras.

To be more exact, you are trying to parse a sentence with a regular
expression, which would cause you to fail a logic course, as natural
language must be the canonical example of a non-regular language :-)

> "((.+).)

"." is a metacharacter matching any character, so that is a sequence of
one or more characters followed by one more character, i.e. it's any
sequence of two or more characters.
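
A quick way to check that claim is the sketch below. It uses Python's re
module rather than the XPath regex dialect; that swap is my assumption,
made only so the check is easy to run, and the helper name matches_whole
is invented for illustration.

```python
import re

# Assumption: Python's re stands in for the XPath 2.0 regex dialect here.
# The pattern is the one quoted above, minus the leading quote character.
PATTERN = re.compile(r"(.+).")

def matches_whole(text):
    """Return True if the pattern matches the entire string."""
    return PATTERN.fullmatch(text) is not None

if __name__ == "__main__":
    # One character is too short; anything of length two or more matches,
    # full stops included, so the pattern says nothing about sentences.
    for sample in ["a", "ab", "Hi. Bye."]:
        print(repr(sample), matches_whole(sample))
```

Because "+" is greedy, an unanchored match also runs to the end of the
line rather than stopping at the first full stop.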




You need to define a sentence. If a sentence cannot contain a "." but
always ends with a ".", then you could do [^.]*\.

but then

it cost $2.00.

is two sentences.
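
The split can be reproduced with a small sketch, again using Python's re
module as a stand-in for the XPath dialect (my assumption); the function
name naive_sentences is hypothetical.

```python
import re

# Assumption: Python's re stands in for the XPath 2.0 regex dialect here.
# "Anything that is not a full stop, then a full stop."
NAIVE = re.compile(r"[^.]*\.")

def naive_sentences(text):
    """Return every non-overlapping match of the naive pattern."""
    return NAIVE.findall(text)

if __name__ == "__main__":
    # The decimal point is treated as a sentence terminator.
    print(naive_sentences("it cost $2.00."))
```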



So perhaps a sentence is terminated by "." followed by end of string or
whitespace:

 ([^.]|\.[^ \n\r\t])*\.(\s+|$)
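
As a rough check of that pattern outside XSLT, here is a sketch in
Python's re module (my assumption; the XPath dialect may differ in
details such as how "$" treats a trailing newline). The name
split_sentences is invented for illustration.

```python
import re

# Assumption: Python's re stands in for the XPath 2.0 regex dialect here.
# A "." only ends the sentence when whitespace or end of string follows.
SENTENCE = re.compile(r"([^.]|\.[^ \n\r\t])*\.(\s+|$)")

def split_sentences(text):
    """Return each whole match, with surrounding whitespace trimmed."""
    # finditer + group(0) is used because findall would return only the
    # contents of the capture groups, not the full match.
    return [m.group(0).strip() for m in SENTENCE.finditer(text)]

if __name__ == "__main__":
    # The decimal point in $2.00 no longer splits the sentence.
    print(split_sentences("It cost $2.00. That is cheap."))
```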




<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 

<xsl:output method="text"/>

<xsl:template match="para">

new para
<xsl:analyze-string select="." regex="([^.]|\.[^ \n\r\t])*\.(\s+|$)">
<xsl:matching-substring>
 sentence: <xsl:value-of select="normalize-space(.)"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
 oops:  <xsl:value-of select="normalize-space(.)"/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>
</xsl:stylesheet>





 saxon9 para.xml para.xsl



new para

 sentence: It is sometimes desired to have a specific heading which should not be numbered.
 sentence: This corresponds to unnumbered list headers in lists (see sections 4.3).
 sentence: To facilitate this, an optional attribute text:is-list-header can be used.
 sentence: If true, the given header will not be numbered, even if an explicit list-style is given.


new para

 sentence: A text:style-name attribute references a paragraph style, while a text:cond-style-name attribute references a conditional-style, that is, a style that contains conditions and maps to other styles (see section 14.1.1).
 sentence: If a conditional style is applied to a paragraph, the text:style-name attribute contains the name of the style that was the result of the conditional style evaluation, while the conditional style name itself is the value of the text:cond-style-name attribute.
 sentence: This XML structure simplifies [XSLT] transformations because XSLT only has to acknowledge the conditional style if the formatting attributes are relevant.
 sentence: The referenced style can be a common style or an automatic style.


new para

 sentence: A text:class-names attribute takes a whitespace separated list of paragraph style names.
 sentence: The referenced styles are applied in the order they are contained in the list.
 sentence: If both, text:style-name and text:class-names are present, the style referenced by the text:style-name attribute is as the first style in the list in text:class-names.
 sentence: If a conditional style is specified together with a style:class-names attribute, but without the text:style-name attribute, then the first style in the style list is used as the value of the missing text:style-name attribute.


new para

 sentence: A page sequence element <text:page-sequence> specifies a sequence of master pages that are instantiated in exactly the same order as they are referenced in the page sequence.
 sentence: If a text document contains a page sequence, it will consist of exactly as many pages as specified.
 sentence: Documents with page sequences do not have a main text flow consisting of headings and paragraphs as is the case for documents that do not contain a page sequence.
 sentence: Text content is included within text boxes for documents with page sequences.
 sentence: The only other content that is permitted are drawing objects.




but this would of course still fail if the sentence were to contain
". " coming from "D. P. Carlisle" or "dr. " or ...
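
That failure is easy to reproduce. The sketch below applies the same
pattern through Python's re module (an assumption, standing in for the
XPath dialect); split_sentences is a hypothetical helper name.

```python
import re

# Assumption: Python's re stands in for the XPath 2.0 regex dialect here.
SENTENCE = re.compile(r"([^.]|\.[^ \n\r\t])*\.(\s+|$)")

def split_sentences(text):
    """Return each whole match, with surrounding whitespace trimmed."""
    return [m.group(0).strip() for m in SENTENCE.finditer(text)]

if __name__ == "__main__":
    # Each initial is followed by ". ", so each one ends a "sentence".
    print(split_sentences("D. P. Carlisle wrote this."))
```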

If you try to parse natural language with a single regular expression,
it will _always_ fail. But you can cover more or less arbitrarily
complicated subsets of the language by making the regexp
correspondingly more complicated (and slow).
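
As one illustration of that trade-off, the sketch below adds an extra
alternative that swallows a single capital letter or "Dr"/"dr" followed
by ". ". Everything here is my own assumption rather than part of the
stylesheet above: the abbreviation list is hypothetical, and the pattern
relies on \b and (?:...), which Python's re supports but which, as far
as I know, the XPath 2.0 regex dialect does not.

```python
import re

# Assumption: a tiny, hypothetical abbreviation list; real text needs more.
ABBREV = r"\b(?:Dr|dr|[A-Z])\.\s+"

# The abbreviation alternative is tried first, so "D. " is consumed as
# part of the sentence instead of terminating it.
SENTENCE = re.compile(r"(?:" + ABBREV + r"|[^.]|\.[^ \n\r\t])*\.(?:\s+|$)")

def split_sentences(text):
    """Return each whole match, with surrounding whitespace trimmed."""
    return [m.group(0).strip() for m in SENTENCE.finditer(text)]

if __name__ == "__main__":
    print(split_sentences("D. P. Carlisle saw dr. Smith. He left."))
    # The new rule has its own failure mode: a sentence that really does
    # end in a single capital letter is merged with the next one.
    print(split_sentences("Plan B. Next."))
```

So the extra rule fixes initials and breaks something else, which is the
general pattern: each refinement covers a larger subset and introduces
new edge cases.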


David

