Subject: Re: [xsl] regex, shortest match
From: David Carlisle <davidc@xxxxxxxxx>
Date: Fri, 1 Aug 2008 09:42:22 +0100

> I'm looking to parse sentences out of paras.

To be more exact, you are trying to parse a sentence with a regular
expression, which would cause you to fail a logic course, as natural
language must be the canonical example of a non-regular language :-)

> "((.+).)

. is a meta character matching any character, so that is a sequence of
one or more characters followed by a character, i.e. it's any sequence of
2 or more characters.

You need to define a sentence. If a sentence cannot contain a ".", but
always ends with a ".", then you could do

[^.]*\.

but then

it cost $2.00.

is two sentences. So perhaps a sentence is terminated by . followed by
end of string or whitespace:

([^.]|\.[^ \n\r\t])*\.(\s+|$)


<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>

<xsl:template match="para">
new para
<xsl:analyze-string select="." regex="([^.]|\.[^ \n\r\t])*\.(\s+|$)">
<xsl:matching-substring>
sentence: <xsl:value-of select="normalize-space(.)"/>
</xsl:matching-substring>
<xsl:non-matching-substring>
oops: <xsl:value-of select="normalize-space(.)"/>
</xsl:non-matching-substring>
</xsl:analyze-string>
</xsl:template>

</xsl:stylesheet>


saxon9 para.xml para.xsl

new para

sentence: It is sometimes desired to have a specific heading which should not be numbered.

sentence: This corresponds to unnumbered list headers in lists (see sections 4.3).

sentence: To facilitate this, an optional attribute text:is-list-header can be used.

sentence: If true, the given header will not be numbered, even if an explicit list-style is given.

new para

sentence: A text:style-name attribute references a paragraph style, while a text:cond-style-name attribute references a conditional-style, that is, a style that contains conditions and maps to other styles (see section 14.1.1).

sentence: If a conditional style is applied to a paragraph, the text:style-name attribute contains the name of the style that was the result of the conditional style evaluation, while the conditional style name itself is the value of the text:cond-style-name attribute.

sentence: This XML structure simplifies [XSLT] transformations because XSLT only has to acknowledge the conditional style if the formatting attributes are relevant.

sentence: The referenced style can be a common style or an automatic style.

new para

sentence: A text:class-names attribute takes a whitespace separated list of paragraph style names.

sentence: The referenced styles are applied in the order they are contained in the list.

sentence: If both, text:style-name and text:class-names are present, the style referenced by the text:style-name attribute is as the first style in the list in text:class-names.

sentence: If a conditional style is specified together with a style:class-names attribute, but without the text:style-name attribute, then the first style in the style list is used as the value of the missing text:style-name attribute.

new para

sentence: A page sequence element <text:page-sequence> specifies a sequence of master pages that are instantiated in exactly the same order as they are referenced in the page sequence.

sentence: If a text document contains a page sequence, it will consist of exactly as many pages as specified.

sentence: Documents with page sequences do not have a main text flow consisting of headings and paragraphs as is the case for documents that do not contain a page sequence.

sentence: Text content is included within text boxes for documents with page sequences.

sentence: The only other content that is permitted are drawing objects.


But this would of course still fail if the sentence were to contain ". "
coming from "D. P. Carlisle" or "dr. " or ...

If you try to parse natural language with a single regular expression,
it will _always_ fail. But you can cover more or less arbitrarily
complicated subsets of the language by making the regexp correspondingly
more complicated (and slow).

David
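
The two patterns discussed above can also be tried outside an XSLT processor. The following is only a rough sketch in Python, not part of the original stylesheet: it assumes that Python's `re` module treats these particular patterns the same way the XPath 2.0 regex flavour does (the two flavours differ in other respects, for example in how `$` handles a trailing newline). The names `NAIVE`, `SENTENCE` and `split_sentences` are made up for illustration, and the sample strings are built around the "it cost $2.00." and "D. P. Carlisle" cases mentioned in the message.

```python
import re

# Assumed Python ports of the two patterns from the message.
# NAIVE: a sentence is any run of non-dot characters followed by a dot.
NAIVE = re.compile(r"[^.]*\.")
# SENTENCE: a dot only ends a sentence when whitespace or end of string follows.
SENTENCE = re.compile(r"([^.]|\.[^ \n\r\t])*\.(\s+|$)")


def split_sentences(pattern, text):
    """Return each match of pattern in text, stripped of surrounding whitespace.

    Hypothetical helper: it plays the role of xsl:matching-substring and
    silently drops whatever the pattern does not match.
    """
    return [m.group(0).strip() for m in pattern.finditer(text)]


if __name__ == "__main__":
    price = "It cost $2.00. It was cheap."
    # The naive pattern cuts the price in half.
    print(split_sentences(NAIVE, price))
    # -> ['It cost $2.', '00.', 'It was cheap.']

    # Requiring whitespace or end of string after the dot keeps "$2.00" intact.
    print(split_sentences(SENTENCE, price))
    # -> ['It cost $2.00.', 'It was cheap.']

    # Initials followed by a space still break it.
    print(split_sentences(SENTENCE, "Ask D. P. Carlisle about it."))
    # -> ['Ask D.', 'P.', 'Carlisle about it.']
```

The last call shows the limitation named in the message: any ". " inside a sentence is taken as a sentence boundary, and covering such cases means growing the pattern case by case.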