[xsl] regex, shortest match

Subject: [xsl] regex, shortest match
From: Dave Pawson <davep@xxxxxxxxxxxxx>
Date: Fri, 01 Aug 2008 08:14:37 +0100
I'm looking to parse sentences out of paras.

Input

<para>It is sometimes desired to have a specific heading which should not be numbered. This corresponds to unnumbered list headers in lists (see sections 4.3). To facilitate this, an optional attribute text:is-list-header can be used. If true, the given header will not be numbered, even if an explicit list-style is given. </para>
<para>A text:style-name attribute references a paragraph style, while a text:cond-style-name attribute references a conditional-style, that is, a style that contains conditions and maps to other styles (see section 14.1.1). If a conditional style is applied to a paragraph, the text:style-name attribute contains the name of the style that was the result of the conditional style evaluation, while the conditional style name itself is the value of the text:cond-style-name attribute. This XML structure simplifies [XSLT] transformations because XSLT only has to acknowledge the conditional style if the formatting attributes are relevant. The referenced style can be a common style or an automatic style.</para>
<para>A text:class-names attribute takes a whitespace separated list of paragraph style names. The referenced styles are applied in the order they are contained in the list. If both, text:style-name and text:class-names are present, the style referenced by the text:style-name attribute is as the first style in the list in text:class-names. If a conditional style is specified together with a style:class-names attribute, but without the text:style-name attribute, then the first style in the style list is used as the value of the missing text:style-name attribute. </para>
<para>A page sequence element &lt;text:page-sequence> specifies a sequence of master pages that are instantiated in exactly the same order as they are referenced in the page sequence. If a text document contains a page sequence, it will consist of exactly as many pages as specified. Documents with page sequences do not have a main text flow consisting of headings and paragraphs as is the case for documents that do not contain a page sequence. Text content is included within text boxes for documents with page sequences. The only other content that is permitted are drawing objects. </para>


The stylesheet below 'works', but it hits the longest match. I can't
come up with a regex that is broad enough to cover a whole sentence,
yet stops at the shortest match.

Any suggestions, please?

TIA DaveP


<xsl:template match="para">
  <para>
    <xsl:variable name="contents" select="normalize-space(.)"/>
    <xsl:copy-of select="dp:sentence($contents)"/>
  </para>
</xsl:template>

<!-- Isolate sentences within paras -->
<xsl:function name="dp:sentence">
  <xsl:param name="nd" as='xs:string'/>
  <xsl:analyze-string regex="((.+).) |$ " select="$nd">
    <xsl:matching-substring>
          <s>
            <xsl:value-of select="regex-group(1)"/>
          </s>
    </xsl:matching-substring>
    <xsl:non-matching-substring>
          <p2><xsl:value-of select="."/></p2>
    </xsl:non-matching-substring>
  </xsl:analyze-string>
</xsl:function>
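
One possible way round the greedy match, offered as a sketch rather than a tested answer: XSLT 2.0 regular expressions accept reluctant quantifiers, so .+? takes as few characters as it can before the rest of the pattern succeeds. In the version below the dp namespace URI (urn:example:dp) is a placeholder, since the original binding isn't shown, and treating "a full stop followed by whitespace or end of string" as the sentence boundary is an assumption about the input.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:dp="urn:example:dp"
    exclude-result-prefixes="xs dp">
  <!-- urn:example:dp is a hypothetical URI; use whatever dp is really bound to -->

  <xsl:template match="para">
    <para>
      <xsl:sequence select="dp:sentence(normalize-space(.))"/>
    </para>
  </xsl:template>

  <!-- Split a normalized string into sentences.
       (.+?\.)  group 1: the fewest characters that end in a full stop
       (\s|$)   the full stop must be followed by whitespace or end of string -->
  <xsl:function name="dp:sentence" as="element()*">
    <xsl:param name="nd" as="xs:string"/>
    <xsl:analyze-string select="$nd" regex="(.+?\.)(\s|$)">
      <xsl:matching-substring>
        <s>
          <xsl:value-of select="regex-group(1)"/>
        </s>
      </xsl:matching-substring>
      <!-- Anything left over, e.g. trailing text with no full stop -->
      <xsl:non-matching-substring>
        <p2><xsl:value-of select="."/></p2>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:function>

</xsl:stylesheet>
```

On the first sample para this should give four s elements. The full stop inside "4.3" is not taken as a boundary because no whitespace follows it. Abbreviations such as "e.g. " would still split early, so the boundary test may need tightening for other text.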


regards


--
Dave Pawson
XSLT XSL-FO FAQ.
http://www.dpawson.co.uk
