Re: [xsl] match string

Subject: Re: [xsl] match string
From: Anton Triest <anton@xxxxxxxx>
Date: Wed, 20 Oct 2004 22:17:32 +0200

Hi Wendell,

You want either:

(//text())[1]

(collects all the text nodes, returns only the first)

or

/descendant::text()[1]

(returns the first descendant text node).

OK... but now the problem is that neither of them seems to be valid in a match pattern.


<xsl:template match="para(//text())[1]"> Saxon says: "The only functions allowed in a pattern are id() and key()"
<xsl:template match="para/descendant::text()[1]"> Saxon says: "Axis in pattern must be child or attribute"


(The first one is strange: is text() really a function? And even then, why is "para//text()[1]" a valid pattern and "para(//text())[1]" isn't?)
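On the question above: text() is a node test, not a function. The complaint about functions most likely comes from the name "para" being followed directly by an opening parenthesis, which XPath reads as a call to a function named para(). The pattern "para//text()[1]" is legal, but there the [1] is evaluated per parent, so it matches the first text child of para and of every element below it, not the first text node of the paragraph as a whole. A minimal sketch to see this, where the <hit> element is only a made-up marker for illustration:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

   <!-- identity template: copy all elements -->
   <xsl:template match="*">
       <xsl:copy>
           <xsl:copy-of select="@*"/>
           <xsl:apply-templates/>
       </xsl:copy>
   </xsl:template>

   <!-- [1] counts among the text children of each parent separately,
        so several text nodes inside one para can match;
        <hit> is a hypothetical marker element -->
   <xsl:template match="para//text()[1]">
       <hit><xsl:value-of select="."/></hit>
   </xsl:template>

</xsl:stylesheet>
```

Applied to a para such as "A paragraph with some <b><i>nested</i> markup</b>", this should wrap all three text nodes, since each one is the first text child of its own parent (para, i and b).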

So I guess I'd have to use one of them in an apply-templates select attribute (instead of in match) but I'm stuck on how to combine that with the identity template. I could select "para(//text())[1]" but how would I select all the rest then (something like "para(//text())[position() > 1]" won't work).

Input XML:

<section>
<para>A paragraph without any markup</para>
<para> Beware of leading whitespace </para>
<para>A paragraph with some <i>markup</i> inside</para>
<para>A paragraph with some <b><i>nested</i> markup</b></para>
<para><em>This is a special case:</em> paragraph starts with markup</para>
<para><em>This</em> is difficult: only the first word has markup</para>
</section>


The goal is to isolate the first 3 words of each paragraph. Desired output:

<section>
<para><first>A paragraph without </first>any markup</para>
<para><first>Beware of leading </first>whitespace</para>
<para><first>A paragraph with </first>some <i>markup</i> inside</para>
<para><first>A paragraph with </first>some <b><i>nested</i> markup</b></para>
<para><em><first>This is a </first>special case:</em> paragraph starts with markup</para>
<para><em><first>This</first></em> is difficult: only the first word has markup</para>
</section>


The last one is especially difficult; ideally it would be:
<para><first><em>This</em> is difficult:</first> only the first word has markup</para>


Stylesheet so far:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" version="1.0" encoding="utf-8" indent="yes"/>
<xsl:strip-space elements="*"/>


<xsl:param name="split" select="3"/>

   <!-- identity template: copy all elements -->
   <xsl:template match="*">
       <xsl:copy>
           <xsl:copy-of select="@*"/>
           <xsl:apply-templates/>
       </xsl:copy>
   </xsl:template>

   <xsl:template match="para/text()[1]">  <!--  <  <  <  -->
       <xsl:call-template name="split-words"/>
   </xsl:template>

   <!-- split-words: recursively moves words from $str2 to $str1
        until $split words have been collected -->
   <xsl:template name="split-words">
       <xsl:param name="i" select="0"/>
       <xsl:param name="str1" select="''"/>
       <xsl:param name="str2" select="normalize-space(.)"/>
       <xsl:choose>
           <!-- done: wrap the collected words, output the rest as is -->
           <xsl:when test="$i = $split">
               <first><xsl:value-of select="$str1"/></first>
               <xsl:value-of select="$str2"/>
           </xsl:when>
           <xsl:otherwise>
               <xsl:choose>
                   <!-- move the next word (plus a space) from $str2 to $str1 -->
                   <xsl:when test="contains($str2,' ')">
                       <xsl:call-template name="split-words">
                           <xsl:with-param name="i" select="$i+1"/>
                           <xsl:with-param name="str1" select="concat($str1,substring-before($str2,' '),' ')"/>
                           <xsl:with-param name="str2" select="substring-after($str2,' ')"/>
                       </xsl:call-template>
                   </xsl:when>
                   <!-- no space left: $str2 is the last word, so take it and stop -->
                   <xsl:otherwise>
                       <xsl:call-template name="split-words">
                           <xsl:with-param name="i" select="$split"/>
                           <xsl:with-param name="str1" select="concat($str1,$str2)"/>
                           <xsl:with-param name="str2" select="''"/>
                       </xsl:call-template>
                   </xsl:otherwise>
               </xsl:choose>
           </xsl:otherwise>
       </xsl:choose>
   </xsl:template>


</xsl:stylesheet>

Output: correct except for the last 2 paragraphs

<section>
<para><first>A paragraph without </first>any markup</para>
<para><first>Beware of leading </first>whitespace</para>
<para><first>A paragraph with </first>some<i>markup</i> inside</para>
<para><first>A paragraph with </first>some<b><i>nested</i> markup</b></para>
<para><em>This is a special case:</em><first>paragraph starts with </first>markup</para>
<para><em>This</em><first>is difficult: only </first>the first word has markup</para>
</section>
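
The last two paragraphs go wrong because "para/text()[1]" matches the first text child of para, which in those cases comes after the <em>. One possible way around the pattern restrictions, offered as a sketch rather than a definitive fix, is to keep the pattern steps on the child axis and move the "first descendant text node" test into a predicate, where any axis is allowed. The body below simply wraps the whole text node in <first> to keep the example short:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
   <xsl:strip-space elements="*"/>

   <!-- identity template: copy all elements -->
   <xsl:template match="*">
       <xsl:copy>
           <xsl:copy-of select="@*"/>
           <xsl:apply-templates/>
       </xsl:copy>
   </xsl:template>

   <!-- matches a text node only if it is the first text node, in document
        order, below its nearest para ancestor; count(. | X) = 1 is true
        when . and X are the same node -->
   <xsl:template match="para//text()[count(. | (ancestor::para[1]//text())[1]) = 1]">
       <first><xsl:value-of select="."/></first>
   </xsl:template>

</xsl:stylesheet>
```

In the stylesheet above, the same pattern could replace "para/text()[1]" while the body keeps calling split-words; all other text nodes still fall through to the built-in rule that copies text. This only addresses the case where the first words sit inside markup. It does not produce the "ideal" result where the three words span several nodes.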


--
Anton
