[xsl] text nodes

Subject: [xsl] text nodes
From: James Cummings <James.Cummings@xxxxxxxxxxxxxx>
Date: Thu, 15 Apr 2004 11:15:23 +0100 (BST)
I'm converting a strangely formed html file, which I've
'tidy'ed to xhtml, into TEI xml.  The creators have included
line numbers inside the span they use to mark lines,
and always used 4 digits in order to left-justify them. They
have then marked the digits that aren't used with a <font>
tag in the same colour as the background.  (*sigh*).

What I want to achieve is to move the line numbers into
an attribute on a line element and remove them from the
text of that line.  My first attempt, using xsl:number,
position() and/or count() to just re-number the lines,
doesn't work because the line numbers have been editorially
decided in certain places to compensate for missing lines,
etc. and may have no bearing on the number of lines in that
particular file.

Assuming <b> is the removable <font> tag, but that
all the rest of the intellectual content needs to be
preserved, given:
-----
<root>
<a>This is a line</a>
<a>This is a line</a>
<a>This is a line</a>
<c><a>This is a line</a></c>
<a>5<b>000</b> line <d>five</d></a>
<a>This is a line</a>
<a>This is a line</a>
<a>This is a line</a>
<a>This is a line</a>
<c><a>10<b>00</b> line ten</a></c>
<a>This is a line</a>
<a>This is a line</a>
<a>This is a line</a>
<a>This is a line</a>
<a>This is a line and lots missing</a>
<a><d>1000 This is</d> also a line</a>
<a><d>This </d>is a line</a>
<a>3523 This is a later line</a>
<a>This is a line</a>
</root>
-----

My xsl currently looks like:
-----
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/"><div><xsl:apply-templates /></div></xsl:template>

<xsl:template match="b"/>

<xsl:template match="c"><p><xsl:apply-templates/></p></xsl:template>

<xsl:template match="d"><d><xsl:apply-templates/></d></xsl:template>

<xsl:template match="//a">
<xsl:variable name="num"><xsl:value-of select="translate(text()[1],
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz&amp;;-_ ', '')"/>
</xsl:variable>
<l><xsl:if test="not($num = '')">
<xsl:attribute name="n"><xsl:value-of
select="$num"/></xsl:attribute></xsl:if>
<xsl:apply-templates/></l>
</xsl:template>

</xsl:stylesheet>
-----

and so produces something like:

-----
<div>
<l>This is a line</l>
<l>This is a line</l>
<l>This is a line</l>
<p><l>This is a line</l></p>
<l n="5">5 line <d>five</d></l>
<l>This is a line</l>
<l>This is a line</l>
<l>This is a line</l>
<l>This is a line</l>
<p><l n="10">10 line ten</l></p>
<l>This is a line</l>
<l>This is a line</l>
<l>This is a line</l>
<l>This is a line</l>
<l>This is a line and lots missing</l>
<l><d>1000 This is</d> also a line</l>
<l><d>This </d>is a line</l>
<l n="3523">3523 This is a later line</l>
<l>This is a line</l>
</div>
-----

I've tried matching and translate()'ing
text()[1] to remove numbers, but like my way
of getting the line numbers, it fails if the
line number happens to be inside another
element, as with 1000 in my example.

So how do I a) grab the line number more
successfully for the @n and b) remove the
line number from the text of the line without
removing anything I shouldn't, or missing
one?
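
For what it's worth, here is a rough sketch of the direction I have
been considering, in XSLT 1.0. It is only a sketch and rests on
assumptions that may not hold for the real files: that a line number
is always the first whitespace-delimited token of the first non-blank
text node in the line (ignoring anything inside <b>), that it consists
only of digits, and that no line of real text begins with a number.
It looks at the first text node anywhere below <a>, not just text()[1],
so it should also catch a number that sits inside <d>.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:template match="/"><div><xsl:apply-templates/></div></xsl:template>

<!-- padding digits are simply dropped -->
<xsl:template match="b"/>

<xsl:template match="c"><p><xsl:apply-templates/></p></xsl:template>

<xsl:template match="d"><d><xsl:apply-templates/></d></xsl:template>

<xsl:template match="a">
  <!-- first non-blank text node at any depth in this line, outside <b> -->
  <xsl:variable name="first"
    select="(.//text()[not(ancestor::b)][normalize-space()])[1]"/>
  <!-- its first whitespace-delimited token (empty if there is no text) -->
  <xsl:variable name="tok"
    select="substring-before(concat(normalize-space($first), ' '), ' ')"/>
  <l>
    <!-- assumption: a token made only of digits is a line number -->
    <xsl:if test="$tok != '' and translate($tok, '0123456789', '') = ''">
      <xsl:attribute name="n"><xsl:value-of select="$tok"/></xsl:attribute>
    </xsl:if>
    <xsl:apply-templates/>
  </l>
</xsl:template>

<!-- text inside a line: strip the number only from the node it was read from -->
<xsl:template match="a//text()">
  <xsl:variable name="first"
    select="(ancestor::a[1]//text()[not(ancestor::b)][normalize-space()])[1]"/>
  <xsl:variable name="tok"
    select="substring-before(concat(normalize-space(.), ' '), ' ')"/>
  <xsl:choose>
    <xsl:when test="generate-id() = generate-id($first)
                    and $tok != ''
                    and translate($tok, '0123456789', '') = ''">
      <!-- keeps everything after the number, including the space that followed it -->
      <xsl:value-of select="substring-after(., $tok)"/>
    </xsl:when>
    <xsl:otherwise><xsl:value-of select="."/></xsl:otherwise>
  </xsl:choose>
</xsl:template>

</xsl:stylesheet>
```

The generate-id() comparison is there so that only the one text node
the number was read from gets altered; every other text node is
copied through untouched. The space that followed the number is left
in place, so the line text would still need trimming, and I have not
tried this against the real data.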

Suggestions? Solutions?

-James

---
Dr James Cummings, Oxford Text Archive, University of Oxford
James.Cummings at ota.ahds.ac.uk
