[xsl] normalize-space and newlines

Subject: [xsl] normalize-space and newlines
From: "Trevor Nicholls" <trevor@xxxxxxxxxxxxxxxxxx>
Date: Wed, 24 Aug 2005 13:31:14 +1200
Hello

Moving on from cleaning up yesterday's not-so-difficult-after-all "noise" in
my ex-FrameMaker file, there are still a couple of things which confuse me,
in particular some persistent newlines.

The first few lines of my original input document are as follows:
----
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="fm-doc.css" type="text/css" charset="UTF-8"?>
<XML>
<TITLE> New Product Features</TITLE><DIV>
<Heading1>
<A ID="pgfId-5564543"></A>
<DIV>
<IMAGE xml:link="simple" href="fm-doc-1.gif" show="embed" actuate="auto"/>
</DIV>
New Product Features<IMAGE xml:link="simple" href="fm-doc-2.gif"
show="embed" actuate="auto"/>
</Heading1>
<DIV>
<Heading2>
<A ID="pgfId-5564712"></A>
<DIV>
<IMAGE xml:link="simple" href="fm-doc-3.gif" show="embed" actuate="auto"/>
</DIV>
Improved<IMAGE xml:link="simple" href="fm-doc-4.gif" show="embed"
actuate="auto"/>
<A ID="pgfId-5564713"></A>
Performance<IMAGE xml:link="simple" href="fm-doc-5.gif" show="embed"
actuate="auto"/>
</Heading2>
<Body>
...
----

The first XSL pass strips out any completely empty nodes, removes the <A>
elements and also drops any images within headings (they're purely
ornamental anyway):

----
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output doctype-system="mydoc.dtd" method="xml" encoding="ISO-8859-1"/>

<xsl:template match="XML">
<xsl:copy>
<xsl:apply-templates mode="copy"/>
</xsl:copy>
</xsl:template>

<xsl:template match="@*|node()" mode="copy">
<xsl:if test="node() or * or text() or string(.)">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates mode="copy"/>
</xsl:copy>
</xsl:if>
</xsl:template>

<!-- don't copy A tags -->
<xsl:template match="A" mode="copy"/>

<!-- drop IMAGE tags within headings -->
<xsl:template match="IMAGE" mode="copy">
<xsl:if test="not(ancestor::*[self::Heading1 or self::Heading2 or
self::Heading3 or self::Heading4 or self::Heading5])">
<xsl:copy>
<xsl:copy-of select="@*"/>
</xsl:copy>
</xsl:if>
</xsl:template>

</xsl:stylesheet>
----

This leaves me with the following XML:
----
<XML><TITLE> New Product Features</TITLE><DIV><Heading1><DIV/>
New Product Features</Heading1><DIV><Heading2><DIV/>
Improved
Performance</Heading2><Body>
...
----

There are two more things I want to do before translating the content of
this document into my own structure. Firstly, dropping all <A> and certain
<IMAGE> elements has left me with a number of empty <DIV> elements, so pass 2
repeats the pass 1 technique to remove them. Secondly, a lot of the elements
now contain newlines and/or leading spaces, thanks to FrameMaker's choice of
when to break the lines (usually after an opening tag and before the
content).

So this is my second pass stylesheet:
----
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output doctype-system="mydoc.dtd" method="xml" encoding="ISO-8859-1"/>

<xsl:template match="XML">
<xsl:copy>
<xsl:apply-templates mode="copy"/>
</xsl:copy>
</xsl:template>

<!-- Drop any empty nodes -->
<xsl:template match="@*|node()" mode="copy">
<!-- non-empty: has children, is a text node, has value or attribute -->
<xsl:if test="node() or * or text() or string(.) or @*">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates mode="copy"/>
</xsl:copy>
</xsl:if>
</xsl:template>

<!-- fold newlines in text elements to spaces, etc. -->
<xsl:template match="text()" priority="2">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>

</xsl:stylesheet>
----

After passing the newer file through this transform I have:
----
<XML><TITLE> New Product Features</TITLE><DIV><Heading1>
New Product Features</Heading1><DIV><Heading2>
Improved
Performance</Heading2><Body>
...
----

As you can see, it has dropped the empty elements but done nothing to address
the newlines issue. That might not matter except for what is coming in my
third pass, where I try to remap the document content to a different
structure.
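
One thing that may be relevant, though I have not confirmed it is the cause:
the text() template in pass 2 carries no mode attribute, while every
xsl:apply-templates in that stylesheet uses mode="copy". In XSLT 1.0 a
template without a mode is only considered when apply-templates is called
without one. A sketch of the same pass 2 stylesheet with only the mode added
to the text() template would be:
----
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output doctype-system="mydoc.dtd" method="xml" encoding="ISO-8859-1"/>

<xsl:template match="XML">
<xsl:copy>
<xsl:apply-templates mode="copy"/>
</xsl:copy>
</xsl:template>

<!-- Drop any empty nodes (unchanged from pass 2 above) -->
<xsl:template match="@*|node()" mode="copy">
<xsl:if test="node() or * or text() or string(.) or @*">
<xsl:copy>
<xsl:copy-of select="@*"/>
<xsl:apply-templates mode="copy"/>
</xsl:copy>
</xsl:if>
</xsl:template>

<!-- mode="copy" added here so that apply-templates mode="copy"
     can select this template for text nodes -->
<xsl:template match="text()" mode="copy" priority="2">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>

</xsl:stylesheet>
----
Note that normalize-space() also strips leading and trailing whitespace from
each text node, so in mixed content (text directly next to an inline element)
the separating space would disappear as well.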

Here's the relevant part of the third stylesheet:
----
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output doctype-system="mydoc.dtd" method="xml" encoding="ISO-8859-1"/>

<xsl:template match="XML">
<xsl:variable name="doctitle"><xsl:value-of select="TITLE"/></xsl:variable>
<document title="{$doctitle}">
<xsl:apply-templates/>
</document>
</xsl:template>

<xsl:template match="TITLE"/>

<xsl:template match="DIV">
<xsl:variable name="sectitle">
<xsl:if test="Heading1"><xsl:value-of select="Heading1"/></xsl:if>
<xsl:if test="Heading2"><xsl:value-of select="Heading2"/></xsl:if>
<xsl:if test="Heading3"><xsl:value-of select="Heading3"/></xsl:if>
<xsl:if test="Heading4"><xsl:value-of select="Heading4"/></xsl:if>
<xsl:if test="Heading5"><xsl:value-of select="Heading5"/></xsl:if>
</xsl:variable>
<section title="{$sectitle}">
<xsl:apply-templates/>
</section>
</xsl:template>

<xsl:template match="Heading1"/>
<xsl:template match="Heading2"/>
<xsl:template match="Heading3"/>
<xsl:template match="Heading4"/>
<xsl:template match="Heading5"/>
----

The resulting XML starts like this:
----
<document title=" New Product Features"><section title="&#xA;New Product
Features"><section title="&#xA;Improved&#xA;Performance"><para>...
----
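
Independently of what pass 2 does, the titles could also be normalized at the
point where they are written out. The following is only a sketch of the third
stylesheet with normalize-space() applied inside the two attribute value
templates; it omits the doctype-system setting and any templates not shown
above, and I have not run it against the full document:
----
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0"
xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" encoding="ISO-8859-1"/>

<xsl:template match="XML">
<!-- normalize-space() takes the string value of TITLE and trims/folds it -->
<document title="{normalize-space(TITLE)}">
<xsl:apply-templates/>
</document>
</xsl:template>

<xsl:template match="TITLE"/>

<xsl:template match="DIV">
<xsl:variable name="sectitle">
<xsl:if test="Heading1"><xsl:value-of select="Heading1"/></xsl:if>
<xsl:if test="Heading2"><xsl:value-of select="Heading2"/></xsl:if>
<xsl:if test="Heading3"><xsl:value-of select="Heading3"/></xsl:if>
<xsl:if test="Heading4"><xsl:value-of select="Heading4"/></xsl:if>
<xsl:if test="Heading5"><xsl:value-of select="Heading5"/></xsl:if>
</xsl:variable>
<!-- the variable is converted to a string, then normalized -->
<section title="{normalize-space($sectitle)}">
<xsl:apply-templates/>
</section>
</xsl:template>

<xsl:template match="Heading1|Heading2|Heading3|Heading4|Heading5"/>

</xsl:stylesheet>
----
This leaves the text inside the document body untouched and only cleans the
values that end up in the title attributes.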

There may well be a lot of redundant/superfluous/naive code in these
stylesheets, as I am but an early learner, but can someone please explain
why my title and my headings are not being normalized as I expect?

TIA

Cheers
Trevor
