Re: [xsl] Removing unwanted space

Subject: Re: [xsl] Removing unwanted space
From: "Peter Flynn peter@xxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Fri, 4 Jun 2021 21:41:12 -0000
On 04/06/2021 00:54, Charles O'Connor coconnor@xxxxxxxxxxxx wrote:
> OK, I've tried this a bunch of ways and failed (using XSLT 2.0).
> 
> The XML I'm working with has a bunch of unwanted whitespace in all sorts of places, but looking just at paragraphs, it can have
> 
> <p>
> 	The rain in <bold>Spain</bold> <italic>is</italic> wet.
> </p>

This illustrates a recurrent problem: getting the white-space handling
logic right for the circumstances.

There is no built-in ltrim() or rtrim() function for removing
white-space from the start or end of character data in mixed content,
and there is no "interior" version of normalize-space() which leaves the
start and end untouched, but collapses white-space internally. All can
very simply be written, of course.
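
As a rough sketch of what such functions might look like in XSLT 2.0
(the ws: prefix, its namespace URI and the three function names are
made up for illustration, not part of any standard library):

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:ws="urn:example:whitespace"
    exclude-result-prefixes="xs ws">

  <!-- Hypothetical helper: remove white-space from the start only. -->
  <xsl:function name="ws:ltrim" as="xs:string">
    <xsl:param name="s" as="xs:string?"/>
    <xsl:sequence select="replace($s, '^\s+', '')"/>
  </xsl:function>

  <!-- Hypothetical helper: remove white-space from the end only. -->
  <xsl:function name="ws:rtrim" as="xs:string">
    <xsl:param name="s" as="xs:string?"/>
    <xsl:sequence select="replace($s, '\s+$', '')"/>
  </xsl:function>

  <!-- Hypothetical helper: collapse interior runs of white-space to a
       single space, keeping leading and trailing white-space as found. -->
  <xsl:function name="ws:collapse-interior" as="xs:string">
    <xsl:param name="s" as="xs:string?"/>
    <xsl:variable name="str" as="xs:string" select="string($s)"/>
    <xsl:variable name="core" as="xs:string"
                  select="normalize-space($str)"/>
    <xsl:sequence select="
      if ($core = '') then $str
      else concat(
        substring($str, 1,
                  string-length($str) - string-length(ws:ltrim($str))),
        $core,
        substring($str, string-length(ws:rtrim($str)) + 1))"/>
  </xsl:function>

</xsl:stylesheet>
```

With these definitions, ws:collapse-interior('  a   b  ') should give
'  a b  ', while a string consisting only of white-space is returned
unchanged.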

The xsl:strip-space setting removes white-space-only text nodes, such as
the one between the start of mixed content and a child element, but it
does not touch white-space at the start of mixed content where the
first non-white-space token is character data, because that white-space
belongs to a text node which also contains other characters.

In the absence of a schema or DTD to dictate where mixed content is
used, the indent="yes" attribute on the xsl:output element may indent
subelements in mixed content.
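
One way to keep indentation for the block structure while protecting
mixed content, assuming an XSLT 3.0 processor, is the
suppress-indentation attribute of xsl:output. A minimal sketch, where
listing p as the mixed-content element is an assumption based on the
sample data in this thread:

```xml
<xsl:stylesheet version="3.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- indent="yes" pretty-prints the output; suppress-indentation names
       the elements whose content the serializer must leave alone
       (here p, assumed to be the only mixed-content container). -->
  <xsl:output method="xml" indent="yes" suppress-indentation="p"/>

  <!-- Plain identity transform, so that only serialization differs. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

In XSLT 2.0 the safe fallback is indent="no", or a processor-specific
extension where one exists.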

My own rules for dealing with this are something like:

Pass all text nodes in mixed content through a template which will strip
space from the start (if it's the first text node in an element), the
end (if it's the last text node in an element), or both (if it's the
only text node in the element).

Test each subelement in mixed content for the immediate adjacency of
another element node BEFORE it, and output a single space to put back
the one omitted by the parser.
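
The adjacency test in the second rule can be sketched compactly by
checking whether the nearest preceding sibling node is an element. This
is only an illustration: the identity template and the list of inline
element names are assumptions, and it adds the space whether or not the
source had one there.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- White-space-only text nodes are removed from the source tree. -->
  <xsl:strip-space elements="*"/>

  <!-- Copy everything by default. -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- If the node immediately before an inline element is another
       element, the separating space was stripped: put one back. -->
  <xsl:template match="bold | italic | underline">
    <xsl:if test="preceding-sibling::node()[1][self::*]">
      <xsl:text> </xsl:text>
    </xsl:if>
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```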

For example, given

<doc>
<p>
        The rain in <bold>Spain</bold> <italic>is</italic> wet.
</p>
<p>
        <bold>The rain in Spain is wet.</bold>
</p>
<p>
    <anchor> </anchor>
    The rain in <bold> <underline> Spain </underline> </bold> <italic>
is </italic> wet.
</p>
</doc>

with

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="3.0">

  <xsl:output method="xml"/>
  <xsl:strip-space elements="*"/>

  <xsl:template match="doc | p">
    <xsl:element name="{name()}">
      <xsl:apply-templates select="node()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="bold | italic | underline | anchor">
    <xsl:call-template name="compensate-space"/>
    <xsl:element name="{name()}">
      <xsl:apply-templates select="node()"/>
    </xsl:element>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:choose>
      <!-- only text node in its parent: trim both ends -->
      <xsl:when test="not(preceding-sibling::text())
                      and
                      not(following-sibling::text())">
        <xsl:value-of
          select="replace(replace(.,'^\s+',''),'\s+$','')"/>
      </xsl:when>
      <!-- first text node: trim the start -->
      <xsl:when test="not(preceding-sibling::text())">
        <xsl:value-of select="replace(.,'^\s+','')"/>
      </xsl:when>
      <!-- last text node: trim the end -->
      <xsl:when test="not(following-sibling::text())">
        <xsl:value-of select="replace(.,'\s+$','')"/>
      </xsl:when>
      <!-- text between other text nodes: keep as is -->
      <xsl:otherwise>
        <xsl:value-of select="."/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <xsl:template name="compensate-space">
    <xsl:if test="preceding-sibling::node() and
                  preceding-sibling::* and
                  count(preceding-sibling::node()[1] |
                        preceding-sibling::*[1])=1">
      <xsl:text> </xsl:text>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

we get

<?xml version="1.0" encoding="UTF-8"?><doc><p>The rain in
<bold>Spain</bold> <italic>is</italic> wet.</p><p><bold>The rain in
Spain is wet.</bold></p><p><anchor/>The rain in
<bold><underline>Spain</underline></bold> <italic>is</italic> wet.</p></doc>

This does not address the conversion of the anchor element to NET
format, nor the indentation of p elements which would be conventional.

Peter
