Re: Normalizing string containing entities

Subject: Re: Normalizing string containing entities
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Tue, 18 Jul 2000 10:29:13 +0100
Pierre-Yves,

I'm actually thinking things are easier if you leave the mixed content
alone. That is, normalize whitespace in it, but don't wrap it in anything.
FWIW, I disagree with the notion that mixed content makes life harder.
Wrapping up the text nodes doesn't help with this problem. Actually, the
fact that mixed content is what distinguishes the element nodes needing
special treatment is where the solution lies...

Why not strip all the whitespace, then put it back where you want it? So....

<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes"/>

<xsl:template match="text()">
  <!-- strip extra whitespace from text nodes
       (including leading and trailing whitespace) -->
  <xsl:value-of select="normalize-space(.)"/>
</xsl:template>

<xsl:template match="*">
  <!-- default element rule is identity transform -->
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

<xsl:template match="*[../text()[normalize-space(.) != '']]">
  <!-- but this template matches any element appearing in mixed content -->
  <xsl:variable name="textbefore"
       select="preceding-sibling::node()[1][self::text()]"/>
  <xsl:variable name="textafter"
       select="following-sibling::node()[1][self::text()]"/>
  <!-- Either of the preceding variables will be an empty node set 
       if the neighbor node is not text(), right? -->
  <xsl:variable name="prevchar"
       select="substring($textbefore, string-length($textbefore))"/>
  <xsl:variable name="nextchar"
       select="substring($textafter, 1, 1)"/>

  <!-- Now the action: -->
  <xsl:if test="$prevchar != normalize-space($prevchar)">
  <!-- If the original text had a space before, add one back -->
    <xsl:text> </xsl:text>
  </xsl:if>

  <xsl:copy>
  <!-- Copy the element over -->
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>

  <xsl:if test="$nextchar != normalize-space($nextchar)">
  <!-- If the original text had a space after, add one back -->
    <xsl:text> </xsl:text>
  </xsl:if>

</xsl:template>

</xsl:stylesheet>

Using David's test:
<x>
<para>Some    text    <em>some    other   text</em>   remaining text</para>
<para>Some    text<em>    some    other   text</em>   remaining text</para>
<para>Some    text    <em>   some    other   text</em>   remaining text</para>
<para>Some    text    <em>some    other   text </em>   remaining text</para>
<para>Some    text    <em>some    other   text </em>remaining text</para>
<para> Some    text    <em>some    other   text</em>   remaining text </para>
</x>

We get this output (using Saxon):
<x>
   <para>Some text <em>some other text</em> remaining text</para>
   <para>Some text<em>some other text</em> remaining text</para>
   <para>Some text <em>some other text</em> remaining text</para>
   <para>Some text <em>some other text</em> remaining text</para>
   <para>Some text <em>some other text</em>remaining text</para>
   <para>Some text <em>some other text</em> remaining text</para>
</x>

If you wanted to get space before the <em> element in the second case or
after in the fifth case, the logic could be extended to catch them (left as
an exercise :-).
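
For what it's worth, here is one way that exercise might go. This is only a
sketch, not run against Saxon. It assumes that whitespace at the very start or
end of an element's own text should be treated as if it sat just outside the
element. The variable names firsttext, lasttext, firstchar and lastchar are
made up for illustration; the rest is the stylesheet above.

```xml
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="xml" indent="yes"/>

<xsl:template match="text()">
  <!-- strip extra whitespace from text nodes, as before -->
  <xsl:value-of select="normalize-space(.)"/>
</xsl:template>

<xsl:template match="*">
  <!-- default element rule is identity transform, as before -->
  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

<xsl:template match="*[../text()[normalize-space(.) != '']]">
  <!-- neighboring text outside the element, as before -->
  <xsl:variable name="textbefore"
       select="preceding-sibling::node()[1][self::text()]"/>
  <xsl:variable name="textafter"
       select="following-sibling::node()[1][self::text()]"/>
  <xsl:variable name="prevchar"
       select="substring($textbefore, string-length($textbefore))"/>
  <xsl:variable name="nextchar"
       select="substring($textafter, 1, 1)"/>

  <!-- hypothetical additions: the element's own first and last child,
       kept only if that child is a text node -->
  <xsl:variable name="firsttext" select="node()[1][self::text()]"/>
  <xsl:variable name="lasttext" select="node()[last()][self::text()]"/>
  <xsl:variable name="firstchar"
       select="substring($firsttext, 1, 1)"/>
  <xsl:variable name="lastchar"
       select="substring($lasttext, string-length($lasttext))"/>

  <!-- space before: either the text outside ended with whitespace, or the
       element's own text began with it and something precedes the element -->
  <xsl:if test="($prevchar != normalize-space($prevchar)) or
                (($firstchar != normalize-space($firstchar)) and
                 preceding-sibling::node())">
    <xsl:text> </xsl:text>
  </xsl:if>

  <xsl:copy>
    <xsl:copy-of select="@*"/>
    <xsl:apply-templates/>
  </xsl:copy>

  <!-- space after: same idea, mirrored -->
  <xsl:if test="($nextchar != normalize-space($nextchar)) or
                (($lastchar != normalize-space($lastchar)) and
                 following-sibling::node())">
    <xsl:text> </xsl:text>
  </xsl:if>
</xsl:template>

</xsl:stylesheet>
```

With this, the second and fifth paras of David's test should come out like the
others, with a single space on each side of the <em>. The sibling checks are
there so that an element sitting first or last in its parent doesn't push a
stray space to the edge of the parent; whether you want that is a judgment
call about your data.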

Good luck!
Wendell

At 10:20 AM 7/14/00 -0500, Imran wrote:
>> Consider, for example, the following:
>> 
>> <para>Some    text    <em>some    other   text</em>   remaining text</para>
>
><snip/>
>
>> The answer I found in several books is that we should not have elements
>> mixing CDATA and subelements. If we apply this rule, it is impossible to
>> represent the real structure of text.
>
>not entirely true.  There's nothing preventing you from marking up the plain
>text as "plain" the same way you mark-up the emphasized text as "em".
>
>eg, do this instead:
>
><para>
> <plain>Some    text    </plain>
> <em>some   other text</em>
> <plain>   remaining text</plain>
></para>
>
>
>the drawbacks to this solution are that the structure can seem more
>complicated, and it will use more memory (I think -- i'm no expert on
>that...).  in addition, often-times you don't even have control over the
>original structure, so you have to use somebody else's model which mixes
>content.
>    but if you can set up your structure this way, it makes XSLT processing
>much easier, as well as processing for other XML apps.
>
>(this doesn't actually help w/ your initial problem, though, b/c there's
>still the matter of stripping the whitespace in the middle of text nodes...)
>
>Imran
>
>
> XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
>
>

======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
