Re: [xsl] Collapsing run-on tag chains not working in saxon or xalan

From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Mon, 01 Nov 2004 15:06:25 -0500

Richard,

I'm guessing you are seeing what you see because MSXML is stripping all the whitespace-only text nodes from your input. Strictly speaking it should not be doing this, although in your case this bit of non-conformance proves to be a "feature" since you are traversing the preceding-sibling axis looking for the next element back, and you aren't expecting those nodes to be there.

To make sure they get stripped by all processors and not just by MSXML (which takes this liberty without asking), you could use

<xsl:strip-space elements="*"/>

at the top of your stylesheet. Your stylesheet will then behave in the more strictly conformant and less "helpful" processors the way it does in MSXML. Note, however, that this is a robust solution only for purely data-oriented XML, that is, XML with no mixed content. In mixed content, element markup and text are found together, and whitespace-only text nodes are very occasionally "really there" (a single space between two inline elements, for example). For data-oriented XML stripping is safe enough -- but your XML isn't like this.
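
As a rough sketch of the placement: xsl:strip-space is a top-level element, so it goes in as a child of xsl:stylesheet next to your templates. The xsl:preserve-space line is only an assumption for illustration (there is no "pre" element in the posted input); it shows how a named element can be exempted, since a specific name takes priority over the "*" wildcard.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Strip whitespace-only text nodes from every element in the source tree -->
  <xsl:strip-space elements="*"/>
  <!-- Hypothetical exemption: "pre" does not occur in the posted input -->
  <xsl:preserve-space elements="pre"/>

  <!-- Identity template, equivalent to the one in the posted stylesheet -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

Keep in mind that with elements="*" a lone space between two inline elements in a Paragraph is stripped too, which is exactly the risk with mixed content.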

Another approach would be to write heavier-duty code to filter these nodes out. Where you have:

(local-name(preceding-sibling::node()[1])='ilink')

you could use, instead,

(local-name(preceding-sibling::node()[self::* or normalize-space()][1])='ilink')

or better (the same in stronger XPath)

preceding-sibling::node()[self::* or normalize-space()][1]/self::ilink

which says, basically, "the first preceding-sibling node that is either an element or has a non-whitespace string value, if it's an ilink element".

This will work in any processor irrespective of whether you have "cosmetic" whitespace in your input.
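
Here is a minimal sketch of how the two ilink templates might look with that filter in place. Note the assumptions: applying the same predicate to the following-sibling tests, and holding the result in a variable named "next", are extensions for illustration rather than something taken from the posted stylesheet. The templates for b, i, sup and so on would need the same treatment and are left out here.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- First ilink of a run: opens the merged element -->
  <xsl:template match="ilink">
    <!-- Skip unless the nearest significant preceding sibling is not an ilink -->
    <xsl:if test="not(preceding-sibling::node()[self::* or normalize-space()][1]/self::ilink)">
      <!-- Assumed helper: the next significant sibling, if it is an ilink -->
      <xsl:variable name="next"
        select="following-sibling::node()[self::* or normalize-space()][1]/self::ilink"/>
      <ilink>
        <!-- A lone ilink keeps its own attributes -->
        <xsl:if test="not($next)"><xsl:copy-of select="@*"/></xsl:if>
        <xsl:apply-templates/>
        <!-- An empty node-set here simply does nothing -->
        <xsl:apply-templates select="$next" mode="following"/>
      </ilink>
    </xsl:if>
  </xsl:template>

  <!-- Later ilinks of a run: contribute content only -->
  <xsl:template match="ilink" mode="following">
    <xsl:variable name="next"
      select="following-sibling::node()[self::* or normalize-space()][1]/self::ilink"/>
    <xsl:apply-templates/>
    <!-- Last ilink of the run: carry its attributes over on an id element -->
    <xsl:if test="not($next) and @*"><id><xsl:copy-of select="@*"/></id></xsl:if>
    <xsl:apply-templates select="$next" mode="following"/>
  </xsl:template>

  <!-- Identity template for everything else -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
```

The whitespace-only text nodes between the merged ilinks are still copied through by the identity template, so expect some leftover line breaks after the merged element; they are cosmetic.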

Another approach is to make sure that there is never such invisible content in your input; it's probably good practice to keep cosmetic whitespace out of documents with mixed content (or at least to manage it very carefully) in any case.

If you're confused by what I mean about "whitespace only" or "cosmetic" text nodes, take a look at

<ilink>
  <sup>
    <b>
      <i>o</i>
    </b>
  </sup>
</ilink>
<ilink>
  <sub>
    <b>
      <i>t</i>
    </b>
  </sub>
</ilink>

...where I see (count 'em) 15 text nodes. (You see only two? Well, the others contain only whitespace, but they're there.)
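
If you want to check this for yourself, a small stylesheet along these lines (a sketch, not tested against your setup) reports how many text nodes a given processor actually sees. The fragment above needs a wrapper element to be a well-formed document, and the wrapper will usually add whitespace-only nodes of its own, so the totals will differ a little from mine.

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="text"/>

  <xsl:template match="/">
    <!-- Every text node in the source tree as this processor built it -->
    <xsl:text>all text nodes: </xsl:text>
    <xsl:value-of select="count(//text())"/>
    <!-- Those whose string value is empty after whitespace normalization -->
    <xsl:text>&#10;whitespace-only: </xsl:text>
    <xsl:value-of select="count(//text()[not(normalize-space())])"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
```

A processor that strips whitespace-only nodes while building the tree should report zero on the second line, which makes the difference between processors easy to see.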

Cheers,
Wendell

At 02:05 PM 11/1/2004, you wrote:
Dear All,

With the following xml and xsl, the Microsoft msxmldom 4 is producing
the expected output, but xalan 2.4, 2.6, and saxon 6.5.3 are not: they
all produce
the same, unexpected output.

The purpose of this code is to collapse run-on chains like
<ilink>foo</ilink><ilink id="1234">bar</ilink> into a single tag
<ilink>foo bar<id id="1234"/>
</ilink>. The xsl will also collapse run-on chains of b, i, sup, sub,
and similar tags.

Can anyone explain to me whether xalan and saxon just have a bug, and
preferably how to get xalan and/or saxon to transform the way msxml4
does here
(which I believe is correct)?

TMIA,
Richard Bondi


Sample input:


<Chapter>
<ChapterTitle>The chapter title must be immediately followed by a
section title</ChapterTitle>
<Body>
<SectionTitle>The section title</SectionTitle>
<Title>Internal Links: _ilink</Title>
<Paragraph>The internal link to Proteins and Membranes, optionally
including the cont_id would look like: <ilink id="1234">Proteins and
Membranes</ilink>. You could also just type <ilink>Proteins and
Membranes</ilink>. Another option is <ilink>CBIO|Proteins and
Membranes</ilink>, or
even just <ilink id="1234"/>. You can also do <ilink
id="1234">CBIO|Proteins and Membranes</ilink>. Spaces on either side
of a pipe (|) are
optional.</Paragraph>
<Paragraph>Feel free to include crazy formatting, as in <ilink>CBIO|</ilink>
<ilink>
<i>Proteins</i>
</ilink>
<ilink> and Membranes</ilink> or <ilink>
<b>
<i>Pr</i>
</b>
</ilink>
<ilink>
<sup>
<b>
<i>o</i>
</b>
</sup>
</ilink>
<ilink>
<sub>
<b>
<i>t</i>
</b>
</sub>
</ilink>
<ilink>
<b>
<i>ei</i>
</b>
</ilink>
<ilink>
<b>
<i>
<u>n</u>
</i>
</b>
</ilink>
<ilink>
<b>
<i>s</i>
</b>
</ilink>
<ilink id="1234">and Membranes</ilink>. </Paragraph>
</Body>
</Chapter>



Xsl:


<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output encoding="ISO-8859-1"/>
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
<!-- run of ilinks -->
<xsl:template match="ilink">
<xsl:if test="not(local-name(preceding-sibling::node()[1])='ilink')">
<ilink>
<xsl:if test="not(name(following-sibling::node()[1])='ilink')"><xsl:copy-of
select="@*"/></xsl:if>
<xsl:apply-templates/>
<xsl:if test="name(following-sibling::node()[1])='ilink'"><xsl:apply-templates
select="following-sibling::node()[1]" mode="following"/></xsl:if>
</ilink>
</xsl:if>
</xsl:template>
<xsl:template match="ilink" mode="following" >
<xsl:apply-templates/>
<xsl:if test="not(name(following-sibling::node()[1])='ilink') and
@*"><id><xsl:copy-of select="@*"/></id></xsl:if>
<xsl:if test="name(following-sibling::node()[1])='ilink'"><xsl:apply-templates
select="following-sibling::node()[1]" mode="following"/></xsl:if>
</xsl:template>
<!-- run of formatting tags, eg tags without attributes -->
<xsl:template match="b | i | sup | sub | u | smallcaps | red" priority="2">
<xsl:variable name="ename" select="name(.)"/>
<xsl:if test="not(local-name(preceding-sibling::node()[1])=string($ename))">
<xsl:element name="{$ename}">
<xsl:apply-templates/>
<xsl:if test="name(following-sibling::node()[1])=string($ename)"><xsl:apply-templates
select="following-sibling::node()[1]" mode="following"/>
</xsl:if>
</xsl:element>
</xsl:if>
</xsl:template>
<xsl:template match="b | i | sup | sub | u | smallcaps | red"
mode="following" >
<xsl:variable name="ename" select="name(.)"/>
<xsl:apply-templates/>
<xsl:if test="name(following-sibling::node()[1])=string($ename)"><xsl:apply-templates
select="following-sibling::node()[1]" mode="following"/>
</xsl:if>
</xsl:template>
<xsl:template match="@* | node()">
<xsl:copy >
<xsl:apply-templates select="@*" />
<xsl:apply-templates />
</xsl:copy>
</xsl:template>
</xsl:stylesheet>



Output using msxml4 (correct output, IMHO):


<Chapter>
<ChapterTitle>The chapter title must be immediately followed by a
section title</ChapterTitle>
<Body>
<SectionTitle>The section title</SectionTitle>
<Title>Internal Links: _ilink</Title>
<Paragraph>The internal link to Proteins and Membranes, optionally
including the cont_id would look like: <ilink id="1234">Proteins and
Membranes</ilink>. You could also just type <ilink>Proteins and
Membranes</ilink>. Another option is <ilink>CBIO|Proteins and
Membranes</ilink>, or
even just <ilink id="1234"/>. You can also do <ilink
id="1234">CBIO|Proteins and Membranes</ilink>. Spaces on either side
of a pipe (|) are
optional.</Paragraph>
<Paragraph>Feel free to include crazy formatting, as in
<ilink>CBIO|<i>Proteins</i> and Membranes</ilink> or <ilink>
<b>
<i>Pr</i>
</b>
<sup>
<b>
<i>o</i>
</b>
</sup>
<sub>
<b>
<i>t</i>
</b>
</sub>
<b>
<i>ei</i>
</b>
<b>
<i>
<u>n</u>
</i>
</b>
<b>
<i>s</i>
</b>and Membranes<id id="1234"/>
</ilink>. </Paragraph>
</Body>
</Chapter>



Output of xalan 2.4, 2.6.0, and instant saxon 6.5.3 (appears to do nothing, actually):

<Chapter>
<ChapterTitle>The chapter title must be immediately followed by a
section title</ChapterTitle>
<Body>
<SectionTitle>The section title</SectionTitle>
<Title>Internal Links: _ilink</Title>
<Paragraph>The internal link to Proteins and Membranes, optionally
including the cont_id would look like: <ilink id="1234">Proteins and
Membranes</ilink>. You could also just type <ilink>Proteins and
Membranes</ilink>. Another option is <ilink>CBIO|Proteins and
Membranes</ilink>, or
even just <ilink id="1234"/>. You can also do <ilink
id="1234">CBIO|Proteins and Membranes</ilink>. Spaces on either side
of a pipe (|) are
optional.</Paragraph>
<Paragraph>Feel free to include crazy formatting, as in <ilink>CBIO|</ilink>
<ilink>
<i>Proteins</i>
</ilink>
<ilink> and Membranes</ilink> or <ilink>
<b>
<i>Pr</i>
</b>
</ilink>
<ilink>
<sup>
<b>
<i>o</i>
</b>
</sup>
</ilink>
<ilink>
<sub>
<b>
<i>t</i>
</b>
</sub>
</ilink>
<ilink>
<b>
<i>ei</i>
</b>
</ilink>
<ilink>
<b>
<i>
<u>n</u>
</i>
</b>
</ilink>
<ilink>
<b>
<i>s</i>
</b>
</ilink>
<ilink id="1234">and Membranes</ilink>. </Paragraph>
</Body>
</Chapter>


======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
