Re: [xsl] XML apparently cannot be used for general text markup: whitespace gripe

From: "Thomas B. Passin" <tpassin@xxxxxxxxxxxx>
Date: Tue, 19 Mar 2002 10:37:21 -0500
[Chad Jones]
>
>  I've noticed a lot of xml-derived web pages out there have screwed up
> whitespace (words crammed together or an incorrect space before ending
> punctuation).
>
>  My conclusion is that blocks of straight text (such as paragraphs) cannot be
> further marked up with XML without screwing up spacing.
>
>  For example, can anyone get this simple document into HTML without either
> removing required spaces or adding inappropriate spaces?
>
>   <?xml version="1.0"?>
>   <book>
>      <par>
>       Is his name really <first>John</first>      <last>Doe</last>?
>     </par>
>   </book>
>

You have to distinguish between several different cases.

1) What you see in a browser.  Except inside special elements such as <pre>,
a browser collapses each run of whitespace characters (spaces, tabs, line
breaks) into a single space.  Any run of spaces that reaches the HTML output
therefore displays as a single space.
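
To make the collapsing concrete, here is a rough Python sketch.  It is only an
approximation of what a browser does with ordinary text (the function name is
made up for illustration, and real browsers apply further rules, for example
at the edges of block elements):

```python
import re


def collapse_whitespace(text: str) -> str:
    """Approximate browser rendering of ordinary (non-<pre>) text:
    every run of spaces, tabs and line breaks becomes one space."""
    return re.sub(r"[ \t\r\n\f]+", " ", text)


# The text content of the <par> element once the tags are gone.
source_text = "Is his name really John      Doe?"
rendered = collapse_whitespace(source_text)
print(rendered)
```

So the six spaces between the two names are harmless as long as they actually
reach the browser.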

2) What the XML parser does by default (or by instruction).  This determines
which whitespace is passed to the stylesheet processor, and in particular what
happens to whitespace-only text nodes such as the one between </first> and
<last> in your example.  If those nodes are removed, you get the
run-together words you have seen.

Microsoft's msxml3 processor (to name one) removes such nodes by default.
If you are using it in such a way that you can't tell it to preserve the
whitespace-only nodes, you can get the same effect by including an
xml:space='preserve' attribute in the root element of the xml file.  Then
your spaces will remain.
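
The effect can be imitated with Python's standard xml.dom.minidom, which
keeps whitespace-only nodes when it parses.  The helper below is a
hypothetical stand-in for a parser that drops them unless xml:space='preserve'
is in scope; it is a sketch of the behavior described above, not of msxml3
itself:

```python
from xml.dom import minidom

SAMPLE = """<?xml version="1.0"?>
<book>
  <par>
    Is his name really <first>John</first>      <last>Doe</last>?
  </par>
</book>"""


def text_of(node):
    """Concatenate all text beneath a node, in document order."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def drop_whitespace_only_nodes(node, preserve=False):
    """Remove whitespace-only text nodes, except where an enclosing
    element carries xml:space='preserve'."""
    if node.nodeType == node.ELEMENT_NODE:
        setting = node.getAttribute("xml:space")
        if setting == "preserve":
            preserve = True
        elif setting == "default":
            preserve = False
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if not preserve and child.data.strip(" \t\r\n") == "":
                node.removeChild(child)
        else:
            drop_whitespace_only_nodes(child, preserve)


# Without xml:space, the node between </first> and <last> disappears.
stripped = minidom.parseString(SAMPLE)
drop_whitespace_only_nodes(stripped)
print(repr(text_of(stripped)))

# With xml:space='preserve' on the root element, it survives.
kept = minidom.parseString(
    SAMPLE.replace("<book>", "<book xml:space='preserve'>"))
drop_whitespace_only_nodes(kept)
print(repr(text_of(kept)))
```

The first result contains "JohnDoe?", the second still has the spaces between
the two names.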

3) What the XSLT processor does.  This is controlled by the xsl:preserve-space
and xsl:strip-space elements in the stylesheet, which also operate on
whitespace-only text nodes.  By default the whitespace-only nodes are
preserved.

The result depends on the parser's behavior (default or instructed) combined
with whichever of the other instructions are present.  With the Microsoft
parser, the whitespace-only nodes are removed unless you instruct otherwise;
with Saxon they stay.  I have noticed that an xml:space='preserve' attribute
in the source file takes priority over xsl:strip-space in the stylesheet (at
least for msxml3 and Saxon), but I don't know if that is specified somewhere
or not (Mike Kay will no doubt give us the definitive answer here).
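
The priority I observed can be modelled in a few lines of Python.  This is
only a sketch of the observed behavior, not of either processor's internals;
the function name and the strip_names parameter (standing in for the element
names listed in xsl:strip-space) are made up for illustration:

```python
from xml.dom import minidom


def is_stripped(text_node, strip_names):
    """Decide whether a text node would be dropped.

    strip_names is a set of element names, with "*" meaning every
    element; an empty set means nothing is stripped (the default).
    """
    if text_node.data.strip(" \t\r\n") != "":
        return False  # text with real content is never stripped
    # The nearest xml:space attribute on an ancestor takes priority.
    ancestor = text_node.parentNode
    while ancestor is not None and ancestor.nodeType == ancestor.ELEMENT_NODE:
        setting = ancestor.getAttribute("xml:space")
        if setting == "preserve":
            return False
        if setting == "default":
            break
        ancestor = ancestor.parentNode
    parent = text_node.parentNode
    return "*" in strip_names or parent.tagName in strip_names


plain = minidom.parseString(
    "<par><first>John</first> <last>Doe</last></par>")
marked = minidom.parseString(
    "<par xml:space='preserve'><first>John</first> <last>Doe</last></par>")

# childNodes[1] is the single-space text node between the two names.
gap_plain = plain.documentElement.childNodes[1]
gap_marked = marked.documentElement.childNodes[1]

print(is_stripped(gap_plain, {"par"}))   # stripped: par is listed
print(is_stripped(gap_marked, {"par"}))  # kept: xml:space wins
print(is_stripped(gap_plain, set()))     # kept: nothing is listed
```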

Cheers,

Tom P



