Re: include text file

Subject: Re: include text file
From: Mike Brown <mike@xxxxxxxx>
Date: Thu, 16 Nov 2000 11:48:55 -0700 (MST)
DPawson@xxxxxxxxxxx wrote:
> Which would increase the crap found on websites relative to XSLT produced
> HTML today? One of the advantages of producing html via XSLT is that
> it's at least well formed.
> 
> -1 for me on this one Mike.

I think you were thinking I was advocating pulling in arbitrary HTML,
keeping it in that possibly nonsensical structure and serializing it in
exactly the same format. That's not what I am saying at all. As Tidy and
the HTML side of the DOM prove, there's no reason you can't parse hideous
HTML into a uniform node tree. Once you have it as a node tree, you can
serialize it however you like. The node tree doesn't have a concept of
well-formedness.

When you have non-well-formed HTML as a source document, to get it into a
node tree (DOM or XPath/XSLT) you have to make some decisions along
the way about element boundaries. It's not too difficult in theory to look
at the tags and, knowing the difference between block and inline elements
and the rules of containment and emptiness, to say that an element must
end here even though its end tag hasn't been specified, and to ignore end
tags for elements that have already been closed.
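
To make that concrete, here is a rough sketch in Python of the kind of
decision-making I mean. It is only an illustration, not how Tidy or any
browser actually does it. The names Node, TolerantBuilder, parse and dump
are made up for this message, the BLOCK and VOID sets are deliberately
incomplete assumptions, and attributes and whitespace-only text are simply
thrown away. It leans on the standard library's html.parser for tokenizing.

```python
from html.parser import HTMLParser

# Assumed, deliberately incomplete classifications for this sketch.
BLOCK = {"p", "ul", "ol", "li", "div"}
VOID = {"br", "hr", "img"}  # elements that never have content


class Node:
    def __init__(self, name, text=None):
        self.name = name      # element name, or "#text" / "#root"
        self.text = text      # character data for "#text" nodes
        self.children = []


class TolerantBuilder(HTMLParser):
    """Builds a node tree from sloppy HTML by guessing element boundaries."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]  # currently open elements

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            # Inline elements never span a block boundary: close them.
            while len(self.stack) > 1 and self.stack[-1].name not in BLOCK:
                self.stack.pop()
            # A new block ends an open <p>; a new <li> ends the previous <li>.
            top = self.stack[-1].name
            if top == "p" or (tag == "li" and top == "li"):
                self.stack.pop()
        node = Node(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        names = [n.name for n in self.stack]
        if tag in names[1:]:
            # Close the element and anything still open inside it.
            del self.stack[len(names) - 1 - names[::-1].index(tag):]
        # Otherwise it is a stray end tag and is ignored.

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1].children.append(Node("#text", text))


def parse(html):
    builder = TolerantBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def dump(node, depth=0):
    """Return the tree as a list of indented lines, one per node."""
    lines = []
    for child in node.children:
        if child.name == "#text":
            label = f"text {child.text!r}"
        else:
            label = f"element {child.name!r}"
        lines.append("  " * depth + label)
        lines.extend(dump(child, depth + 1))
    return lines
```

Calling "\n".join(dump(parse(some_html))) gives an indented listing of the
resulting tree. Note that html.parser lowercases tag names, so a <B> in the
source shows up as element 'b'.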

So, for example, the input document could be this ugly beast:

<p><B><i>bold<BR>italic paragraph</b>
<ul>
<li>italic list item
</UL>
</i>

Through some voodoo that I'm sure the IE and Mozilla developers have had
to develop several times over, it would become this node tree:


|___element 'p'
|     |___element 'B'
|           |___element 'i'
|                 |___text 'bold'
|                 |___element 'br'
|                 |___text 'italic paragraph'
|___element 'ul'   
      |___element 'li'
            |___text 'italic list item'

That is, the inline elements always end inside the block elements, whether
their closing tags are there or not, and the extraneous </i> is ignored.
When this tree is serialized through the HTML output method in XSLT, it's
going to come out as

<p><B><i>bold<br>italic paragraph</i></B></p>
<ul>
<li>italic list item
</ul>

which is not quite what was input and was not quite what was intended (the
list item is not italic), but c'est la vie; browsers will be making the
same judgement calls. The parser could easily be adapted, though, to make
the same kinds of mistakes that HTML document authors make, thinking that
<i> sets italic state and </i> unsets the state, in which case it could
inject an 'i' element ahead of any character data that isn't in an 'i'
already, until the state is explicitly unset.
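
A sketch of that "italic state" variant in Python might look like the
following. Again this is an assumption-laden illustration, not anyone's
real implementation: StatefulItalicBuilder, parse_stateful and serialize
are names invented here, the BLOCK and VOID sets are incomplete, and
attributes and whitespace-only text are dropped. The serializer is a
stand-in that writes every end tag, which is not exactly what an XSLT
processor's HTML output method would emit.

```python
from html import escape
from html.parser import HTMLParser

# Assumed, deliberately incomplete classifications for this sketch.
BLOCK = {"p", "ul", "ol", "li", "div"}
VOID = {"br", "hr", "img"}


class StatefulItalicBuilder(HTMLParser):
    """Treats <i> and </i> as setting and unsetting an italic state."""

    def __init__(self):
        super().__init__()
        self.root = ["#root", []]   # each element is [name, children]
        self.stack = [self.root]
        self.italic = False

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.italic = True
        if tag in BLOCK:
            # Inline elements still end at block boundaries.
            while len(self.stack) > 1 and self.stack[-1][0] not in BLOCK:
                self.stack.pop()
            top = self.stack[-1][0]
            if top == "p" or (tag == "li" and top == "li"):
                self.stack.pop()
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag == "i":
            self.italic = False     # state is unset even if no 'i' is open
        names = [n[0] for n in self.stack]
        if tag in names[1:]:
            del self.stack[len(names) - 1 - names[::-1].index(tag):]

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        parent = self.stack[-1]
        # Italic state is on but no 'i' element is open: inject one.
        if self.italic and all(n[0] != "i" for n in self.stack):
            wrapper = ["i", []]
            parent[1].append(wrapper)
            parent = wrapper
        parent[1].append(text)


def parse_stateful(html):
    builder = StatefulItalicBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def serialize(node):
    """Write the tree back out as markup with explicit end tags."""
    if isinstance(node, str):
        return escape(node, quote=False)
    name, children = node
    inner = "".join(serialize(child) for child in children)
    if name == "#root":
        return inner
    if name in VOID:
        return f"<{name}>"
    return f"<{name}>{inner}</{name}>"
```

Run over the ugly input above, serialize(parse_stateful(...)) wraps the
list item's text in its own 'i' element, which is closer to what the
author presumably intended.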

Either way, you'll have succeeded at creating a node-set from an HTML
document. This does not compromise any principles of XSLT.

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at         My XML/XSL resources:
webb.net in Denver, Colorado, USA           http://www.skew.org/xml/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list