Subject: Re: include text file
From: Mike Brown <mike@xxxxxxxx>
Date: Thu, 16 Nov 2000 11:48:55 -0700 (MST)

DPawson@xxxxxxxxxxx wrote:
> Which would increase the crap found on websites relative to XSLT produced
> HTML today? One of the advantages of producing html via XSLT is that
> its at least well formed.
>
> -1 for me on this one Mike.

I think you were thinking I was advocating pulling in arbitrary HTML,
keeping it in that possibly nonsensical structure and serializing it in
exactly the same format. That's not what I am saying at all.

As Tidy and the HTML side of the DOM prove, there's no reason you can't
parse hideous HTML into a uniform node tree. Once you have it as a node
tree, you can serialize it however you like. The node tree doesn't have a
concept of well-formedness.

When you have non-well-formed HTML as a source document, to get it into a
node tree (DOM or XPath/XSLT) you have to make some decisions along the
way about element boundaries. It's not too difficult in theory to look at
the tags and, knowing the difference between block and inline elements and
the rules of containment and emptiness, to say that an element must end
here even though its end tag hasn't been specified, and to ignore end tags
for elements that have already been closed.

So, for example, the input document could be this ugly beast:

  <p><B><i>bold<BR>italic paragraph</b>
  <ul>
    <li>italic list item
  </UL>
  </i>

Through some voodoo that I'm sure the IE and Mozilla developers have had
to develop several times over, it would become this node tree:

  |___element 'p'
  |     |___element 'B'
  |           |___element 'i'
  |                 |___text 'bold'
  |                 |___element 'br'
  |                 |___text 'italic paragraph'
  |___element 'ul'
        |___element 'li'
              |___text 'italic list item'

That is, the inline elements always end inside the block elements, whether
their closing tags are there or not, and the extraneous </i> is ignored.

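The boundary decisions described above can be sketched in a few lines of Python on top of the standard library's html.parser. This is only an illustration under stated assumptions, not how Tidy or any browser actually does it: the names SoupTreeBuilder, Node, parse and dump are hypothetical, the BLOCK, INLINE and VOID sets are a small hand-picked subset of HTML, and html.parser lowercases tag names, so the 'B' element shows up as 'b'.

```python
from html.parser import HTMLParser

# Assumed, deliberately incomplete element classification.
BLOCK = {"p", "ul", "ol", "li", "div", "blockquote"}
INLINE = {"b", "i", "em", "strong", "a", "span"}
VOID = {"br", "hr", "img"}  # elements that never have content

SOURCE = "<p><B><i>bold<BR>italic paragraph</b> <ul> <li>italic list item </UL> </i>"


class Node:
    def __init__(self, kind, value):
        self.kind = kind      # "root", "element" or "text"
        self.value = value    # tag name or text content
        self.children = []


class SoupTreeBuilder(HTMLParser):
    """Builds a node tree from tag soup using a stack of open elements."""

    def __init__(self):
        super().__init__()
        self.root = Node("root", None)
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK:
            # Inline elements always end before the next block element.
            while len(self.stack) > 1 and self.stack[-1].value in INLINE:
                self.stack.pop()
            # A 'p' cannot contain a block; an 'li' ends the previous 'li'.
            top = self.stack[-1].value
            if top == "p" or (tag == "li" and top == "li"):
                self.stack.pop()
        node = Node("element", tag)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Ignore end tags for elements that are not open (already closed).
        if tag not in [n.value for n in self.stack[1:]]:
            return
        # Otherwise close everything up to and including the named element.
        while self.stack.pop().value != tag:
            pass

    def handle_data(self, data):
        text = data.strip()
        if text:  # whitespace-only runs between tags are dropped
            self.stack[-1].children.append(Node("text", text))


def parse(html):
    builder = SoupTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def dump(node, depth=0):
    """Returns the tree as a list of indented lines."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + f"{child.kind} {child.value!r}")
        lines.extend(dump(child, depth + 1))
    return lines


print("\n".join(dump(parse(SOURCE))))
```

Fed the ugly beast above, this prints a tree of the same shape as the one shown: 'p' holding 'b', 'i' and their text, then a sibling 'ul' with one 'li', with the stray </i> dropped. A real parser would need far more containment rules than the two hard-coded here.
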
When this tree is serialized through the HTML output method in XSLT, it's
going to come out as

  <p><B><i>bold<br>italic paragraph</i></B></p>
  <ul>
    <li>italic list item
  </ul>

which is not quite what was input and was not quite what was intended (the
list item is not italic), but c'est la vie; browsers will be making the
same judgement calls.

The parser could easily be adapted, though, to make the same kinds of
mistakes that HTML document authors make, thinking that <i> sets italic
state and </i> unsets the state, in which case it could inject an 'i'
element ahead of any character data that isn't in an 'i' already, until
the state is explicitly unset.

Either way, you'll have succeeded at creating a node-set from an HTML
document. This does not compromise any principles of XSLT.

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at         My XML/XSL resources:
webb.net in Denver, Colorado, USA           http://www.skew.org/xml/

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list