Subject: Re: include text file
From: David Carlisle <davidc@xxxxxxxxx>
Date: Thu, 16 Nov 2000 19:30:25 GMT

> As Tidy and
> the HTML side of the DOM proves, there's no reason you can't parse hideous
> HTML into a uniform node tree.

I think you have to separate two cases.

Omitting end tags (and in some cases begin tags) isn't hideousness; it is
a standard SGML feature. The HTML DTD has sufficient declarations to allow
an SGML parser to infer the missing tags. HTML4.decl says

  FEATURES OMITTAG YES
           ^^^^^^^^^^^

which tells an SGML parser that this feature is being used, and, for
example, the DTD has

  <!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->
                 ^^^

which says you can omit both the begin and end tags of the body element
and the parser will infer them.

This is how sx (for example) in James Clark's sp suite can parse HTML (or
any SGML) files and output the parse tree in XML syntax. You want (I
think) to do the same without the overhead of writing to a file and
reading it back, so you just want a SAX-enabled SGML parser. I am sure I
saw an announcement of such a beast once, but a quick look on Google
failed to show anything likely.

> Through some voodoo that I'm sure the IE and Mozilla developers have had
> to develop several times over, it would become this node tree:

What the browsers do is something rather different. They are designed to
avoid errors at all costs, so they accept not just "non-well-formed" HTML
in the sense of HTML with omittable tags omitted, but try to accept any
random character stream that looks like it might have been intended to
perhaps be HTML. You could perhaps have a SAX interface to such a
permissive parser, but unlike the case above, here you'd have to accept
that the parse might fail in more interesting ways, and that the result of
any parse might be more the result of creative thinking by the parser
writer than something specified in the file....

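As a rough illustration of what "a SAX interface to an HTML parser" could look like, here is a minimal sketch assuming Python's standard html.parser and xml.sax modules. The names IMPLIED_END, VOID, HtmlToSax, Recorder and parse_events are hypothetical, and the two small tables are hand-written stand-ins for what a real SGML parser would derive from the DTD's omission flags and content models. It is closer to the "permissive" case than to a validating SGML parse.

```python
from html.parser import HTMLParser
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl

# Hypothetical, heavily abridged table: if the innermost open element is a
# key, then opening any tag in its set implies the key's omitted end tag.
IMPLIED_END = {
    "p": {"p", "ul", "ol", "div", "h1", "h2"},
    "li": {"li"},
}
# A few elements declared EMPTY in HTML 4; they never get an end tag.
VOID = {"br", "hr", "img"}


class HtmlToSax(HTMLParser):
    """Forward html.parser callbacks to a SAX ContentHandler."""

    def __init__(self, handler):
        super().__init__(convert_charrefs=True)
        self.handler = handler
        self.stack = []  # names of currently open elements

    def _close_top(self):
        self.handler.endElement(self.stack.pop())

    def handle_starttag(self, tag, attrs):
        # Infer end tags that the author omitted.
        while self.stack and tag in IMPLIED_END.get(self.stack[-1], ()):
            self._close_top()
        sax_attrs = AttributesImpl({k: v or "" for k, v in attrs})
        self.handler.startElement(tag, sax_attrs)
        if tag in VOID:
            self.handler.endElement(tag)
        else:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: silently ignored (a permissive choice)
        # Close everything up to and including the named element.
        while self.stack:
            top = self.stack[-1]
            self._close_top()
            if top == tag:
                break

    def handle_data(self, data):
        if data.strip():
            self.handler.characters(data)

    def close(self):
        super().close()
        while self.stack:  # close whatever is still open at end of input
            self._close_top()


class Recorder(ContentHandler):
    """Toy handler that records the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content.strip()))


def parse_events(html):
    recorder = Recorder()
    parser = HtmlToSax(recorder)
    parser.feed(html)
    parser.close()
    return recorder.events
```

Calling parse_events("<ul><li>one<li>two</ul>") yields balanced start and end events for both li elements even though neither end tag appears in the input. Any real ContentHandler (for example one feeding an XSLT processor) could be passed to HtmlToSax in place of Recorder.
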
David