Re: include text file

Subject: Re: include text file
From: David Carlisle <davidc@xxxxxxxxx>
Date: Thu, 16 Nov 2000 19:30:25 GMT
>  As Tidy and
> the HTML side of the DOM proves, there's no reason you can't parse hideous
> HTML into a uniform node tree.

I think you have to separate two cases. Omitting end tags (and in some
cases begin tags) isn't hideousness; it is a standard SGML feature, and
the HTML DTD has sufficient declarations to allow an SGML parser to
infer the missing tags.

HTML4.decl says

      FEATURES
        MINIMIZE
          DATATAG  NO
          OMITTAG  YES
          RANK     NO
          SHORTTAG YES

which tells an SGML parser that these features are being used.

And, for example, the DTD has

<!ELEMENT BODY O O (%block;|SCRIPT)+ +(INS|DEL) -- document body -->
which says you can omit both the begin and end tags of the body element
and the parser will infer them.
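
To make the inference concrete, here is a minimal Python sketch. It is
not a real SGML parser: the IMPLIED_END table is a hypothetical,
hand-written stand-in for what a parser would derive from the DTD's
content models, and the class and function names are invented for
illustration. It uses the standard library's html.parser purely as a
tokenizer, drops attributes for brevity, and does not handle empty
elements such as BR or omitted begin tags.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical stand-in for DTD content models: if the innermost open
# element is a key, a start tag from its set implies that element's end tag.
IMPLIED_END = {
    "p": {"p", "ul", "ol", "table", "h1", "h2", "div"},
    "li": {"li"},
}


class OmitTagNormalizer(HTMLParser):
    """Re-emits the input with omitted end tags written out explicitly."""

    def __init__(self):
        super().__init__()
        self.stack = []  # names of currently open elements
        self.out = []    # output fragments

    def _close_top(self):
        self.out.append("</%s>" % self.stack.pop())

    def handle_starttag(self, tag, attrs):
        # Supply the end tag the author left out, if this start tag implies one.
        if self.stack and tag in IMPLIED_END.get(self.stack[-1], ()):
            self._close_top()
        self.stack.append(tag)
        self.out.append("<%s>" % tag)  # attributes dropped for brevity

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray end tag: ignored in this sketch
        while self.stack[-1] != tag:
            self._close_top()
        self._close_top()

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def close(self):
        super().close()
        while self.stack:  # end of input closes whatever is still open
            self._close_top()


def normalize(text):
    parser = OmitTagNormalizer()
    parser.feed(text)
    parser.close()
    return "".join(parser.out)
```

With this table, normalize("<ul><li>one<li>two</ul>") writes out the
two LI end tags that the input omitted. A real SGML parser gets the
same effect from the DTD itself rather than from a hand-written table.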

This is how sx (for example) in James Clark's sp suite can parse
HTML (or any SGML) files and output the parse tree in XML syntax.
You want (I think) to do the same without the overhead of
writing to a file and reading back. So you just want a SAX-enabled SGML
parser. I am sure I saw an announcement of such a beast once, but
a quick search on Google failed to turn up anything likely.
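
Short of a SAX-enabled SGML parser, one way to avoid the intermediate
file is to pipe a converter's XML output straight into a SAX parser. The
following Python sketch assumes only that some command writes
well-formed XML to standard output; the helper name sax_from_command and
the ElementLogger handler are hypothetical, and the exact command line
for sx is deliberately left to the caller.

```python
import subprocess
import xml.sax


class ElementLogger(xml.sax.ContentHandler):
    """Collects element names in document order as the events arrive."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def sax_from_command(argv, handler):
    """Run argv and feed its standard output directly to a SAX handler."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE)
    try:
        # The parser reads from the pipe as data arrives; no file is written.
        xml.sax.parse(proc.stdout, handler)
    finally:
        proc.stdout.close()
        proc.wait()
    return handler
```

Here argv would be whatever command runs the SGML-to-XML converter on
your input. This still costs a process and a serialize/reparse round
trip, which is why a parser that emits SAX events directly would be
preferable.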

> Through some voodoo that I'm sure the IE and Mozilla developers have had
> to develop several times over, it would become this node tree:

What the browsers do is something rather different. They are designed to
avoid errors at all costs, so they accept not just "non-well-formed" HTML
in the sense of HTML with omissible tags omitted, but try to accept
any random character stream that looks like it might have been intended
to perhaps be HTML. You could perhaps have a SAX interface to such a
permissive parser, but unlike the case above, here you'd have to accept
that the parse might fail in more interesting ways, and that the result
of any parse might be more the result of creative thinking by the parser
writer than something specified in the file...
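
As an illustration of an event interface on a permissive parser, here is
a small Python sketch using the standard library's html.parser, which
reports tags as it sees them and does not validate against any DTD. The
EventRecorder class and the events function are hypothetical names for
this example.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Records SAX-like events without checking nesting or a DTD."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.events.append(("text", data.strip()))


def events(text):
    recorder = EventRecorder()
    recorder.feed(text)
    recorder.close()
    return recorder.events
```

Given the misnested input "<b><i>bold italic</b></i>", the recorder
simply reports the end tags in the order they appear. Any tree built on
top of these events reflects a decision made by the consumer, not
something the file specifies.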

