Subject: Re: Converting poorly formed HTML into well-formed XML
From: "Raffaele Sena" <raff@xxxxxxxxxxxx>
Date: Tue, 26 Sep 2000 17:46:22 -0700
> The HTML has been written by various web developers over a period of time,
> so it is very inconsistent in formatting, use of quotation marks in
> attributes, etc.

But, most of all, is the HTML correct, or conformant?

> Does XSLT have the facilities to directly read in the poorly formed HTML?
> And if so, what needs to be done.

Nope, unless it is valid XML (that would be XHTML).

> I've already begun developing the latter (custom) solution, but thought I'd
> double check to see if there are any HTML -> XHTML converters available.

Check out HTML Tidy, from the W3C (www.w3.org). It's a C application that cleans up messy (and incorrect) HTML and has an option to generate XHTML.

The main problem with developing your own converter is that either you are sure your HTML is correct (and so you only need to fix tag case, quotes in attributes, entities, and close the few HTML empty tags) or you will go crazy trying to cope with all the possible errors that the "official" web browsers accept but that would kill any simple parser.

Anyway, I would be interested in knowing if there is any similar application/package in Java. I would like to convert some pages (where I pretty much know the format) into XHTML and from there output the content in XML.

The only other package I found is in Perl (HTML::TreeBuilder). It has a smart input parser, and the author explains how he had to add a lot of hardcoded stuff to cover a lot of weird cases.

I wrote a few lines of Perl that read in an HTML file and output XHTML, if anyone is interested.

-- Raffaele
-----------------------------------------------------
raff@xxxxxxxxxxxx (::) http://www.aromatic.org/~raff/

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
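For the easy case mentioned above (HTML that is already structurally correct and only needs tag case, attribute quoting, entities and empty tags fixed), here is a rough sketch of what such a converter might look like. It is written in Python with the standard html.parser module, not the Perl script from this post and not HTML Tidy; the names XhtmlWriter and to_xhtml are made up for illustration. It does not repair bad nesting or missing end tags, and it would wrongly escape the contents of script and style elements, so treat it as a starting point only.

```python
from html import escape
from html.parser import HTMLParser

# HTML 4 elements that never have content; XHTML writes them as <br />.
EMPTY_TAGS = {"area", "base", "br", "col", "hr", "img",
              "input", "link", "meta", "param"}


class XhtmlWriter(HTMLParser):
    """Re-serialize structurally correct HTML using XHTML syntax rules.

    Hypothetical helper: it fixes syntax only, not document structure.
    """

    def __init__(self):
        # convert_charrefs=True delivers decoded text, which is re-escaped below.
        super().__init__(convert_charrefs=True)
        self.out = []

    def _open(self, tag, attrs, empty):
        # html.parser has already lower-cased tag and attribute names.
        parts = [tag]
        for name, value in attrs:
            # Minimized attributes (<option selected>) arrive with value None.
            value = name if value is None else value
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
        self.out.append("<" + " ".join(parts) + (" />" if empty else ">"))

    def handle_starttag(self, tag, attrs):
        self._open(tag, attrs, tag in EMPTY_TAGS)

    def handle_startendtag(self, tag, attrs):
        self._open(tag, attrs, True)

    def handle_endtag(self, tag):
        # Empty elements were closed when opened; drop a stray </br> etc.
        if tag not in EMPTY_TAGS:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)


def to_xhtml(html):
    """Return the input re-serialized with XHTML-style syntax."""
    writer = XhtmlWriter()
    writer.feed(html)
    writer.close()
    return "".join(writer.out)


if __name__ == "__main__":
    print(to_xhtml('<P ALIGN=center>Tom &amp; Jerry<BR><IMG SRC=a.gif ALT="x"></P>'))
```

If the input leaves out end tags or nests elements incorrectly, the output will still not be well-formed, which is the hard case where a tool like HTML Tidy is the better choice.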