Re: Converting poorly formed HTML into well-formed XML

Subject: Re: Converting poorly formed HTML into well-formed XML
From: "Raffaele Sena" <raff@xxxxxxxxxxxx>
Date: Tue, 26 Sep 2000 17:46:22 -0700
>
> The HTML has been written by various web developers over a period of time,
> so it is very inconsistent in formatting, use of quotation marks in
> attributes, etc.
>
    But, most of all, is the HTML correct (conformant)?

> Does XSLT have the facilities to directly read in the poorly formed HTML?
> And if so, what needs to be done.
>
    No, not unless it is well-formed XML (in which case it would be XHTML).
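
    To illustrate why: an XSLT processor reads its input through an XML
    parser, and an XML parser rejects typical sloppy HTML outright. A small
    sketch in Python, using the standard xml.etree parser as a stand-in for
    the parser inside an XSLT engine (the sample markup and the helper name
    is_well_formed are made up for illustration):

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if an XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: unquoted attribute, unclosed <BR> and <P> ...
sloppy_html = "<P align=center>Hello<BR>world"
# ... versus the same content written as XHTML.
xhtml = '<p align="center">Hello<br />world</p>'

print(is_well_formed(sloppy_html))
print(is_well_formed(xhtml))
```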

> I've already begun developing the latter (custom) solution, but thought I'd
> double check to see if there are any HTML -> XHTML converters available.
>
    Check out HTML Tidy, from the W3C (www.w3.org).
    It's a C application that cleans up messy (and incorrect) HTML and
    has an option to generate XHTML.

    The main problem with developing your own converter is that either you
    are sure your HTML is correct (so you only need to fix tag case, quote
    attribute values, fix entities and close the few empty HTML tags), or you
    will go crazy trying to cope with all the possible errors that the
    "official" web browsers accept but that would kill any simple parser.

    Anyway, I would be interested in knowing if there is any similar
    application/package in Java. I would like to convert some pages (where I
    pretty much know the format) into XHTML and from there output the content
    in XML.

    The only other package I found is in Perl (HTML::TreeBuilder). It has a
    smart input parser, and the author explains how he had to add a lot of
    hardcoded stuff to cover a lot of weird cases. I wrote a few lines of
    Perl that read in an HTML file and output XHTML, if anyone is interested.

-- Raffaele

-----------------------------------------------------
raff@xxxxxxxxxxxx (::) http://www.aromatic.org/~raff/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

