Converting poorly formed HTML into well-formed XML

Subject: Converting poorly formed HTML into well-formed XML
From: Joseph Fourness <josephf@xxxxxxxxxxx>
Date: Tue, 26 Sep 2000 15:56:20 -0700
Hello,

I am currently developing a system that converts arbitrary, poorly formed
HTML into well-formed XML (or XHTML).

Example of HTML:

<TD valign=TOP width="100">
<br>
<A href="http://www.mulberrytech.com" target=_top>Link</a>

The HTML has been written by various web developers over a period of time,
so it is very inconsistent in formatting, use of quotation marks in
attributes, etc.

I need to convert these files (approx. 120,000) into XHTML so that they can
be used with an XSLT processor.

Desired output:

<td valign="top" width="100">
<br/>
<a href="http://www.mulberrytech.com" target="_top">Link</a>

Does XSLT have the facilities to read poorly formed HTML directly?
If so, what needs to be done?

Or would the best solution be to design a custom parser that builds a DOM
from the poorly formed HTML, which is then either written out as an XML file
or processed directly by an XSLT stylesheet?

I've already begun developing the latter (custom) solution, but thought I'd
double-check whether any HTML -> XHTML converters are already available.
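
To make the custom approach concrete, here is a minimal sketch of it. It
assumes Python and its standard-library html.parser module, neither of which
is part of the system described above, and the names XhtmlWriter,
html_to_xhtml and KEYWORD_ATTRS are hypothetical. It lowercases tag and
attribute names, quotes every attribute value, self-closes empty elements
such as <br>, and closes anything still open at the end of the input.
Comments and doctype declarations are simply dropped.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never have content in HTML; written as <br/>, <hr/>, etc.
VOID_ELEMENTS = {"area", "base", "br", "col", "hr", "img",
                 "input", "link", "meta", "param"}

# Assumption: attributes whose values are case-insensitive keywords,
# so lowercasing them (valign=TOP -> valign="top") is safe.
KEYWORD_ATTRS = {"align", "valign"}


class XhtmlWriter(HTMLParser):
    """Hypothetical converter: tolerant HTML in, well-formed XML out."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of the serialized output
        self.open_tags = []  # stack of elements not yet closed

    def _format_attrs(self, attrs):
        parts, seen = [], set()
        for name, value in attrs:
            if name in seen:          # XML forbids duplicate attributes
                continue
            seen.add(name)
            if value is None:         # minimized attribute, e.g. nowrap
                value = name
            elif name in KEYWORD_ATTRS:
                value = value.lower()
            parts.append(" %s=%s" % (name, quoteattr(value)))
        return "".join(parts)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            self.out.append("<%s%s/>" % (tag, self._format_attrs(attrs)))
        else:
            self.out.append("<%s%s>" % (tag, self._format_attrs(attrs)))
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Input already used the <tag/> form; keep it self-closed.
        self.out.append("<%s%s/>" % (tag, self._format_attrs(attrs)))

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return                    # stray end tag: ignore it
        # Close every element opened since the matching start tag.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append("</%s>" % current)
            if current == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))  # re-escape &, < and > in text

    def close(self):
        super().close()               # flushes any buffered text first
        while self.open_tags:         # then close what is still open
            self.out.append("</%s>" % self.open_tags.pop())


def html_to_xhtml(html):
    writer = XhtmlWriter()
    writer.feed(html)
    writer.close()
    return "".join(writer.out)
```

Calling html_to_xhtml() on the three-line HTML example above yields the
desired output, plus a closing </td> that the sketch adds because the
fragment never closes that element. A real converter for 120,000 files would
also need HTML's implied end-tag rules (for instance, a new <p> or <td>
closing the previous one) and character-encoding detection, both of which
this sketch ignores.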

Thanks in advance for your help,

Joe Fourness


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

