RE: [xsl] html2xml?

Subject: RE: [xsl] html2xml?
From: naha@xxxxxxxxxx
Date: Wed, 27 Mar 2002 08:54:02 -0500 (EST)
Quoting Jarno.Elovirta@xxxxxxxxx:

> Hi,
> 
> > Has anyone done html to xml transformation?
> > Is it possible? If yes...how? A small example would be great =)
> 
> Run the HTML document through Tidy/JTidy/SX/OpenXML and then process
> like a normal XML document.

I recently tried Tidy (http://www.w3.org/People/Raggett/tidy/) for 
this but found it overly aggressive in its enforcement of the HTML
DTD.  For example, it transformed

    <a href="some-url">
        <div class="style">anchor text</div>
    </a>

into

    <a href="some-url">
    </a>

    <div class="style">anchor text</div>

which changes the semantics of the document: the anchor text is no
longer inside the link.  I've not found a configuration parameter to
control this behavior.

Wouldn't it be more correct to transform it to the following?

    <div class="style">
        <a href="some-url">anchor text</a>
    </div>
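
A minimal sketch of that rewrite, assuming the input is already
well-formed XML (the example above is, even though the HTML DTD rejects
it).  The class name, method name, and the match condition are my own
choices for illustration, not anything Tidy does: an <a> whose only
content is a single <div> is turned inside out, and everything else is
copied unchanged.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class AnchorInsideDiv {
    // Identity transform plus one rule (an assumption for this sketch):
    // an <a> containing exactly one element, a <div>, and no significant
    // text of its own becomes that <div> wrapping a copy of the <a>.
    static final String XSLT =
        "<xsl:stylesheet version='1.0'"
      + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
      + "<xsl:template"
      + " match='a[count(*)=1 and div and not(text()[normalize-space()])]'>"
      + "<div><xsl:copy-of select='div/@*'/>"
      + "<a><xsl:copy-of select='@*'/>"
      + "<xsl:apply-templates select='div/node()'/></a>"
      + "</div>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to a well-formed XML string.
    static String rewrite(String xml) throws Exception {
        Transformer t = TransformerFactory.newInstance()
            .newTransformer(new StreamSource(new StringReader(XSLT)));
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(xml)),
                    new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rewrite(
            "<a href=\"some-url\"><div class=\"style\">anchor text</div></a>"));
    }
}
```

This only helps once the page parses as XML at all, so it does not
replace a tag-soup parser; it just shows that the nesting I would have
preferred is easy to express as a stylesheet rule.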

I'm not familiar with any of the other suggested tools.

I was originally hoping for an all-XSL solution to my problem, but 
since it involves capturing and processing a tree (more like a 
shrub) of cross-referenced web pages, all of which need to be 
converted from HTML to XML first, I've started writing a Java program 
for this.
I was hoping to use the HEX parser 
(http://www-uk.hpl.hp.com/people/sth/java/hex.html) but the version 
I fetched appears to be buggy and the author's email address is no 
longer valid.
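
For what it's worth, the page-capturing part can be sketched as a
worklist with a "seen" set, so that cross-references between pages
don't cause a page to be processed twice.  This is only an outline
under assumptions of my own: PageWalker, walk, and the linksOf hook are
hypothetical names, and the hook stands in for the real work of
fetching a page, converting it to XML, and extracting its links.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

public class PageWalker {
    // Visits every page reachable from `start` exactly once, breadth first.
    // `linksOf` is a hypothetical hook: a real program would fetch the page,
    // convert it from HTML to XML, and return the URLs it references.
    static List<String> walk(String start,
                             Function<String, List<String>> linksOf) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        seen.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            for (String target : linksOf.apply(page)) {
                // Set.add returns false for pages already scheduled.
                if (seen.add(target)) {
                    queue.add(target);
                }
            }
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        // In-memory stand-in for a small site with cross-references.
        Map<String, List<String>> site = Map.of(
            "index", List.of("a", "b"),
            "a", List.of("b", "index"),
            "b", List.of("a"));
        System.out.println(
            walk("index", p -> site.getOrDefault(p, List.of())));
    }
}
```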

As for the other converters you suggested, Google found what are
apparently two different "OpenXML"s, one written in Java and one in
Delphi.  Could you provide a URL for the one you meant?
The only information I found about the Java one was on CNET 
(http://download.cnet.com/downloads/0-14492-100-5565652.html) and the
site it refers to as the "publisher" (http://www.openxml.org/) seems 
to be a shopping site.

This topic would be a great candidate for a FAQ.  I didn't find one 
on Dave Pawson's site.

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

