Re: [xsl] Converting HTML to plain text

Subject: Re: [xsl] Converting HTML to plain text
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Mon, 21 Jun 2004 19:56:15 -0400

Radha,

This is (much) harder, in the general case, than it looks, because of the famous looseness of what is considered "HTML". (This laxity was once touted by HTML developers as a desirable feature, and it probably did promote HTML's adoption in some respects.) Since HTML in practice is more or less tag soup, saving it as plain text more or less means implementing an HTML parser, which is a major part of a browser (XML parsing is trivial by comparison).

If you can constrain the "HTML" coming in to a controlled dialect of XML (using HTML tags if you like for browser friendliness), you can achieve this straightforwardly using stylesheets.

Alternatively, if you truly have to accept arbitrary "HTML", you can look at parsing technologies such as HTML tag soup parsers (see e.g. http://mercury.ccil.org/~cowan/XML/tagsoup/) that will emit XML SAX parsing events from HTML, or HTML DOM implementations that can write out XML from HTML, or an analogous tool; such a processor can be hooked into an XML pipeline.
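To make the "parsing events from tag soup" idea concrete, here is a minimal sketch. It does not use TagSoup itself; as a stand-in it assumes Python's standard-library html.parser, which is similarly tolerant of unclosed and mismatched tags and reports start-tag, end-tag and text events. The names TextExtractor, html_to_text, BLOCK_TAGS and SKIP_TAGS are made up for illustration, and the list of block-level tags is only a guess at what a real converter would need. It makes no attempt at laying out nested tables.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data from the events of a lenient HTML parse."""

    # Hypothetical, incomplete list of tags treated as line breaks.
    BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3", "table"}
    # Elements whose content should never reach the text output.
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of `html`, one line per block, whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    raw_lines = "".join(parser.parts).splitlines()
    lines = (" ".join(line.split()) for line in raw_lines)
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    # Unclosed <p> and <b> tags, as found in real-world "HTML".
    print(html_to_text("<p>Hello <b>world<p>Second &amp; last"))
```

A real pipeline would more likely have the event handlers emit well-formed XML for a stylesheet to consume, rather than flattening to text directly as this sketch does.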

When it comes to writing out nice plain text output with XSLT (which is a perfectly fine tool for the job), you may find multiple passes to be a good way to proceed in any case.
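The multi-pass idea is not specific to XSLT, so here is a rough sketch of it in Python, assuming the input has already been constrained to well-formed XML. The first pass reduces the markup to a list of plain strings, and the second pass worries only about layout. The function names, the restriction to p elements, and the default width of 40 columns are all assumptions made for the example; in XSLT the equivalent would be two stylesheets (or two modes) run in sequence.

```python
import textwrap
import xml.etree.ElementTree as ET


def pass_one_blocks(xml_source):
    """Pass 1: reduce each <p> element to one whitespace-normalized string."""
    root = ET.fromstring(xml_source)  # raises ParseError if not well-formed
    return [" ".join("".join(p.itertext()).split()) for p in root.iter("p")]


def pass_two_layout(blocks, width=40):
    """Pass 2: wrap each block to `width` columns, blank line between blocks."""
    return "\n\n".join(textwrap.fill(b, width=width) for b in blocks if b)


def xml_to_text(xml_source, width=40):
    """Run both passes in order."""
    return pass_two_layout(pass_one_blocks(xml_source), width)


if __name__ == "__main__":
    doc = "<body><p>One <b>two</b>\n three</p><p>Four</p></body>"
    print(xml_to_text(doc))
```

Keeping the passes separate means the layout step never has to know about markup, which is where most of the difficulty with things like tables tends to pile up.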

Generally, XSLT can't be used on arbitrary HTML, since an XSLT processor expects well-formed XML as input. A poor man's solution is to use a tool like HTML Tidy to turn the HTML into well-formed XML for XSLT, but I don't know if that could be adapted to your requirement for "a platform independent way" (IIRC it is compiled for different platforms).

But if in general HTML-to-formatted-plain-text were easy, I think we'd see lots more of it.

Cheers,
Wendell

At 03:15 PM 6/21/2004, you wrote:
I am looking around for any tools to convert HTML to plain text in a platform independent way. I also need support for UTF-8 encoding as well as well-formatted output of nested tables. What is the best way to do this? Is XSL-FO recommended for this? I looked around for any XSL to convert HTML to FO, but I did not find any.

The HTML-to-text tools I found on the web are mostly Windows-based. The remaining ones are not very good at converting nested tables in HTML to a properly rendered plain text format.

I appreciate any help.

-- RC


___&&__&_&___&_&__&&&__&_&__&__&&____&&_&___&__&_&&_____&__&__&&_____&_&&_
"Thus I make my own use of the telegraph, without consulting
the directors, like the sparrows, which I perceive use it
extensively for a perch." -- Thoreau


