RE: [xsl] Converting HTML to plain text

Subject: RE: [xsl] Converting HTML to plain text
From: "Radha Chandika" <RChandika@xxxxxxxxx>
Date: Tue, 22 Jun 2004 10:25:02 -0700

Hi Wendell,

I can constrain the HTML pages to be valid XML, so the hard part is solved.
But I still don't know of a good way to convert them to plain text. I
tried XSL-FO with Apache FOP, using the IBM developerWorks stylesheet for
converting XHTML to FO
(http://www-106.ibm.com/developerworks/library/x-xslfo2app/). The
conversion itself works, but the output has the following issues:
1) The formatting looks really bad. There is too much white space (most
words are separated by multiple space characters instead of one).
2) If I change the font family, font size and line height as suggested
on the Apache FOP site, consecutive lines overwrite each other.
3) I had to specify the column width in pt. If a column contains a word
that does not fit into the given width, the word is truncated instead
of wrapped.

Note: I have also tried some other tools.
1) w3m does a good job, but it is native C code and I cannot use it.
2) Red Hat has some Java classes, but their conversion is very primitive.
They don't format tables at all (the cells are rendered one after another
vertically instead of in a grid).
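
To make the table requirement concrete, here is a minimal Python sketch of
the grid rendering described above: cells are whitespace-normalized, wrapped
inside a fixed character width, and over-long words are split rather than
truncated. The function name render_grid and the col_width parameter are
hypothetical, and the sketch assumes a flat (non-nested) table whose rows
all have the same number of cells.

```python
import textwrap


def render_grid(rows, col_width=12):
    """Render rows of cell strings as a fixed-width plain-text grid.

    Assumes every row has the same number of cells (no nested tables).
    """
    border = "+" + "+".join("-" * (col_width + 2) for _ in rows[0]) + "+"
    out = [border]
    for row in rows:
        # Collapse runs of whitespace, then wrap each cell to the column
        # width. break_long_words=True splits a word that is wider than
        # the column instead of dropping its tail.
        wrapped = [
            textwrap.wrap(" ".join(cell.split()), width=col_width,
                          break_long_words=True) or [""]
            for cell in row
        ]
        height = max(len(lines) for lines in wrapped)
        for i in range(height):
            # Pad short cells so every column stays aligned.
            cells = [
                (lines[i] if i < len(lines) else "").ljust(col_width)
                for lines in wrapped
            ]
            out.append("| " + " | ".join(cells) + " |")
        out.append(border)
    return "\n".join(out)


if __name__ == "__main__":
    print(render_grid([["Tool", "Notes"],
                       ["w3m", "renders   tables as a grid"]], col_width=10))
```

Measuring widths in characters rather than points is what makes wrapping
straightforward here; nested tables would need the inner table rendered
first and its lines passed in as the content of the outer cell.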

-- Radha

-----Original Message-----
From: Wendell Piez [mailto:wapiez@xxxxxxxxxxxxxxxx] 
Sent: Monday, June 21, 2004 4:56 PM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] Converting HTML to plain text

Radha,

This is (much) harder, in the general case, than it looks. This is due to
the famous looseness of what is considered "HTML". (This laxity was once
touted by HTML developers as a desirable feature, and probably did promote
HTML's adoption in some respects.) HTML being more or less tag soup, saving
it as plain text more or less means implementing a parser, a major part of
a browser (XML parsing is comparatively trivial).

If you can constrain the "HTML" coming in to a controlled dialect of XML
(using HTML tags if you like for browser friendliness), you can achieve
this straightforwardly using stylesheets.
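
As a rough illustration of why well-formed input makes this tractable, the
sketch below walks a parsed XHTML tree recursively, much as a stylesheet's
templates would, and emits whitespace-normalized paragraphs. It is written
in Python with the standard xml.etree module rather than XSLT, and the names
xhtml_to_text and BLOCK_TAGS are hypothetical. It assumes the input is
well-formed and uses only the predefined XML entities.

```python
import xml.etree.ElementTree as ET

# Elements treated as paragraph-like blocks (an illustrative subset).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "tr"}


def local_name(tag):
    """Strip a '{namespace}' prefix such as the XHTML namespace."""
    return tag.rsplit("}", 1)[-1]


def xhtml_to_text(source):
    root = ET.fromstring(source)
    blocks = [[]]  # each block is a list of text fragments

    def walk(elem):
        name = local_name(elem.tag)
        if name in ("head", "script", "style"):
            return  # no visible text in these
        is_block = name in BLOCK_TAGS
        if is_block:
            blocks.append([])
        if elem.text:
            blocks[-1].append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                blocks[-1].append(child.tail)
        if is_block:
            blocks.append([])

    walk(root)
    # Collapse every run of whitespace to a single space per block.
    paragraphs = (" ".join("".join(parts).split()) for parts in blocks)
    return "\n\n".join(p for p in paragraphs if p)
```

In a stylesheet the same whitespace cleanup would typically be done with
normalize-space() and output method "text".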

Alternatively, if you truly have to accept arbitrary "HTML", you can look
at parsing technologies such as HTML tag soup parsers (see e.g.
http://mercury.ccil.org/~cowan/XML/tagsoup/) that will emit XML SAX parsing
events from HTML, or HTML DOM implementations that can write out XML from
HTML, or an analogous tool; such a processor can be hooked into an XML
pipeline.
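
To show what an event-emitting tolerant parser looks like in practice, here
is a small sketch using Python's standard html.parser module as a stand-in
for a tool like TagSoup (it is not TagSoup and does not emit SAX events; it
only illustrates the same event-driven approach). The class name
TextExtractor and the tag sets are hypothetical choices for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from possibly malformed HTML, one line per block."""

    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = [[]]    # each line is a list of text fragments
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.lines.append([])  # a block start begins a new line

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.lines.append([])

    def handle_data(self, data):
        if not self.skip_depth:
            self.lines[-1].append(data)

    def text(self):
        joined = (" ".join("".join(parts).split()) for parts in self.lines)
        return "\n".join(line for line in joined if line)


def html_to_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return parser.text()
```

Because the parser never requires closing tags, unclosed elements simply
produce a start event with no matching end event, and the handler copes by
starting a new line at every block-level start tag.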

When it comes to writing out nice plain text output with XSLT (which is a
perfectly fine tool for the job), you may find multiple passes to be a good
way to proceed in any case.

Generally, XSLT can't be used on arbitrary HTML. A poor man's solution is
to use a tool like HTML Tidy to make XML for XSLT from the HTML, but I
don't know if that could be adapted to your requirement for "a platform
independent way" (IIRC it is compiled for different platforms).

But if in general HTML-to-formatted-plain-text were easy, I think we'd see
lots more of it.

Cheers,
Wendell

At 03:15 PM 6/21/2004, you wrote:
>I am looking around for any tools to convert html to plain text in a
>platform independent way. I also need support for UTF-8 encoding as well
>as a well formatted output of nested tables. What is the best way to do
>this? Is XSL FO recommended for this? I looked around for any XSL to
>convert HTML to FO, but I did not find any.
>
>The HTML-to-text tools I found on the web are mostly Windows-based. The
>remaining ones are not very good at converting nested tables in HTML to a
>properly rendered plain text format.
>
>I appreciate any help
>
>-- RC
>
>--+------------------------------------------------------------------
>XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
>To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
>or e-mail: <mailto:xsl-list-unsubscribe@xxxxxxxxxxxxxxxxxxxxxx>
>--+--

___&&__&_&___&_&__&&&__&_&__&__&&____&&_&___&__&_&&_____&__&__&&_____&_&&_
     "Thus I make my own use of the telegraph, without consulting
      the directors, like the sparrows, which I perceive use it
      extensively for a perch." -- Thoreau  



