Subject: RE: [xsl] Converting HTML to plain text
From: "Radha Chandika" <RChandika@xxxxxxxxx>
Date: Tue, 22 Jun 2004 10:25:02 -0700

Hi Wendell,

I can constrain the HTML pages to be valid XML, so the hard part is solved. But I still don't know of a good way to convert it to plain text.

I tried XSL FO with Apache FOP, using the IBM developerWorks XSL for converting XHTML to FO (http://www-106.ibm.com/developerworks/library/x-xslfo2app/). It does a proper conversion, but it has the following issues:

1) The formatting looks really bad. There is too much white space (most of the words are separated by multiple space characters instead of one).
2) If I change the font family, font size and line height as suggested by the Apache FOP site, consecutive lines overwrite each other.
3) I had to specify the column width in pt. If a column happens to contain a word that does not fit into the given width, the word is truncated instead of wrapped.

Note: some other tools that I have tried:

1) w3m does a good job, but it is C++ code and I cannot use it.
2) Red Hat has some Java classes, but their conversion is very primitive. They don't format tables at all (each cell is rendered one after another vertically instead of in a grid-like layout).

-- Radha

-----Original Message-----
From: Wendell Piez [mailto:wapiez@xxxxxxxxxxxxxxxx]
Sent: Monday, June 21, 2004 4:56 PM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: Re: [xsl] Converting HTML to plain text

Radha,

This is (much) harder, in the general case, than it looks. This is due to the famous looseness of what is considered "HTML". (This laxity was once touted by HTML developers as a desirable feature, and probably did promote HTML's adoption in some respects.)

HTML being more or less tag soup, saving it as plain text more or less means implementing a parser, a major part of a browser (XML parsing is comparatively trivial).

If you can constrain the "HTML" coming in to a controlled dialect of XML (using HTML tags if you like for browser friendliness), you can achieve this straightforwardly using stylesheets.

Alternatively, if you truly have to accept arbitrary "HTML", you can look at parsing technologies such as HTML tag soup parsers (see e.g. http://mercury.ccil.org/~cowan/XML/tagsoup/) that will emit XML SAX parsing events from HTML, or HTML DOM implementations that can write out XML from HTML, or an analogous tool; such a processor can be hooked into an XML pipeline.

When it comes to writing out nice plain text output with XSLT (which is a perfectly fine tool for the job), you may find multiple passes to be a good way to proceed in any case.

Generally, XSLT can't be used on arbitrary HTML. A poor man's solution is to use a tool like HTML Tidy to make XML for XSLT from the HTML, but I don't know if that could be adapted to your requirement for "a platform independent way" (IIRC it is compiled for different platforms).

But if in general HTML-to-formatted-plain-text were easy, I think we'd see lots more of it.

Cheers,
Wendell

At 03:15 PM 6/21/2004, you wrote:
>I am looking around for any tools to convert html to plain text in a
>platform independent way. I also need support for UTF-8 encoding as well
>as a well formatted output of nested tables. What is the best way to do
>this ? Is XSL FO recommended for this ? I looked around for any XSL to
>convert HTML to FO, but I did not find any.
>
>The html to text tools I found on web are mostly windows based. The
>remaining are not very good at converting nested tables in HTML to a
>properly rendered plain text format.
>
>I appreciate any help
>
>-- RC
>
>--+------------------------------------------------------------------
>XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
>To unsubscribe, go to: http://lists.mulberrytech.com/xsl-list/
>or e-mail: <mailto:xsl-list-unsubscribe@xxxxxxxxxxxxxxxxxxxxxx>
>--+--

"Thus I make my own use of the telegraph, without consulting the directors, like the sparrows, which I perceive use it extensively for a perch." -- Thoreau
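
The route discussed in this thread (constrain the input to well-formed XML, then walk the tree and write text) can also be tried outside XSLT. The following is only a rough Python sketch and was not posted to the list. It assumes the input is already well-formed XHTML with no undeclared entities such as &nbsp;, it ignores colspan and rowspan, and the names xhtml_to_text, render and render_table are made up for illustration. It collapses runs of white space, wraps over-long words instead of truncating them, and draws nested tables as grids, which are the problems listed in the message above.

```python
import textwrap
import xml.etree.ElementTree as ET

# Elements treated as block-level; anything else is folded into running text.
BLOCK_TAGS = {"body", "div", "p", "h1", "h2", "h3", "ul", "ol", "li", "blockquote"}


def local(elem):
    """Return the tag name without any XML namespace part."""
    return elem.tag.rsplit("}", 1)[-1].lower()


def render(elem, width):
    """Render the content of elem as a list of text lines."""
    lines, buf = [], [elem.text or ""]

    def flush():
        text = " ".join("".join(buf).split())  # collapse runs of white space
        if text:
            # textwrap breaks words longer than width rather than dropping them
            lines.extend(textwrap.wrap(text, width))
        buf.clear()

    for child in elem:
        if local(child) == "table":
            flush()
            lines.extend(render_table(child, width))
        elif local(child) in BLOCK_TAGS:
            flush()
            lines.extend(render(child, width))
        else:  # inline element such as <b> or <a>: keep its text in the paragraph
            buf.append("".join(child.itertext()))
        buf.append(child.tail or "")
    flush()
    return lines


def table_rows(table):
    """Yield the rows of this table only, not those of nested tables."""
    for child in table:
        if local(child) == "tr":
            yield child
        elif local(child) in ("thead", "tbody", "tfoot"):
            yield from (tr for tr in child if local(tr) == "tr")


def render_table(table, width):
    """Render a table as an ASCII grid; cells may themselves contain tables."""
    rows = [[c for c in tr if local(c) in ("td", "th")] for tr in table_rows(table)]
    ncols = max((len(r) for r in rows), default=0)
    if ncols == 0:
        return []
    # Share the width evenly, leaving room for the "| " and " |" decoration.
    cell_w = max(1, (width - (3 * ncols + 1)) // ncols)
    rendered = [[render(c, cell_w) for c in r] + [[]] * (ncols - len(r)) for r in rows]
    col_w = [max((len(line) for row in rendered for line in row[i]), default=0)
             for i in range(ncols)]
    rule = "+" + "+".join("-" * (w + 2) for w in col_w) + "+"
    out = [rule]
    for row in rendered:
        height = max(1, max(len(cell) for cell in row))
        for k in range(height):
            cells = [(cell[k] if k < len(cell) else "").ljust(w)
                     for cell, w in zip(row, col_w)]
            out.append("|" + "|".join(" " + c + " " for c in cells) + "|")
        out.append(rule)
    return out


def xhtml_to_text(source, width=72):
    """Convert a well-formed XHTML string (or UTF-8 bytes) to plain text."""
    root = ET.fromstring(source)
    body = next((e for e in root.iter() if local(e) == "body"), root)
    return "\n".join(render(body, width))


if __name__ == "__main__":
    print(xhtml_to_text(
        "<html><body><p>Hello   <b>plain</b> text</p>"
        "<table><tr><th>Name</th><th>Notes</th></tr>"
        "<tr><td>w3m</td><td><table><tr><td>a</td><td>b</td></tr></table></td></tr>"
        "</table></body></html>"))
```

Because a cell is rendered by the same function that renders the page body, a table inside a cell simply becomes a block of lines in that cell, so nesting needs no special case. An XSLT stylesheet with text output could follow the same recursive shape, though padding cells to a common width is more work there.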