Subject: RE: [xsl] Converting HTML to plain text
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Tue, 22 Jun 2004 13:30:15 -0400
Good luck, Wendell
Hi Wendell,
I can constrain the HTML pages to be valid XML, so the hard part is solved. But I still don't know of a good way to convert them to plain text. I tried XSL-FO with Apache FOP, using the IBM developerWorks stylesheet for converting XHTML to FO (http://www-106.ibm.com/developerworks/library/x-xslfo2app/). It does a proper conversion, but it has the following issues:
1) The formatting looks really bad. There is too much white space (most words are separated by multiple space characters instead of one).
2) If I change the font family, font size and line height as suggested by the Apache FOP site, consecutive lines overwrite each other.
3) I had to specify the column width in pt. If a column happens to contain a word that does not fit into the given width, the word is truncated instead of wrapped.
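For illustration only, the whitespace and wrapping problems in points 1 and 3 can be avoided by skipping the FO step and walking the well-formed XHTML directly. The following is a minimal sketch in Python, not something from this thread: the function name `xhtml_to_text`, the `BLOCK_TAGS` set, and the default width of 72 characters are all assumptions made for the example, and it ignores tables, lists markers, and `<br/>`.

```python
import re
import textwrap
import xml.etree.ElementTree as ET

# Hypothetical choice of elements treated as paragraph-level blocks.
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "h4", "h5", "h6",
              "li", "blockquote"}


def local_name(tag):
    """Strip an XML namespace such as '{http://www.w3.org/1999/xhtml}p'."""
    return tag.rsplit("}", 1)[-1]


def xhtml_to_text(xhtml, width=72):
    """Convert well-formed XHTML to wrapped plain text (sketch only)."""
    root = ET.fromstring(xhtml)
    paragraphs = []
    current = []  # text fragments of the block being collected

    def flush():
        # Collapse every run of whitespace to a single space, then wrap.
        text = re.sub(r"\s+", " ", "".join(current)).strip()
        if text:
            # break_long_words=False keeps an over-long word whole on its
            # own line instead of cutting it.
            paragraphs.append(
                textwrap.fill(text, width=width, break_long_words=False))
        current.clear()

    def walk(elem):
        name = local_name(elem.tag)
        if name in ("head", "script", "style"):
            return  # nothing here belongs in the text output
        is_block = name in BLOCK_TAGS
        if is_block:
            flush()
        if elem.text:
            current.append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                current.append(child.tail)
        if is_block:
            flush()

    walk(root)
    flush()
    return "\n\n".join(paragraphs)
```

Here the width is counted in characters rather than points, so there is no font metric involved and lines cannot overlap. A word longer than the width is left intact on a line of its own rather than being truncated.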
Note: Some others that I have tried:
1) w3m does a good job, but it is C++ code and I cannot use it.
2) Red Hat has some Java classes, but their conversion is very primitive. They don't format tables at all (each cell is rendered one after another vertically instead of in a grid).
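As a rough illustration of the grid-like table rendering described in point 2, the sketch below lays out a simple XHTML table in fixed-width columns. It is written in Python purely as an example and is not taken from w3m or the Red Hat classes. The function name `table_to_text` and the `max_col_width` parameter are hypothetical, and the sketch assumes a table without namespaces, nested tables, or `colspan`/`rowspan`.

```python
import textwrap
import xml.etree.ElementTree as ET


def table_to_text(table_xml, max_col_width=20):
    """Render a simple <table> as a plain-text grid (sketch only)."""
    table = ET.fromstring(table_xml)

    # Collect the cell text of every row, collapsing runs of whitespace.
    rows = []
    for tr in table.iter("tr"):
        cells = [" ".join("".join(cell.itertext()).split())
                 for cell in tr if cell.tag in ("td", "th")]
        rows.append(cells)
    if not rows:
        return ""

    ncols = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of columns.
    rows = [row + [""] * (ncols - len(row)) for row in rows]

    # Each column is as wide as its longest cell, capped at max_col_width.
    widths = [max(1, min(max_col_width, max(len(row[i]) for row in rows)))
              for i in range(ncols)]

    separator = "-+-".join("-" * w for w in widths)
    out = []
    for index, row in enumerate(rows):
        if index:
            out.append(separator)
        # Wrap each cell to its column width; an empty cell is one blank line.
        wrapped = [textwrap.wrap(text, w) or [""]
                   for text, w in zip(row, widths)]
        height = max(len(lines) for lines in wrapped)
        for n in range(height):
            parts = [(lines[n] if n < len(lines) else "").ljust(w)
                     for lines, w in zip(wrapped, widths)]
            out.append(" | ".join(parts).rstrip())
    return "\n".join(out)
```

Cell text that is wider than its column wraps onto additional lines within the same row, so the columns stay aligned and nothing is truncated.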
-- Radha
======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD 20850                                  Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================