Re: [xsl] How transform PDF to XSL-FO

Subject: Re: [xsl] How transform PDF to XSL-FO
From: Eliot Kimber <ekimber@xxxxxxxxxxxx>
Date: Thu, 22 May 2008 10:21:29 -0500
On 5/22/08 2:29 AM, "Byomokesh Sahoo" <sahoo.byomokesh@xxxxxxxxx> wrote:

> Hi All,
> 
> My client needs to convert PDF to XSL-FO in their project. I know XSL-FO
> is used to convert XML to PDF. I searched on Google but am not getting a
> final solution. Is it possible or not? If possible, please give me some
> process or how it is

As Mike says, this is not really possible in the general case because of the
vagaries of PDF (and in some cases, programmatically impossible-to-decode
contents).

The PDFBox project (http://www.pdfbox.org/) provides a pretty solid Java
library for working with PDF and if your PDF is not too complex (for
example, you don't have tables or multiple columns or private character
encodings), then it should be possible. The PDFBox package comes with some
sample code for generating HTML and the same techniques could be used to
generate FO code that tries as best it can to duplicate the original
formatting. I have done this in the past with some success using PDFBox (I
was able to extract Arabic-language text and generate HTML pages that
accurately reproduced the page layout).

If all you want to do is reproduce the original look of the pages using FO,
that is probably completely doable because you don't really care about
reading order, just absolute positioning. On the other hand, if you want
something that is either reflowable or re-editable, that's a much, much
harder problem, and in the general case, intractable without human
intervention.
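
To make the absolute-positioning idea concrete, here is a minimal Python sketch. It assumes the text has already been extracted (by PDFBox or any other tool) into positioned runs; the `TextRun` record, the `page_to_fo` function, and the default US Letter page size are hypothetical names and values chosen for illustration, not part of PDFBox. It only shows how each run could be wrapped in an absolutely positioned `fo:block-container`, and it ignores fonts, rotation, and images.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape

FO_NS = "http://www.w3.org/1999/XSL/Format"


@dataclass
class TextRun:
    """Hypothetical record for one extracted piece of text (units: points)."""
    x: float          # left edge, measured from the left of the page
    y: float          # baseline, measured from the BOTTOM of the page (PDF style)
    font_size: float
    text: str


def page_to_fo(runs, page_width=612.0, page_height=792.0):
    """Return an XSL-FO document that places each run at its original position."""
    parts = [
        f'<fo:root xmlns:fo="{FO_NS}">',
        '  <fo:layout-master-set>',
        f'    <fo:simple-page-master master-name="page" '
        f'page-width="{page_width}pt" page-height="{page_height}pt">',
        '      <fo:region-body/>',
        '    </fo:simple-page-master>',
        '  </fo:layout-master-set>',
        '  <fo:page-sequence master-reference="page">',
        '    <fo:flow flow-name="xsl-region-body">',
    ]
    for run in runs:
        # PDF measures y upward from the bottom; FO "top" is measured downward.
        # Subtracting the font size only approximates the top of the line box.
        top = page_height - run.y - run.font_size
        parts.append(
            f'      <fo:block-container absolute-position="absolute" '
            f'left="{run.x:.2f}pt" top="{top:.2f}pt">'
        )
        parts.append(
            f'        <fo:block font-size="{run.font_size:.2f}pt">'
            f'{escape(run.text)}</fo:block>'
        )
        parts.append('      </fo:block-container>')
    parts += ['    </fo:flow>', '  </fo:page-sequence>', '</fo:root>']
    return "\n".join(parts)
```

A multi-page document would need one page sequence (or an explicit page break) per source page, since all of these containers are positioned relative to the same region.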

If you have multiple columns or otherwise irregular pages then you run into
the problem that PDF provides little or no indication of what the actual
reading order of the text is. This can only be solved with zoning, either
done manually or using some sort of very clever algorithm. For example, if
all the pages have the same arrangement of columns you can define a zone
definition and use that to guide the text extraction. But if each page has
different arrangements (as you would find in a modern grade-school textbook
or heavily designed magazine) then you pretty much have to zone each page by
hand.
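
As a rough illustration of what a zone definition buys you, the following Python sketch orders extracted text by zone first and by position second. The `Zone` class and `order_by_zones` function are hypothetical, the runs are assumed to be `(x, y, text)` tuples with `y` measured from the top of the page, and the zones are assumed to be supplied by a person (or some other process) in the intended reading order.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    """Hypothetical rectangle in page coordinates, y measured from the top."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x, y):
        return self.left <= x < self.right and self.top <= y < self.bottom


def order_by_zones(runs, zones):
    """Sort (x, y, text) runs by zone order, then top to bottom, left to right.

    `zones` must already be listed in the intended reading order.
    Runs that fall in no zone are placed at the end.
    """
    def key(run):
        x, y, _ = run
        for index, zone in enumerate(zones):
            if zone.contains(x, y):
                return (index, y, x)
        return (len(zones), y, x)

    return [text for _, _, text in sorted(runs, key=key)]


# Two-column page: the left column is read in full before the right one.
two_columns = [Zone(0, 0, 306, 792), Zone(306, 0, 612, 792)]
runs = [(320, 100, "C"), (72, 100, "A"), (72, 120, "B"), (320, 120, "D")]
print(order_by_zones(runs, two_columns))
```

Without the zones, a plain top-to-bottom sort would interleave the two columns line by line.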

There are also problems around detecting word boundaries, dehyphenating
words, detecting paragraph boundaries, and so on. It's fun stuff.
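
To give a flavor of these problems, here is a deliberately naive Python sketch of dehyphenation and paragraph detection. The function names, the `(y, text)` line representation (with `y` measured from the top of the page), and the `gap_factor` threshold are all assumptions made for illustration. Real documents break these heuristics easily: a genuinely hyphenated compound split at a line end would be joined incorrectly, for example.

```python
def join_lines(lines):
    """Join extracted lines into one string, undoing end-of-line hyphenation.

    Naive rule: a line ending in "-" followed by a line starting with a
    lowercase letter is treated as one broken word.
    """
    out = ""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if out.endswith("-") and line[0].islower():
            out = out[:-1] + line      # drop the hyphen, add no space
        elif out:
            out += " " + line
        else:
            out = line
    return out


def split_paragraphs(lines, gap_factor=1.5):
    """Group (y, text) lines into paragraphs using vertical gaps.

    A gap larger than `gap_factor` times the smallest positive gap on the
    page is taken to start a new paragraph.
    """
    lines = sorted(lines)
    if not lines:
        return []
    gaps = [b[0] - a[0] for a, b in zip(lines, lines[1:])]
    positive = [g for g in gaps if g > 0]
    threshold = gap_factor * min(positive) if positive else float("inf")
    paragraphs, current = [], [lines[0][1]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > threshold:
            paragraphs.append(join_lines(current))
            current = []
        current.append(text)
    paragraphs.append(join_lines(current))
    return paragraphs


page = [(100, "The conver-"), (112, "sion of PDF"), (136, "Second para")]
print(split_paragraphs(page))
```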

Note too that there are many data conversion providers who can do this sort
of work at a pretty low cost, so unless the input PDFs are pretty simple or
you have lots more time than money, it may be more cost-effective to
outsource the conversion.

Cheers,

Eliot

-- 
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 610.631.6770
www.reallysi.com
www.rsuitecms.com
