Subject: Re: [xsl] How transform PDF to XSL-FO
From: Eliot Kimber <ekimber@xxxxxxxxxxxx>
Date: Thu, 22 May 2008 10:21:29 -0500

On 5/22/08 2:29 AM, "Byomokesh Sahoo" <sahoo.byomokesh@xxxxxxxxx> wrote:

> Hi All,
>
> My client need convert PDF to XSL-FO in their project. I know XSL-FO
> is using to convert XML to PDF. I searched on google but not geting
> final solution. Its possible or not. If possible please give me some
> process or how it is

As Mike says, this is not really possible in the general case because of the vagaries of PDF (and in some cases, programmatically impossible-to-decode contents).

The PDFBox project (http://www.pdfbox.org/) provides a pretty solid Java library for working with PDF and if your PDF is not too complex (for example, you don't have tables or multiple columns or private character encodings), then it should be possible. The PDFBox package comes with some sample code for generating HTML and the same techniques could be used to generate FO code that tries as best it can to duplicate the original formatting.

I have done this in the past with some success using PDFBox (I was able to extract Arabic-language text and generate HTML pages that accurately reproduced the page layout).

If all you want to do is reproduce the original look of the pages using FO, that is probably completely doable because you don't really care about reading order, just absolute positioning.

On the other hand, if you want something that is either reflowable or re-editable, that's a much much harder problem, and in the general case, intractable without human intervention.

If you have multiple columns or otherwise irregular pages then you run into the problem that PDF provides little or no indication of what the actual reading order of the text is. This can only be solved with zoning, either done manually or using some sort of very clever algorithm. For example, if all the pages have the same arrangement of columns you can define a zone definition and use that to guide the text extraction.
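
To make the fixed-zone case concrete, here is a minimal Python sketch. It assumes the text fragments and their page coordinates have already been pulled out of the PDF by some library (PDFBox would be one option), and that y is measured downward from the top of the page. The Fragment and Zone classes, the page size, and the two-column layout are hypothetical names and values made up for illustration; they are not part of PDFBox or any other library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """A piece of text plus the position where it was drawn (hypothetical type)."""
    text: str
    x: float
    y: float  # assumed: distance from the top of the page, smaller is higher up


@dataclass(frozen=True)
class Zone:
    """A rectangular page region (hypothetical type)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, frag: Fragment) -> bool:
        return self.left <= frag.x < self.right and self.top <= frag.y < self.bottom


def read_in_zone_order(fragments, zones):
    """Return fragment texts zone by zone, in the order the zones are listed.

    Within a zone, fragments are sorted top-to-bottom, then left-to-right.
    Fragments outside every zone are dropped in this sketch; a real tool
    would probably report them instead.
    """
    ordered = []
    for zone in zones:
        in_zone = [f for f in fragments if zone.contains(f)]
        in_zone.sort(key=lambda f: (f.y, f.x))
        ordered.extend(f.text for f in in_zone)
    return ordered


# Assumed zone definition: two equal columns on a 612 x 792 point page,
# listed in the intended reading order.
TWO_COLUMN = [
    Zone("left column", 0, 0, 306, 792),
    Zone("right column", 306, 0, 612, 792),
]

# Fragments in an arbitrary order, as a PDF content stream might deliver them.
fragments = [
    Fragment("right-1", 320, 100),
    Fragment("left-2", 40, 120),
    Fragment("left-1", 40, 100),
    Fragment("right-2", 320, 120),
]

print(read_in_zone_order(fragments, TWO_COLUMN))
# prints ['left-1', 'left-2', 'right-1', 'right-2']
```

The same zone list can be reused for every page only because the layout is assumed to be identical from page to page.
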
But if each page has different arrangements (as you would find in a modern grade school textbook or heavily-designed magazine) then you pretty much have to zone each page by hand.

There are also problems around detecting word boundaries, dehyphenating words, detecting paragraph boundaries, and so on. It's fun stuff.

Note too that there are many data conversion providers who can do this sort of work at a pretty low cost, so unless the input PDFs are pretty simple or you have lots more time than money, it may be more cost effective to outsource the conversion.

Cheers,

Eliot

--
Eliot Kimber
Senior Solutions Architect
"Bringing Strategy, Content, and Technology Together"
Main: 610.631.6770
www.reallysi.com
www.rsuitecms.com