Re: [xsl] PDF to XSL-FO

Subject: Re: [xsl] PDF to XSL-FO
From: "W. Eliot Kimber" <eliot@xxxxxxxxxx>
Date: Fri, 22 Nov 2002 09:47:30 -0600
Noel Golding wrote:
One business problem would be to transform already existing pdf document to
xml.  FO could be the first step to getting it into an xml schema for the
organization.  I would benefit from such a tool.

I don't think that approach would bear much fruit. FO wouldn't add any value to what's already in the PDF, because an FO instance is itself a formatted document; the fact that it's in XML syntax doesn't mean anything for a to-XML process.


It would almost certainly be more effective to use traditional data conversion approaches to getting the data into XML.

In any case, the content of a PDF document is quite accessible using available PDF libraries such as PJ and the Adobe PDF library. If you could convert the PDF to FO, you could just as easily convert it to XML conforming to some specific DTD: the problem is essentially the same and has the same level of difficulty.

But it's also the case that recognizing semantic structures from the composed page as printed is usually easier than recognizing them from the raw PDF data stream. That's because something like a bold indented title has only one visual representation but could be defined in the PDF stream in any number of ways within the same PDF document, many of which would be quite difficult to recognize heuristically. It's not uncommon, for example, to find a PDF page that's defined as a sequence of text commands, each containing one character that is positioned independently of all the other characters. That makes it very difficult to determine things like word boundaries, line boundaries, and so on, without actually doing the rendering those text commands define. At that point, you might as well scan the rendition. You could, I suppose, use the PDF text content as a post-scan quality check, but that's just a frill.

That is, it's much easier for an OCR system to recognize a structural title by its formatting than it is for a PDF interpreter to recognize a structural title by the sequence of PDF commands that happen to have been used to render it.
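
To make the one-character-per-command case concrete, here is a minimal sketch of the kind of guesswork involved. It assumes some PDF library has already handed you each character with its position and advance width; the Glyph record, the rebuild_lines function, and the two tolerance values are all hypothetical names and numbers chosen for illustration, not part of any real library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Glyph:
    """One independently positioned character (hypothetical record)."""
    char: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF y grows upward)
    width: float  # advance width, in points


def rebuild_lines(glyphs, line_tol=2.0, gap_ratio=0.3):
    """Guess lines and word breaks from glyph positions alone.

    line_tol and gap_ratio are assumed tuning values, not standards.
    """
    lines = []  # each entry is [baseline, [glyphs on that line]]
    # Walk the glyphs from the top of the page down.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0] - g.y) <= line_tol:
            lines[-1][1].append(g)
        else:
            lines.append([g.y, [g]])

    result = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)
        mean_width = sum(g.width for g in row) / len(row)
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            # A gap wider than a fraction of the mean glyph width
            # is treated as a word boundary.
            gap = cur.x - (prev.x + prev.width)
            if gap > gap_ratio * mean_width:
                text += " "
            text += cur.char
        result.append(text)
    return result
```

Even this toy version has to invent two thresholds, and it would still be fooled by kerning, justified text, multiple columns, or rotated text, which is the point: the positions have to be interpreted much as a renderer would before any structure can be recognized.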

Of course, if you have tagged PDF (PDF with embedded markup), things may be a little easier, but the use of tagged PDF is, I think, pretty rare, and there are numerous limitations on what you can do with it in any case.

Cheers,

Eliot
--
W. Eliot Kimber, eliot@xxxxxxxxxx
Consultant, ISOGEN International

1016 La Posada Dr., Suite 240
Austin, TX  78752 Phone: 512.656.4139


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


