Noel Golding wrote:
> One business problem would be to transform already existing pdf document to
> xml. FO could be the first step to getting it into an xml schema for the
> organization. I would benefit from such a tool.
I don't think that approach would bear much fruit--FO wouldn't really
add any value to what's already in the PDF simply because an FO instance
is really a formatted document--the fact that it's in XML syntax doesn't
really mean anything for a to-XML process.
It would almost certainly be more effective to use traditional data
conversion approaches to getting the data into XML.
In any case, the content of a PDF document is quite accessible using
available PDF libraries such as PJ and the Adobe PDF library. If you
could convert the PDF to FO you could just as easily convert it to some
specific DTD--the problem is essentially the same and has the same level
of difficulty.
But it's also the case that recognizing semantic structures from the
composed page as printed is usually easier than recognizing them from
the raw PDF data stream--that's because something like a bold indented
title only has one visual representation but could be defined in the PDF
stream in any number of ways within the same PDF document, many of which
would be quite difficult to recognize heuristically. It's not uncommon, for
example, to find a PDF page that's defined as a sequence of text
commands, each containing one character that is positioned independently
of all the other characters. That makes it very difficult to determine
things like word boundaries, line boundaries, and so on, without
actually doing the rendering those text commands define. At that point,
you might as well scan the rendition. You could, I suppose, use the PDF
text content as a post-scan quality check, but that's just a frill.
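
To make the one-character-per-text-command problem concrete, here is a rough
sketch of the kind of guessing a converter is forced into. It assumes the
glyphs have already been pulled out of the PDF as (character, x, y, advance
width) tuples; that tuple layout, the function name, and both thresholds are
made up for illustration and do not come from any particular PDF library.

```python
from typing import List, Tuple

# Hypothetical glyph record: (character, x, y, advance width), in PDF units.
Glyph = Tuple[str, float, float, float]


def glyphs_to_lines(glyphs: List[Glyph],
                    y_tol: float = 2.0,
                    gap_ratio: float = 0.3) -> List[str]:
    """Guess line and word boundaries from independently positioned glyphs."""
    # PDF y coordinates grow upward, so sort top-to-bottom, then left-to-right.
    ordered = sorted(glyphs, key=lambda g: (-g[2], g[1]))

    # Group glyphs whose baselines lie within y_tol of the row's first glyph.
    rows: List[List[Glyph]] = []
    for g in ordered:
        if rows and abs(rows[-1][0][2] - g[2]) <= y_tol:
            rows[-1].append(g)
        else:
            rows.append([g])

    lines: List[str] = []
    for row in rows:
        row.sort(key=lambda g: g[1])
        text = row[0][0]
        for prev, cur in zip(row, row[1:]):
            # Horizontal gap between the end of one glyph and the next one.
            gap = cur[1] - (prev[1] + prev[3])
            # Assumed rule: a gap wider than a fraction of the previous
            # glyph's width is treated as a word break.
            if gap > gap_ratio * prev[3]:
                text += " "
            text += cur[0]
        lines.append(text)
    return lines
```

Both thresholds are guesses that would need tuning per document, and
kerning, justification, or rotated text would defeat them, which is exactly
why this approach tends to be fragile.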
That is, it's much easier for an OCR system to recognize a structural
title by its formatting than it is for a PDF interpreter to recognize a
structural title by the sequence of PDF commands that happen to have
been used to render it.
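
By contrast, once you are working from the rendered result, the title test
can be stated directly in terms of what the reader sees. The sketch below
assumes some earlier step (OCR or a renderer) has already reported each line's
text, boldness, and left indent; the RenderedLine record and the "bold and
indented further than the typical line" rule are illustrative assumptions,
not a description of any real OCR product.

```python
from dataclasses import dataclass
from statistics import median
from typing import List


@dataclass
class RenderedLine:
    """Hypothetical description of one line as it appears on the page."""
    text: str
    bold: bool
    indent: float  # distance from the left margin, in points


def find_titles(lines: List[RenderedLine]) -> List[str]:
    """Return the text of lines that look like bold indented titles."""
    if not lines:
        return []
    # Treat the median indent as the indent of ordinary body text.
    body_indent = median(line.indent for line in lines)
    return [line.text for line in lines
            if line.bold and line.indent > body_indent]
```

The rule is the same no matter how the PDF producer happened to emit the
text, which is the point of working from the rendition.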
Of course, if you have tagged PDF (PDF with embedded markup), things may
be a little easier, but the use of tagged PDF is, I think, pretty rare,
and there are numerous limitations on what you can do with it in any
case.
Cheers,
Eliot
--
W. Eliot Kimber, eliot@xxxxxxxxxx
Consultant, ISOGEN International
1016 La Posada Dr., Suite 240
Austin, TX 78752 Phone: 512.656.4139
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list