Re: [xsl] Re: Any Doc to XML converter ?

Subject: Re: [xsl] Re: Any Doc to XML converter ?
From: "Michael Beddow" <mbnospam@xxxxxxxxxxx>
Date: Thu, 21 Jun 2001 09:35:22 +0100
On 20 Jun 2001 Peter Flynn wrote:

> Which may very well be true, but the output is largely garbage.
> This whole discussion misses the major points:

You mean, "the output (probably) doesn't contain a meaningful
representation of the document's structure". Correct, but who claimed
that it did? That doesn't make it garbage. The first stage in brewing
Guinness doesn't result in something anyone would want to drink in their
local pub, but it doesn't get pumped into the sea: it gets processed
into something more useful.

>   1) Iff your Word document is formatted 100% exclusively with
>      named styles, robust conversion to meaningful XML is easily
>      possible with a number of packages, eg Enigma's DynaTag.

Or, if it really is formatted that way, with free extensions to products
you've already licensed, as per the program in question, or, better
still, with completely free ones like OpenOffice. Further proprietary
tools not needed.

>   2) If your Word document uses arbitrary manual styling, no
>      amount of footling around with conversions is going to
>      produce anything other than an XML-syntax'd representation
>      of all the styles.

Again, nobody disputes that, but nobody was claiming anything different.

>      You still have to undertake the hardest
>      part, which is interpreting all the styling cruft into some
>      meaningful markup.

Not quite. You don't *just* have the "styling cruft". You have, unless
you're very unlucky, various clues in the original document about how
it's articulated. Devise a system that identifies those clues and uses
them to rewrite the "cruft" and you're on your way. And yes, it is
"hard", but it can be automated, and without proprietary tools. Depends
a lot, obviously, on the input document. To see what this can look like
in practice, take a look at
Written for an audience of academic medievalists, so contains some
corner-cutting that will make some people here wince, but it illustrates
what I mean in this last remark.
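
To make the "identify the clues, then rewrite the cruft" idea a bit more
concrete, here is a minimal sketch, in Python rather than XSLT purely for
brevity. Everything in it is an assumption made up for illustration: the
flat input format, the attribute names (font-size, bold, indent), the
output element names (text, div, head, quote, p) and the thresholds do
not come from any particular converter or from the material referred to
above. It treats large bold paragraphs as headings that open a new
section and indented paragraphs as quotations.

```python
import xml.etree.ElementTree as ET

# Hypothetical "XML-syntax'd representation of the styles": a flat run of
# paragraphs that carry only formatting attributes, no structure.
FLAT = """<doc>
  <p font-size="16" bold="true">Chapter One</p>
  <p font-size="11">Some body text.</p>
  <p font-size="11" indent="36">A quoted passage.</p>
  <p font-size="16" bold="true">Chapter Two</p>
  <p font-size="11">More body text.</p>
</doc>"""


def classify(para):
    """Guess a structural role from formatting clues (illustrative rules)."""
    size = float(para.get("font-size", "0"))
    if para.get("bold") == "true" and size >= 14:
        return "head"
    if float(para.get("indent", "0")) > 0:
        return "quote"
    return "p"


def restructure(flat_xml):
    """Rewrite flat styled paragraphs as nested <div> sections."""
    source = ET.fromstring(flat_xml)
    text = ET.Element("text")
    section = None
    for para in source.findall("p"):
        role = classify(para)
        # A heading opens a new section; so does any leading material
        # that appears before the first heading.
        if role == "head" or section is None:
            section = ET.SubElement(text, "div")
        child = ET.SubElement(section, role)
        child.text = (para.text or "").strip()
    return text


if __name__ == "__main__":
    print(ET.tostring(restructure(FLAT), encoding="unicode"))
```

On the sample input this produces two div sections, each starting with
its heading. Real documents need rules tuned to their own habits, which
is where the "depends a lot on the input document" caveat bites.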

Michael Beddow
XML and the Humanities page: