Re: [xsl] XML to both ODF and OOXML conversion

Subject: Re: [xsl] XML to both ODF and OOXML conversion
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Thu, 15 Oct 2009 12:52:26 -0400

At 03:53 PM 10/14/2009, XMLizer wrote:
> Well ODF and OOXML have pretty different structure (ODF is block
> oriented and OOXML is run oriented)
>
> I would go for option 1 waiting for tools to do option 3
>
> That's funny that you didn't propose option 4 (XML --> OOXML --> ODF)
>
> Probably Option 2 could be interesting with a recognized intermediate
> format (DocBook, TEI or DITA) but I'm not sure there is converter
> available yet

I don't know enough about OOXML to address it, but I can say something about the ODF conversion.

I've written an application that makes ODF in two transformation steps. The first creates an "ODF-ready" intermediate format that allows XML-native structures such as arbitrary nesting of block or inline elements. The second converts this format into ODF (actually the second step has an internal pipeline of its own).

Since the hard parts of the conversion are all handled in the second phase, the first is fairly straightforward. (This is good for maintenance, plus it ought to be fairly easy to replace this XSLT with another written for a different source format.) The second-phase transform is not so straightforward, largely because in the general case the problems you have to solve tend to be harder than they look: nested structures (sometimes deeply nested) have to be flattened and sometimes split, while whatever relevant semantics are implied by their nesting must be preserved in the result. Fortunately, in XSLT 2.0 this is at least doable. (The advantage of exposing and formalizing an intermediate format is essentially that it gives you a place to manage and constrain exactly what sorts of semantics will be handled in the second phase, thus reducing the impact of the combinatorial explosion of N-ary relations between element types and attribute values in the source.)
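To make the flattening problem concrete, here is a minimal sketch in Python (not the XSLT 2.0 of the application described above, and not its actual intermediate format). The element names (doc, section, title, p, list, item) and the convention of encoding ancestry in a hyphenated style name are assumptions made purely for illustration. Every element is treated as a block, and inline markup is ignored.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested input: a paragraph that contains a list,
# with text both before and after the list.
SOURCE = (
    "<doc><section><title>Intro</title>"
    "<p>Before the list"
    "<list><item>one</item><item>two</item></list>"
    "after the list</p>"
    "</section></doc>"
)


def flatten(elem, path=()):
    """Yield (style_name, text) pairs in document order.

    The style name records the chain of ancestors, so the semantics
    implied by nesting survive in the flat output. Text that follows
    a nested child is emitted as a new paragraph carrying the parent's
    style, which is the "split" case.
    """
    path = path + (elem.tag,)
    style = "-".join(path)
    # Text before the first child becomes its own flat paragraph.
    if elem.text and elem.text.strip():
        yield style, elem.text.strip()
    for child in elem:
        # Nested blocks are emitted in place, one level deeper.
        yield from flatten(child, path)
        # Text after a nested block resumes the parent as a new paragraph.
        if child.tail and child.tail.strip():
            yield style, child.tail.strip()


flat = list(flatten(ET.fromstring(SOURCE)))
for style, text in flat:
    print(f"{style}: {text}")
```

In this sketch the single source paragraph comes out as two flat paragraphs with the same style, one on each side of the list items. Even this toy version has to decide how the continuation should be styled, which hints at why the general case is harder than it looks.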

The experience of building this has given me, I think, some insight into the problem:

1. This two-phase approach does work, and it simplifies the problem of getting from descriptive tagging into word processors. The intermediate format, however (as I think Eliot suggested), does have to be designed for the purpose. An arbitrary descriptive format such as NLM or DocBook really won't do -- although it could work nicely as the original format (and so also as a post-editorial, "pre-production" format for getting to the intermediate format).

In fact, it works well enough that I think there is real potential for ODF applications like OpenOffice to expose such an intermediate format as a more robust and easier option for interfacing with native XML formats than ODF itself.

2. Round-tripping is an entirely different kettle of fish. One reason this approach works is that you can map highly descriptive elements (such as document metadata) into formatting analogues -- when (and only when) you want to -- in effect making a page design for them. Getting back the other way isn't going to be easy, even if the requirements can be defined in such a way as to make it technically feasible. Plus, there are myriad tricky technical problems with structural inferencing, etc.

Not that round-tripping won't eventually be done. But I think we may have to see significant evolution in word processors before it gets to be nice, transparent and stress-free. So far, word processors have been almost entirely beholden to the requirement to be what I call "paintbrush applications", which are very valuable for many purposes -- just not for creating and managing structurally sound documents for semantic interchange.
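One way to see why the return trip in point 2 is hard is sketched below in Python. It assumes the forward mapping from descriptive elements to formatting styles is many-to-one. That is an assumption for illustration, not something stated above, and the element and style names are hypothetical.

```python
# Hypothetical forward mapping: descriptive element -> word-processor style.
STYLE_FOR = {
    "article-title": "Title",
    "author": "Subtitle",
    "affiliation": "Subtitle",  # shares a style with "author"
    "abstract": "Quotations",
    "p": "Text body",
}


def invert(mapping):
    """Group source elements by the style they were mapped to."""
    inverse = {}
    for element, style in mapping.items():
        inverse.setdefault(style, []).append(element)
    return inverse


candidates = invert(STYLE_FOR)

# Styles that more than one element maps to cannot be reversed
# from the style name alone.
ambiguous = {style: elems for style, elems in candidates.items() if len(elems) > 1}
print(ambiguous)
```

Going forward, each element has exactly one style. Going back, a paragraph styled "Subtitle" could be either an author or an affiliation, so the reverse transform would need extra information, or inference from context, to decide.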


Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.      
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
