Deirdre,
Your project sounds very ambitious. Up-conversion is a challenging and
fascinating business, which we're all going to learn much more about. You
have several conference papers' worth of material here, I bet.
Briefly,
At 08:15 PM 8/12/2004, you wrote:
But I've been thinking, based on the comments from the list, that a better
process might be eliminating the perl script entirely.
Maybe, but you'll need something at least as good to do the work it's
doing, and Perl is really good at regular expressions and string processing
generally.
(Personally I might have tried it in Python, but that's mainly because I
can count the lines of Perl I've written in my life on one hand. Of course,
I can count in binary on my hands, which gets me higher than five.)
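For what it's worth, here is roughly the kind of thing I have in mind, as a
minimal Python sketch rather than a replacement for your script. It assumes
(my guesses, not anything from your post) that paragraphs are separated by
blank lines and that emphasis is marked _like this_; the element names
manuscript, para and em are made up for illustration.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical input convention: _emphasized text_ between underscores.
EMPHASIS = re.compile(r"_(\S(?:.*?\S)?)_")


def upconvert(text):
    """Turn blank-line-separated plain text into a simple XML string."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    out = ["<manuscript>"]
    for para in paragraphs:
        # Collapse internal whitespace, then escape &, < and > before
        # adding markup, so the tags we insert are not escaped themselves.
        body = escape(" ".join(para.split()))
        body = EMPHASIS.sub(r"<em>\1</em>", body)
        out.append("  <para>" + body + "</para>")
    out.append("</manuscript>")
    return "\n".join(out)


if __name__ == "__main__":
    print(upconvert("It was _very_ dark.\n\nTom & Jerry ran."))
```

The point is only that the regexp work stays small and testable on its own;
whatever your real markers are would go where EMPHASIS is.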
Now that it has some regexp support, XSLT 2.0 should be at least a credible
option here, but its features have yet to be stress-tested, to my knowledge,
and tool support is still somewhat up in the air. (I believe Mike Kay is
speaking on this very topic at XML 2004 this November in Washington DC.)
A split-down-the-middle option could be to write a little function library
in the language of your choice to do the up-conversion string processing,
and call out to it from your XSLT using extension functions. (This is what
I kind of imagined would happen five years ago, but it turns out
processor-dependent extension functions are unfashionable these days.)
I'm not sure I'd
want to eliminate the intermediate XML file, though. There have been
times when I've needed to tweak it. For example, I have old files with
smart quotes not saved in UTF-8, and the perl script barfs on UTF-8 files,
so I do the XML conversion, open the file and re-save the XML as UTF-8.
I think having the intermediate format will prove to be good design in any
case. I was just reading that the complexity of a solution to a problem
generally increases in proportion to the square of the size of the problem
space, which is why breaking problems down into pieces works so well: if a
problem of size n costs n^2, its two halves cost 2 x (n/2)^2 = n^2/2.
(Don't ask me why those guys think this: it didn't say.)
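The open-and-re-save step you describe could also be scripted, so it needn't
be done by hand. A minimal sketch in Python, assuming the old files are in
Windows-1252 (a guess on my part, based on the smart quotes; if they came
from a Mac the encoding name would differ). The function name and paths are
hypothetical.

```python
def resave_as_utf8(src_path, dst_path, source_encoding="cp1252"):
    """Read a file in a legacy encoding and write it back out as UTF-8.

    source_encoding is an assumption: cp1252 is where Windows keeps its
    curly quotes, but check what your old files really use.
    """
    # newline="" keeps the original line endings untouched.
    with open(src_path, "r", encoding=source_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
```

If the file carries an XML declaration naming the old encoding, that would
need to be changed to match as well; this sketch does not touch it.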
Option 3 seems to be ruled out based on my current toolchain
(apache-FOP), which probably eliminates #2 as well. (I could easily be
wrong on this)
Apache Xalan-J has support for a node-set function, so you could use option
2 if you wanted. It will even recognize it in the exslt.org namespace,
which is nice.
Options 1 and 4 seem most like what the current process is. Currently, a
new XML file is generated only if its timestamp is older than that of the
text file it's transformed from.
So, my question (you knew there was one): can someone give me a
description of how to accomplish #4, given the workflow I've got, using
something like Saxon? I see that it's an XSLT processor, but I don't get
the map of how all the pieces fit together. Right now, I know (after
having looked) that I'm using xalan for the simple reason that it came
with my apache-fop install.
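The timestamp test you describe is the same check that make performs, and
it is worth keeping whichever processor you end up with. A minimal sketch in
Python, assuming one text file maps to one XML file; the function name is
hypothetical, and treating a missing XML file as stale is my addition.

```python
import os


def needs_rebuild(source_path, target_path):
    """Return True if target_path is missing or older than source_path."""
    if not os.path.exists(target_path):
        return True
    # Regenerate only when the derived file predates its source.
    return os.path.getmtime(target_path) < os.path.getmtime(source_path)
```

With an intermediate XML file in the chain, the same test applies at each
step: text to XML, then XML to whatever comes next.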
Saxon is well-liked by developers (it runs well, it's conformant, and it
has good error messages), and can be switched in for Xalan in your
toolchain if you prefer it. Saxon also supports exslt:node-set, so you can
use option #2 with it as well.
As I mentioned, it has an extension attribute, saxon:next-in-chain, that
can be invoked for pipelining. IIRC it passes SAX events between processor
invocations (Mike?), so it's much faster than writing a file and reparsing,
though perhaps not quite as fast as passing unserialized trees, as options
2 and 3 would do.
I am reasonably sure Xalan offers similar features, however, or the Cocoon
framework does.
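To show what "passing SAX events between stages" means, here is a small
analogy using Python's own SAX library. It is not Saxon, and I can't vouch
that Saxon works this way internally. Two filters are chained so that each
hands events straight to the next, and only the last stage serializes. The
class name and the element names (para, emph and so on) are made up for
illustration.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class RenameElement(XMLFilterBase):
    """A pipeline stage that renames one element and passes the rest on."""

    def __init__(self, parent, old, new):
        super().__init__(parent)
        self._old, self._new = old, new

    def startElement(self, name, attrs):
        super().startElement(self._new if name == self._old else name, attrs)

    def endElement(self, name):
        super().endElement(self._new if name == self._old else name)


def run_pipeline(xml_text):
    parser = xml.sax.make_parser()
    # Each stage reads events from the one before it; nothing is written
    # to disk or reparsed between stages.
    stage1 = RenameElement(parser, "para", "p")
    stage2 = RenameElement(stage1, "emph", "em")
    out = io.StringIO()
    stage2.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    stage2.parse(io.StringIO(xml_text))
    return out.getvalue()


if __name__ == "__main__":
    print(run_pipeline("<doc><para>Hi <emph>there</emph></para></doc>"))
```

The trade-off is the one described above: the document is parsed once and
serialized once, but each stage sees a stream of events rather than a tree.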
I'd also eventually like to get a decent RTF output. Standard manuscript
prose is not terribly complex, so something that supported basic features
should suffice for that. Unfortunately, the commercial options are too
expensive for the intended audience. Is jfor likely to be my best
available option?
I'd be interested to hear from the list on this question myself. I haven't
yet seen a really nice route to RTF. I think a two-pass approach
(analogous to the way IBM deployed a "TeXML" which could be targeted as a
route to TeX) might be the best way to do it: have yet another tag set that
describes only the formatting primitives supported by RTF and a utility
stylesheet to make RTF out of that. Or use XSL-FO, if any of the formatters
can make decent RTF yet.
I hope this helps!
Wendell
======================================================================
Wendell Piez mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================