Re: [xsl] Advice on dictionary conversion

Subject: Re: [xsl] Advice on dictionary conversion
From: dvint@xxxxxxxxx
Date: Mon, 17 Jan 2011 13:34:20 -0800
Welcome to the world of MS-Word conversion. First, do you have a set of
tags you plan on converting to? You will need that no matter who does the
work.

XSLT is certainly one of the tools you would use. Your example of italics
being used for more than one purpose means you have to look at the context
the italics appear in. So you need to distinguish <italics> when the
parent is an <explanation> from <italics> when contained in an <example>.
You don't provide a sample of the converted output; hopefully you are
getting some nesting of tags to help you with the conversion.
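
As a rough sketch of what matching on context looks like, assuming the
converted input really does use elements named <italics>, <explanation>
and <example> (in whatever case your converter emits). The output names
<gloss> and <usage> are made up for illustration, not taken from any DTD:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything not handled below unchanged. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Italics whose parent is an explanation.
       <gloss> is a hypothetical target tag. -->
  <xsl:template match="explanation/italics">
    <gloss><xsl:apply-templates/></gloss>
  </xsl:template>

  <!-- Italics anywhere inside an example.
       <usage> is a hypothetical target tag. -->
  <xsl:template match="example//italics">
    <usage><xsl:apply-templates/></usage>
  </xsl:template>

</xsl:stylesheet>
```

Italics in any other context fall through to the identity template and
stay as they are, which makes the unhandled cases easy to spot in the
output. If an <italics> could satisfy both patterns at once, the two
rules would conflict, and you would add a priority attribute to one of
them.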

The next issue is usually how consistently the authors used the Word
styles. If they use styles in one place and some form of direct character
formatting in the next, you will run into other problems.

You may only get so far with XSLT and then have to switch to Perl, or
maybe use Perl first to fix something and then run XSLT. Another approach
is not to try to do everything at once. This is the bucket brigade
approach. You write small stylesheets that each do one (or a few) small
tasks, with the output of each being run through another stylesheet that
gets you closer to the final result. You string a bunch of these together
until you get what you want.
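
One link in such a chain might look like the sketch below. It assumes the
converter output marks bold text with an element named <bold> (a
hypothetical name; substitute whatever your converter produces) and does
just one job: it removes bold tags that wrap nothing but whitespace, the
kind of stray bolded space you describe.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: pass everything else through untouched. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- A <bold> with no child elements and only whitespace as text:
       keep the whitespace, drop the tag. An entirely empty <bold/>
       also matches and simply disappears. -->
  <xsl:template match="bold[not(*) and normalize-space(.) = '']">
    <xsl:apply-templates select="node()"/>
  </xsl:template>

</xsl:stylesheet>
```

The output of this pass becomes the input of the next stylesheet. Because
every pass starts from the identity template, each one stays short and can
be tested on its own.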

You sometimes get to a point where it is then time to do the final cleanup
by hand. If you have access to a native XML editor (Arbortext, Oxygen,
XMetaL), you can use it to help fix the remaining problems or point out
issues you need to address along the conversion path.

Your task is to learn both XSLT and the quirks of your content.

Also note that you don't have to process each letter individually. It
sounds like the Word files were split up this way. You can concatenate
them and just wrap the resulting single file in one master tag, like
this:
<myroot>
  ... contents of the file ...
</myroot>

That should get you a single file that you can process.
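
If you would rather not paste the files together by hand (with plain
concatenation, the XML declaration at the top of each file has to be
removed first), a small stylesheet can do the merge. This is only a
sketch: it assumes you write a manifest file listing the per-letter
files, and the <files>/<file href="..."> layout is invented here for
the purpose.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="no"/>

  <!-- Input is a hypothetical manifest such as:
         <files>
           <file href="a.xml"/>
           <file href="b.xml"/>
         </files>
       Each href is resolved relative to the manifest itself. -->
  <xsl:template match="/files">
    <myroot>
      <xsl:for-each select="file">
        <!-- Copy the document element of each listed file,
             in manifest order. -->
        <xsl:copy-of select="document(@href)/*"/>
      </xsl:for-each>
    </myroot>
  </xsl:template>

</xsl:stylesheet>
```

You run the stylesheet against the manifest rather than against the
dictionary files, and each listed file has to be well-formed XML on its
own.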

..dan

> I wish to convert a bilingual dictionary from MS-Word format to
> "properly"-tagged XML, and I hope I may ask for some comment on the
> feasibility of this, using XSLT or otherwise.
>
> First I found several programs which automatically convert the Word
> files to FO:XSL, either from .doc or .rtf.  My preferred one of those I
> examined is the Novosoft converter (http://www.rtf-to-xml.com/).  I
> painlessly converted the entire letter D using their online interface.
>
> Now I have to replace the presentational tags by tags like <HEADWORD>,
> <EXPLANATION>, <EXAMPLE> etc.  I tried doing this manually, but it is
> not practical.  Besides, I have to start from scratch again for each new
> letter of the alphabet.  I have zero experience of XSLT, but it seemed
> that an XSLT program might be what was needed.  I started with XRay2
> (really nice for a beginner in some ways) and have now moved on to the
> Essential XML Editor with Saxon.  But progress has been minimal.
>
> The main problem is my ignorance of XSLT, although I am an experienced
> general programmer.  A particular difficulty is that "italics" (for
> example) might be used for more than one part of the dictionary entry.
> However the choice of which tag to replace it with might well be decided
> by the target DTD (if I were to formulate it).  Is this an example of
> what people sometimes refer to on this list as "schema-aware XSLT"?  If
> so, I have no idea how to make my XSLT schema-aware.
>
> Another problem is that the dictionary contains quite a few "mistakes"
> which are all but invisible in Word, eg. a single space might be
> inadvertently bolded in an unbold field.  This sort of thing is
> faithfully copied by a converter and complicates the starting XML
> unnecessarily, of course.
>
> I would be grateful for advice as to how best to proceed.  I took on
> this job as a favour, hoping it would help me to learn something of
> these technologies, but it seems now there is too much to learn on one's
> own in any reasonable short space of time (XSLT is not for amateurs :-(.
> Perhaps I should advise to have the job done professionally.  Unless
> there is something I am missing...
>
> On a related matter, I have recently discovered LIFT as a particular XML
> format for lexicographical work (http://code.google.com/p/lift-standard/)
> Any experience of that as a target format for XSLT would also be of
> interest.
>
> Thanks,
> Ciaran S Duibhmn.
