[xsl] Advice on dictionary conversion

Subject: [xsl] Advice on dictionary conversion
From: Ciarán Ó Duibhín <ciaran@xxxxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 17 Jan 2011 20:14:20 -0000
I wish to convert a bilingual dictionary from MS-Word format to "properly"-tagged XML, and I hope I may ask for some comment on the feasibility of this, using XSLT or otherwise.

First I found several programs which automatically convert the Word files to XSL-FO, from either .doc or .rtf. Of those I examined, my preferred one is the Novosoft converter (http://www.rtf-to-xml.com/). I painlessly converted the entire letter D using their online interface.

Now I have to replace the presentational tags by tags like <HEADWORD>, <EXPLANATION>, <EXAMPLE> etc. I tried doing this manually, but it is not practical. Besides, I have to start from scratch again for each new letter of the alphabet. I have zero experience of XSLT, but it seemed that an XSLT program might be what was needed. I started with XRay2 (really nice for a beginner in some ways) and have now moved on to the Essential XML Editor with Saxon. But progress has been minimal.

The main problem is my ignorance of XSLT, although I am an experienced general programmer. A particular difficulty is that "italics" (for example) might be used for more than one part of the dictionary entry. However the choice of which tag to replace it with might well be decided by the target DTD (if I were to formulate it). Is this an example of what people sometimes refer to on this list as "schema-aware XSLT"? If so, I have no idea how to make my XSLT schema-aware.
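To make the italics problem concrete, here is a minimal sketch in Python rather than XSLT, since that is closer to what I know. It assumes the converter wraps formatted runs in fo:inline elements carrying font-weight and font-style attributes. The sample entry, the GRAMMAR tag, and the rule "first italic run is a grammar label, later italic runs are examples" are all invented for illustration and are not taken from the real dictionary.

```python
import xml.etree.ElementTree as ET

FO = "http://www.w3.org/1999/XSL/Format"

# Invented sample entry; the real converter output may be structured differently.
SAMPLE = (
    f'<fo:block xmlns:fo="{FO}">'
    '<fo:inline font-weight="bold">dog</fo:inline> '
    '<fo:inline font-style="italic">n.</fo:inline> '
    'a domestic animal '
    '<fo:inline font-style="italic">the dog barked</fo:inline>'
    '</fo:block>'
)

def convert_entry(block):
    """Map presentational runs to semantic tags, using position to
    disambiguate italics (hypothetical rule: first italic run is a
    grammar label, any later italic run is an example)."""
    entry = ET.Element("ENTRY")
    seen_italic = False
    for inline in block.findall(f"{{{FO}}}inline"):
        if inline.get("font-weight") == "bold":
            tag = "HEADWORD"
        elif inline.get("font-style") == "italic":
            tag = "EXAMPLE" if seen_italic else "GRAMMAR"
            seen_italic = True
        else:
            tag = "EXPLANATION"
        text = (inline.text or "").strip()
        if text:
            ET.SubElement(entry, tag).text = text
        # Unformatted text following a run is treated as explanation.
        tail = (inline.tail or "").strip()
        if tail:
            ET.SubElement(entry, "EXPLANATION").text = tail
    return entry

entry = convert_entry(ET.fromstring(SAMPLE))
print(ET.tostring(entry, encoding="unicode"))
```

The point is only that the same presentational attribute maps to different target tags depending on what has already been seen in the entry; I imagine an XSLT solution would express the same test through the position or preceding siblings of the matched element.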

Another problem is that the dictionary contains quite a few "mistakes" which are all but invisible in Word, e.g. a single space might be inadvertently bolded in an otherwise unbold field. This sort of thing is faithfully copied by a converter and, of course, complicates the starting XML unnecessarily.
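A cleanup pass of the following kind is what I have in mind, sketched in Python under the assumption that an entry has already been reduced to a list of (style, text) runs. The run representation and the style names are hypothetical. The rule is that a whitespace-only run adopts the style of the run before it, after which adjacent runs of the same style are merged. This is only safe for styles such as bold or italic that are invisible on whitespace; it would not be right for underlining.

```python
def normalise_runs(runs):
    """Absorb whitespace-only runs into the preceding run's style,
    then merge adjacent runs that share a style."""
    cleaned = []
    for style, text in runs:
        if text.strip() == "" and cleaned:
            # Stray formatting on whitespace: adopt the previous style.
            style = cleaned[-1][0]
        if cleaned and cleaned[-1][0] == style:
            # Same style as the previous run: join the text.
            cleaned[-1] = (style, cleaned[-1][1] + text)
        else:
            cleaned.append((style, text))
    return cleaned

# Invented example: a bolded space in the middle of a plain field.
runs = [("bold", "dog"), ("plain", " a domestic"), ("bold", " "), ("plain", "animal")]
print(normalise_runs(runs))
```

Running such a pass before any tag replacement should leave far fewer runs to classify.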

I would be grateful for advice as to how best to proceed. I took on this job as a favour, hoping it would help me to learn something of these technologies, but it seems now there is too much to learn on one's own in any reasonably short space of time (XSLT is not for amateurs :-( ). Perhaps I should advise that the job be done professionally. Unless there is something I am missing...

On a related matter, I have recently discovered LIFT as a particular XML format for lexicographical work (http://code.google.com/p/lift-standard/). Any experience of that as a target format for XSLT would also be of interest.

Thanks,
Ciarán Ó Duibhín.

