Subject: Re: [xsl] Advice on dictionary conversion
From: Liam R E Quin <liam@xxxxxx>
Date: Tue, 18 Jan 2011 12:55:09 -0500
[oops, forgot to send this!]

On Mon, 2011-01-17 at 20:14 +0000, Ciaran S Duibhmn wrote:
> I wish to convert a bilingual dictionary from MS-Word format to
> "properly"-tagged XML, and I hope I may ask for some comment on the
> feasibility of this, using XSLT or otherwise.

I've done a lot of this in the past, too, and still do sometimes.
As others have said, it can be very time-consuming.

I'd be tempted to try to go via the new XML Word format, although,
since I don't have access right now to a recent copy of MS Word,
I don't know how good it is in practice. I'm guessing it's at least as
ugly as the OpenOffice format, *but* the good thing is, all the
information will be there. Most conversion programs will occasionally
make mistakes and lose some formatting.

Again as others have said, use a pipeline of small tasks. Tie them
together with Make or ant or a shell script that does a check at each
stage and quits on errors. Mine are usually (after a practice of Kate
Hamilton) called "runme" if they are shell scripts, and "makefile" for
use with make.

I use
    xmllint --noout somefile.xml || exit 1
in shell scripts, after each stage.

Typical tasks might be

* convert the Word file :-)

* normalise, e.g. to remove irrelevant output from the converter,
  and to make the next step as easy as possible...

* identify the start of each article or entry in the dictionary, and
  the primary word or phrase defined; the output of this should have
  each entry marked up as
      <entry><head>Word being defined</head>
      more stuff here
      </entry>

* add an XML id attribute to identify each entry (I often end up doing
  this step in Perl, using a hash, although streaming XSLT 3 with
  grouping will make it easier in the future I expect)

* identify any dictionary entries that are out of order. For each one,
  either one of the scripts went wrong (most likely), or you need to
  add an exception to the checker, or, if it's an option, you can move
  the entry in the dictionary to the right place.
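As a rough illustration of the last two tasks in the list above (adding ids and spotting out-of-order entries), here is a minimal Python sketch. It is not the Perl or XSLT 3 approach mentioned in the message. The sample entries, the function names, the "e-" id prefix and the plain `id` attribute are all assumptions made for the example, and `check_well_formed` is only a stand-in for the `xmllint --noout` check.

```python
import re
import xml.etree.ElementTree as ET
from collections import defaultdict

# Made-up sample data; not taken from the dictionary under discussion.
SAMPLE = """<dictionary>
<entry><head>bean</head> woman</entry>
<entry><head>cat</head> cat</entry>
<entry><head>capall</head> horse</entry>
<entry><head>cat</head> (a second entry for the same headword)</entry>
</dictionary>"""


def check_well_formed(xml_text):
    """Stand-in for `xmllint --noout`: raises ET.ParseError on bad XML."""
    return ET.fromstring(xml_text)


def add_ids(root):
    """Give every <entry> an id derived from its <head> text."""
    seen = defaultdict(int)  # plays the role of the Perl hash
    for entry in root.iter("entry"):
        head = entry.findtext("head", default="").strip()
        # Collapse anything that is not a letter, digit or underscore.
        slug = re.sub(r"\W+", "-", head.casefold()).strip("-") or "entry"
        seen[slug] += 1
        count = seen[slug]
        # The "e-" prefix (an assumption) keeps ids from starting with a digit.
        entry.set("id", f"e-{slug}" if count == 1 else f"e-{slug}-{count}")
    return root


def out_of_order(root):
    """Return headwords that sort before the entry preceding them.

    Uses plain code-point comparison after casefolding, which is only
    an approximation of real dictionary collation.
    """
    heads = [e.findtext("head", default="").strip() for e in root.iter("entry")]
    return [b for a, b in zip(heads, heads[1:]) if b.casefold() < a.casefold()]


root = add_ids(check_well_formed(SAMPLE))
ids = [entry.get("id") for entry in root.iter("entry")]
problems = out_of_order(root)
print(ids)       # ['e-bean', 'e-cat', 'e-capall', 'e-cat-2']
print(problems)  # ['capall']
```

Each function is one small stage, so a "runme"-style driver could call them in sequence and stop at the first exception, in the same spirit as `|| exit 1`. A real checker would also need the exception list mentioned above and a collation suited to the language of the headwords.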
Important - if you for any reason change the Word file, keep the
original!!!! The same applies if you use an online converter and then
edit its output by hand.

> The main problem is my ignorance of XSLT, although I am an experienced
> general programmer. A particular difficulty is that "italics" (for example)
> might be used for more than one part of the dictionary entry. However the
> choice of which tag to replace it with might well be decided by the target
> DTD (if I were to formulate it). Is this an example of what people
> sometimes refer to on this list as "schema-aware XSLT"? If so, I have no
> idea how to make my XSLT schema-aware.

It's more likely an example of being context sensitive, and you might
end up with logic like,

    if we're in the body of an entry {
        if the word "Example:" or "Examples:" occurs in bold
           after this italic element {
            we have not yet reached the examples, so it's something else
        } else {
            it's probably an example
        }
    } else {
        it's a qualifier on the headword, or we're defining a phrase
    }

You may find it helpful to use markup like,
    <i role="example">...</i>
or
    <i role="example" why="script6:rule14 inExamples">...</i>
The second form can make it *much* easier to debug everything.

> Another problem is that the dictionary contains quite a few "mistakes" which
> are all but invisible in Word, eg. a single space might be inadvertently
> bolded in an unbold field. This sort of thing is faithfully copied by a
> converter and complicates the starting XML unnecessarily, of course.

One possibility is to fix some such errors in a COPY of the Word file --
I have "input-handedited.txt" for one of my conversion projects.
If there are many such errors, maybe when you are more familiar with
the technologies you can write a script to fix most of them.

> I would be grateful for advice as to how best to proceed. I took on this
> job as a favour, hoping it would help me to learn something of these
> technologies,

It can be a lot of work. Watch out!
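For readers who want to see the context-sensitive italics logic from earlier in this message in runnable form, here is a minimal Python sketch. The `<body>` wrapper element, the role names "qualifier" and "other", the function names and the `why` values are all assumptions made for the example; only the `role`/`why` attribute idea and the "Example:"/"Examples:" test come from the message.

```python
import xml.etree.ElementTree as ET

EXAMPLE_LABELS = {"Example:", "Examples:"}

# Made-up entry; <body> is a hypothetical wrapper around everything
# that follows the headword.
SAMPLE = (
    "<entry><head>cat <i>n.</i></head>"
    "<body><i>fem.</i> a cat <b>Examples:</b> <i>cat dubh</i></body></entry>"
)


def mark(element, role, why):
    """Record the decision and which rule made it, to ease debugging."""
    element.set("role", role)
    element.set("why", why)


def classify_italics(entry):
    """Annotate each <i> in the entry according to where it occurs."""
    head = entry.find("head")
    if head is not None:
        # Not in the body: a qualifier on the headword (or a phrase).
        for i_el in head.iter("i"):
            mark(i_el, "qualifier", "sketch:rule1 notInBody")

    body = entry.find("body")
    siblings = list(body) if body is not None else []
    for pos, el in enumerate(siblings):
        if el.tag != "i":
            continue
        # Does a bold "Example:" / "Examples:" label occur after this italic?
        label_follows = any(
            later.tag == "b" and (later.text or "").strip() in EXAMPLE_LABELS
            for later in siblings[pos + 1:]
        )
        if label_follows:
            # We have not yet reached the examples, so it's something else.
            mark(el, "other", "sketch:rule2 beforeExamples")
        else:
            mark(el, "example", "sketch:rule3 inExamples")
    return entry


entry = classify_italics(ET.fromstring(SAMPLE))
roles = [(i_el.text, i_el.get("role")) for i_el in entry.iter("i")]
print(roles)  # [('n.', 'qualifier'), ('fem.', 'other'), ('cat dubh', 'example')]
```

The sketch only looks at italics that are direct children of `<body>`; nested formatting, or an entry with no "Examples:" label at all, would need further rules, each with its own `why` value.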
But it can be very rewarding, too.

Along with all the low-level advice, don't forget to be very clear
about the goal -- is it to make a semantically marked-up database for
querying, e.g. for linguists to use, or to make something that looks
more or less the same when you print it out?

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org