Subject: Re: [xsl] gibberish-to-unicode conversation
From: "Christopher R. Maden" <crism@xxxxxxxxx>
Date: Sat, 23 Apr 2011 22:34:53 -0400
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 04/23/2011 10:27 PM, Birnbaum, David J wrote:
> My question, then, after this long-winded exposition, is: How should
> I have conceptualized this task? I broke it down into three types of
> replacements and adopted a different strategy for each, and I started
> with the easiest (the one-to-one replacements). I then realized that
> the problem was more general (there are other possible types of
> mappings), and also that there were multiple ways to deal with some
> of the types of mapping. Finally, the problem begins with a text()
> node, but once a replacement inserts some markup, it's no longer just
> a text() node, so a recursive strategy that requires with a pristine
> text() node as input may become inapplicable as the replacements
> accrue.
>
> On the one hand, this is a one-off transformation for a particular
> project, and once it's done I'll never have to run it again, so
> efficiency of execution isn't a high priority. On the other hand,
> these kinds of gibberish-to-unicode remappings are very common in my
> world (legacy documents in unusual writing systems), and I really
> should think about the general problem type, instead of cobbling
> together a new ad hoc solution every time a new project crosses my
> desk. I'd be grateful for any advice.

The main thing that comes to mind is: Did this need to be done in XSLT?

While it's certainly possible, this very much smells like a job for
Perl (or Python, if you prefer) to me. That makes the many-to-many case
easier, as well.

If you were to run into a particular (ab)use of encoding repeatedly, you
could even implement it as an encoding module in Perl, and then just
read the input as being in that encoding and re-write it in UTF-8.

That all said, I think your approach was sound, insofar as XSLT was the
tool to use.
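
To make the scripting suggestion concrete, here is a minimal Python sketch of a single-pass, longest-match-first remapping. The mapping table and the names `LEGACY_TO_UNICODE` and `build_converter` are hypothetical, invented only for illustration; a real project would substitute its own legacy-font table. Because every key is matched in one pass over the input, replaced text is never rescanned, so one-to-one, many-to-one, and one-to-many mappings can share a single table.

```python
import re

# Hypothetical legacy-to-Unicode table (illustrative values only).
LEGACY_TO_UNICODE = {
    "a": "\u0430",         # one-to-one: a -> Cyrillic a
    "s": "\u0441",         # one-to-one: s -> Cyrillic es
    "h": "\u0445",         # one-to-one: h -> Cyrillic ha
    "sh": "\u0448",        # many-to-one: sh -> Cyrillic sha
    "q": "\u043a\u0432",   # one-to-many: q -> Cyrillic ka + ve
}


def build_converter(mapping):
    """Return a function that applies every mapping in a single pass."""
    if not mapping:
        # Nothing to replace: hand back an identity function.
        return lambda text: text
    # Longest keys first, so "sh" is tried before "s".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))

    def convert(text):
        # Each match is replaced once; the output is not rescanned.
        return pattern.sub(lambda m: mapping[m.group(0)], text)

    return convert


convert = build_converter(LEGACY_TO_UNICODE)
```

Characters that have no entry in the table pass through unchanged, so a call such as `convert(legacy_string)` can be applied to each text node's string value without touching the surrounding markup.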

~Chris
- -- 
Chris Maden, text nerd <URL: http://crism.maden.org/ >
"Those in power write the history, while those who suffer
write the songs." -- Frank Harte
GnuPG Fingerprint: C6E4 E2A9 C9F8 71AC 9724 CAA3 19F8 6677 0077 C319
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk2zjE0ACgkQGfhmdwB3wxl0GQCgvShXQhgMoyfMKXVpO0UgCYRw
O5wAoK56qcVpL6Lo8ZcJLXswxm5kuE+K
=2u3v
-----END PGP SIGNATURE-----