Re: [xsl] gibberish-to-unicode conversation

Subject: Re: [xsl] gibberish-to-unicode conversation
From: "Christopher R. Maden" <crism@xxxxxxxxx>
Date: Sat, 23 Apr 2011 22:34:53 -0400
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 04/23/2011 10:27 PM, Birnbaum, David J wrote:
> My question, then, after this long-winded exposition, is: How should
> I have conceptualized this task? I broke it down into three types of
> replacements and adopted a different strategy for each, and I started
> with the easiest (the one-to-one replacements). I then realized that
> the problem was more general (there are other possible types of
> mappings), and also that there were multiple ways to deal with some
> of the types of mapping. Finally, the problem begins with a text()
> node, but once a replacement inserts some markup, it's no longer just
> a text() node, so a recursive strategy that requires a pristine
> text() node as input may become inapplicable as the replacements
> accrue.
>
> On the one hand, this is a one-off transformation for a particular
> project, and once it's done I'll never have to run it again, so
> efficiency of execution isn't a high priority. On the other hand,
> these kinds of gibberish-to-unicode remappings are very common in my
> world (legacy documents in unusual writing systems), and I really
> should think about the general problem type, instead of cobbling
> together a new ad hoc solution every time a new project crosses my
> desk. I'd be grateful for any advice.

The main thing that comes to mind is: Did this need to be done in XSLT?
 While it's certainly possible, this very much smells like a job for
Perl (or Python, if you prefer) to me.  That makes the many-to-many case
easier, as well.
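
To make the many-to-many point concrete, here is a minimal Python sketch.
The mapping table is invented for illustration (it is not your actual
legacy encoding), and to_unicode is a hypothetical name. The idea is to
try the longest legacy sequences first, so that a multi-character key
wins over any shorter key that is its prefix.

```python
import re

# Hypothetical legacy-to-Unicode table, made up for illustration only.
MAPPING = {
    "a": "\u0430",        # one-to-one
    "j": "\u0458",        # one-to-one
    "ja": "\u044f",       # many-to-one: must be tried before "j" and "a"
    "w": "\u0448\u0442",  # one-to-many
}

# Sort keys longest first so the regex alternation prefers "ja" over "j".
PATTERN = re.compile(
    "|".join(re.escape(key) for key in sorted(MAPPING, key=len, reverse=True))
)


def to_unicode(text):
    """Replace every mapped legacy sequence; leave everything else alone."""
    return PATTERN.sub(lambda match: MAPPING[match.group(0)], text)
```

Because everything happens in a single pass over the string, the output of
one replacement is never fed back in as input to another, which sidesteps
the ordering problems of chained replace calls.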

If you were to run into a particular (ab)use of encoding repeatedly, you
could even implement it as an encoding module in Perl, and then just
read the input as being in that encoding and re-write it in UTF-8.
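
The same idea can be sketched in Python rather than Perl, using the
standard codecs registry. This is only an outline under assumptions: the
codec name "legacy_font", the byte table, and the helper recode are all
hypothetical, and unmapped bytes are simply passed through as if they
were Latin-1, which may not be what a real project wants.

```python
import codecs

# Hypothetical byte-to-Unicode table for an abused single-byte encoding.
TABLE = {
    0x61: "\u0430",
    0x62: "\u0431",
    0x77: "\u0448\u0442",  # one byte may expand to several characters
}


def decode(data, errors="strict"):
    # Assumption: bytes missing from TABLE pass through as Latin-1.
    chars = [TABLE.get(byte, chr(byte)) for byte in bytes(data)]
    return "".join(chars), len(data)


def encode(text, errors="strict"):
    # Only the decoding direction is needed for a one-way conversion.
    raise NotImplementedError("legacy_font is decode-only in this sketch")


def search(name):
    # The registry calls this with a normalized (lowercased) codec name.
    if name == "legacy_font":
        return codecs.CodecInfo(name=name, encode=encode, decode=decode)
    return None


codecs.register(search)


def recode(raw):
    """Read bytes as the legacy encoding and re-write them as UTF-8."""
    return raw.decode("legacy_font").encode("utf-8")
```

Once the search function is registered, raw.decode("legacy_font") behaves
like any other decode call. Note that this sketch supplies no incremental
decoder, so it handles whole byte strings rather than streamed file reads.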

That all said, I think your approach was sound, insofar as XSLT was the
tool to use.

~Chris
- --
Chris Maden, text nerd  <URL: http://crism.maden.org/ >
"Those in power write the history, while those who suffer
 write the songs." -- Frank Harte
GnuPG Fingerprint: C6E4 E2A9 C9F8 71AC 9724 CAA3 19F8 6677 0077 C319
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk2zjE0ACgkQGfhmdwB3wxl0GQCgvShXQhgMoyfMKXVpO0UgCYRw
O5wAoK56qcVpL6Lo8ZcJLXswxm5kuE+K
=2u3v
-----END PGP SIGNATURE-----
