[xsl] gibberish-to-unicode conversation

Subject: [xsl] gibberish-to-unicode conversation
From: "Birnbaum, David J" <djbpitt@xxxxxxxx>
Date: Sat, 23 Apr 2011 22:27:21 -0400

Dear XSLT list,

I would be grateful for some advice about how to conceptualize a project that
involves remapping the textual characters in an XML document using XSLT. Here
are the details:

Input: XML with text nodes that are encoded using (or, rather, abusing) the
Unicode Private Use Area (PUA). The original content creators ignored the
entire existing Unicode inventory and mapped every text character to something
in the PUA. (They had their reasons, but they were misguided. Damage done.) In
most cases their individual PUA characters have individual counterparts in the
Unicode Base Multilingual Plane (BMP). In some cases, though, what they
encoded as an individual PUA character needs to be replaced by more than one
BMP character, and in other cases the replacement also has to incorporate
markup. See below for details.

Desired output: XML with the PUA text remapped to appropriate Unicode BMP
values, with any necessary markup inserted.

Mappings: There are at least three types of relationships (mappings) between
the PUA text in the original and the Unicode BMP needed in the output:

1. One to one. A single PUA character should be replaced by a single Unicode
BMP character.

2. One to many. A single PUA character should be replaced by two or more
Unicode BMP characters. No additional markup is inserted.

3. Markup mapping. One PUA character is remapped to one or more Unicode BMP
characters, but with inserted markup (see example below).

The mapping file that specifies what needs to be replaced by what looks like
the following:

<mappings>
  <mapping>
    <original>a</original>
    <unicode>x</unicode>
  </mapping>
  <!-- more one-to-one mappings -->
  <many>
    <mapping>
      <original>b</original>
      <unicode>yz</unicode>
    </mapping>
    <!-- more one-to-many mappings -->
  </many>
  <markup>
    <mapping>
      <original>p</original>
      <unicode>q<sup>r</sup></unicode>
    </mapping>
    <!-- more markup mappings -->
  </markup>
</mappings>

Individual <mapping> elements directly under the root <mappings> element are
one-to-one. The one-to-many <mapping> elements are grouped under <many>, which
is under <mappings>. The mappings that insert markup are grouped under
<markup>, which is also under <mappings>.

Possible strategies:

1. One to one. Concatenate the values into strings and use them in
translate(), e.g.:

<xsl:variable name="originals"
select="string-join(doc('mappings.xml')/mappings/mapping/original, '')"/>
<xsl:variable name="replacements"
select="string-join(doc('mappings.xml')/mappings/mapping/unicode, '')"/>

and then, later, after doing the more complicated type-2 and type-3
replacements, pass the output of the last of those replacements to:

translate($text,$originals,$replacements)

2. One to many. Use replace() recursively, iterating over the one-to-many
mapping pairs, and feeding the output of the final replace() operation into
the translate() function above as the value of $text.
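
A minimal sketch of how strategies 1 and 2 might fit together, assuming XSLT
2.0 and the mappings.xml layout shown above. The function name
my:replace-many and its namespace URI are hypothetical. The sketch also
assumes that no <original> value contains a regex metacharacter and that no
<unicode> value contains "$" or "\", since replace() treats its second
argument as a regular expression and its third as a replacement pattern:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/my"
    exclude-result-prefixes="xs my">

  <!-- Assumed location of the mapping file -->
  <xsl:variable name="map" select="doc('mappings.xml')/mappings"/>

  <!-- One-to-one pairs, concatenated into two parallel strings for translate() -->
  <xsl:variable name="originals"
      select="string-join($map/mapping/original, '')"/>
  <xsl:variable name="replacements"
      select="string-join($map/mapping/unicode, '')"/>

  <!-- Hypothetical helper: apply the one-to-many pairs one after another -->
  <xsl:function name="my:replace-many" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="pairs" as="element(mapping)*"/>
    <xsl:sequence select="
      if (empty($pairs)) then $text
      else my:replace-many(
             replace($text, $pairs[1]/original, $pairs[1]/unicode),
             subsequence($pairs, 2))"/>
  </xsl:function>

  <!-- Identity template: copy everything that is not a text node unchanged -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Type-2 replacements first, then type-1 via translate() -->
  <xsl:template match="text()">
    <xsl:value-of select="translate(my:replace-many(., $map/many/mapping),
                                    $originals, $replacements)"/>
  </xsl:template>

</xsl:stylesheet>
```

The ordering is safe only as long as the output of a one-to-many replacement
never contains a character that is itself a one-to-one original; with PUA
input and BMP output that should hold.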

These two pieces play well together, but the markup replacements (type 3)
complicate the picture. The first strategy that occurred to me was to start
the conversion with these, tokenize the text() node as individual characters,
look each character up in the markup/mapping/original elements, and use
<xsl:copy-of> to effect the replacement. That is, pass the initial input
text() node to:

  <xsl:variable name="characters" select="for $i in string-to-codepoints(.)
return codepoints-to-string($i)"/>

This gives me a sequence of individual PUA characters. For each one I then do
the following:

<xsl:for-each select="$characters">
  <xsl:choose>
    <xsl:when test=". = document('mappings.xml')//markup/mapping/original">
      <xsl:copy-of
        select="document('mappings.xml')//markup/mapping[original eq
current()]/unicode/node()"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="."/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:for-each>

This is the first time I've ever seen <xsl:copy-of> used to copy something
other than the context node (or its children) in the document being
transformed; in this case it's copying the well-balanced XML from inside the
<unicode> element in mappings.xml, a different document. Is this as unusual as
I think, or have I just led a sheltered life? Or is it unusual because it's
wrong-headed?

In any case, once I seized on <xsl:copy-of> as a possible solution to
introducing markup as part of the replacement, I realized that I could also
have used it for the one-to-many mappings, since <xsl:copy-of
select="unicode/node()"/> returns the same result as <xsl:value-of
select="unicode"/> when <unicode> happens to contain only a single text node,
as it does in the one-to-many mappings. And the same would have worked for the
one-to-one mappings, as well, of course.
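
A sketch of what that uniform <xsl:copy-of> treatment might look like,
assuming XSLT 2.0 and the mappings.xml layout shown above. The key name
by-original and the variable name $map-doc are hypothetical. The key indexes
every <mapping> regardless of whether it sits under <mappings>, <many>, or
<markup>:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed location of the mapping file -->
  <xsl:variable name="map-doc" select="doc('mappings.xml')"/>

  <!-- Index all three mapping types by their <original> value -->
  <xsl:key name="by-original" match="mapping" use="original"/>

  <!-- Identity template for everything except text nodes -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <!-- Split the text node into single-character strings -->
    <xsl:for-each select="for $c in string-to-codepoints(.)
                          return codepoints-to-string($c)">
      <!-- The third argument is needed because the context item here is a
           string, not a node in the mapping document -->
      <xsl:variable name="hit" select="key('by-original', ., $map-doc)"/>
      <xsl:choose>
        <xsl:when test="$hit">
          <!-- Copies plain text and inserted markup alike -->
          <xsl:copy-of select="$hit[1]/unicode/node()"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="."/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```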

This raises another question about another possible complication. A more
general and robust solution would (should) also support many-to-many mappings,
possibly with inserted markup. In that case I can't just tokenize the string
into characters because sometimes a sequence of two or more characters will be
needed as the input value for the mapping pair. Is there a good way to cater
to that eventuality? <xsl:analyze-string> is unappealing because I'm not sure
how I would use it recursively, since once I've done a replacement that
inserts markup, I don't have a string any more, and I can't just pass the
result to another iteration of <xsl:analyze-string> without having it
converted to a string, with the loss of the markup I inserted.
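
One way the recursion problem might be sidestepped, offered only as a sketch:
build a single regular expression that is the alternation of every <original>
value, longest first, and make one pass with <xsl:analyze-string>, so that no
result ever has to be fed back in as a string. This assumes XSLT 2.0, the
mappings.xml layout shown above, and that no <original> value contains a
regex metacharacter (none is escaped here). The names by-original, $map-doc,
$sorted, and $regex are hypothetical:

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Assumed location of the mapping file -->
  <xsl:variable name="map-doc" select="doc('mappings.xml')"/>

  <xsl:key name="by-original" match="mapping" use="original"/>

  <!-- All originals, longest first, so multi-character keys are tried
       before any single character they begin with -->
  <xsl:variable name="sorted" as="xs:string*">
    <xsl:perform-sort select="$map-doc//mapping/original/string()">
      <xsl:sort select="string-length(.)" order="descending"/>
    </xsl:perform-sort>
  </xsl:variable>

  <!-- Alternation such as "ab|a|b|p"; metacharacters are NOT escaped -->
  <xsl:variable name="regex" select="string-join($sorted, '|')"/>

  <!-- Identity template for everything except text nodes -->
  <xsl:template match="@* | node()">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$regex}">
      <xsl:matching-substring>
        <!-- The matched string is the context item; look it up and copy
             the replacement, markup included -->
        <xsl:copy-of
            select="key('by-original', ., $map-doc)[1]/unicode/node()"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

Because every replacement happens in the same pass over the original text()
node, inserted markup is never converted back to a string.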

My question, then, after this long-winded exposition, is: How should I have
conceptualized this task? I broke it down into three types of replacements and
adopted a different strategy for each, and I started with the easiest (the
one-to-one replacements). I then realized that the problem was more general
(there are other possible types of mappings), and also that there were
multiple ways to deal with some of the types of mapping. Finally, the problem
begins with a text() node, but once a replacement inserts some markup, it's no
longer just a text() node, so a recursive strategy that requires a
pristine text() node as input may become inapplicable as the replacements
accrue.

On the one hand, this is a one-off transformation for a particular project,
and once it's done I'll never have to run it again, so efficiency of execution
isn't a high priority. On the other hand, these kinds of gibberish-to-unicode
remappings are very common in my world (legacy documents in unusual writing
systems), and I really should think about the general problem type, instead of
cobbling together a new ad hoc solution every time a new project crosses my
desk. I'd be grateful for any advice.

Cheers,

David
djbpitt@xxxxxxxx
