[xsl] gibberish-to-unicode conversion

Subject: [xsl] gibberish-to-unicode conversion
From: "Birnbaum, David J" <djbpitt@xxxxxxxx>
Date: Mon, 25 Apr 2011 01:51:24 -0400

Dear XSL List,

Thanks for the quick responses to my inquiry about Unicode conversion. A few
thoughts:

> The main thing that comes to mind is: Did this need to be done in XSLT?

I had thought of doing the job using a general-purpose scripting language,
such as Python, but preferred an XSLT approach for the following reasons, the
first actual and the second more theoretical:

1. The PUA values in the input could be serialized as raw characters or as
numeric character references, the latter in decimal or hex. Matching on the
lexical (string) value with a general-purpose scripting language seems as if
it might be more complicated than matching with XSLT and XPath, where the
different lexical representations would all be recognized as equivalent when
the input was parsed prior to transformation.

2. In this project the conversion of the PUA values is unambiguous, which is
to say that wherever they occur, they should always be converted to the same
Unicode BMP values. Assuming issue #1 above could be resolved, I wouldn't need
access to the XML tree to perform the conversion, which means that a
general-purpose scripting language would do the job. With a more general
solution in mind, though, I was thinking of similar conversion projects where,
for example, instead of PUA characters the input XML might use 7-bit ASCII to
represent both real 7-bit ASCII values (letters of the Latin script) and, say,
Cyrillic, so that <span writing="latin">a</span> would represent a Latin Small
Letter A (U+0061) and <span writing="cyrillic">a</span> would also contain a
lexical U+0061, but in this context it would be intended to represent (and
would need to be converted to) a Cyrillic Small Letter A (U+0430). An XSLT
approach lets me use XPath to maintain the state of the writing system,
converting text nodes inside an element differently depending on the value of
the @writing attribute on the parent element.
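
The two points above can be illustrated with a minimal sketch. Python's
standard-library ElementTree is used only because Python is the scripting
language mentioned above. U+E000, the two-letter table, and the names
parsed_values and convert_by_writing are hypothetical stand-ins, not values
from the project. The first part shows that the three lexical forms differ as
raw strings but compare equal once parsed; the second converts text according
to the @writing attribute on the parent element.

```python
import xml.etree.ElementTree as ET

# Point 1: one PUA character written three ways: raw, decimal reference,
# hex reference. U+E000 is an arbitrary example, not a project value.
SPELLINGS = ["<w>\ue000</w>", "<w>&#57344;</w>", "<w>&#xE000;</w>"]


def parsed_values(documents):
    """Return the text content of each document's root after XML parsing."""
    return [ET.fromstring(doc).text for doc in documents]


# Point 2: hypothetical partial table mapping ASCII letters to Cyrillic.
LATIN_TO_CYRILLIC = str.maketrans({"a": "\u0430", "b": "\u0431"})


def convert_by_writing(root):
    """Convert text whose parent element has writing="cyrillic", in place."""
    for el in root.iter():
        if el.get("writing") != "cyrillic":
            continue
        # In ElementTree, an element's text-node children are its own .text
        # plus the .tail of each of its child elements.
        if el.text:
            el.text = el.text.translate(LATIN_TO_CYRILLIC)
        for child in el:
            if child.tail:
                child.tail = child.tail.translate(LATIN_TO_CYRILLIC)
    return root
```

Calling parsed_values(SPELLINGS) should give three identical one-character
strings, and convert_by_writing leaves the "a" in a writing="latin" span
untouched while changing the one in a writing="cyrillic" span.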

> Since you appear to be using XSLT 2.0, it seems like character maps would be
> the best solution XSLT has to offer ...

Yes, I'm using XSLT 2.0, and I had never thought of character maps (which I've
never used at all before, so I'm especially grateful for being reminded of
their existence). A quick look in Michael Kay's book confirms that a character
map would let me write out the markup easily, but as far as I can tell, the
value of the @character attribute in an <xsl:output-character> element must
be a single character, so in a scenario where I may need to convert, say, "ab"
to "x<sup>y</sup>z", I can't specify "ab" as the value of the @character
attribute. (This wasn't part of my original spec, but it was one of the
additional considerations I introduced at the end, when I was mulling over how
to make the solution more generalizable.) I also wonder about the
philosophical implications of using a character map (forgive me, but as an
academic, I can't think about getting the job done without reflecting on
whether I'm doing it The Right Way). Character maps appear to be intended to
control serialization and to replace output escaping, which may not properly
be the business of the XML parser and the XSLT engine. Using them, especially
to generate tags, also creates an opportunity to produce output that is not
well-formed XML. I can be scrupulous about not doing that, of
course, but it feels a bit non-XSLTistic. That's not an argument against using
a character map when it gets the job done, of course, but I think this may be
why I never thought of trying to write out angle brackets and the like
directly, and was drawn instead to the <xsl:copy-of> strategy, where what I
was copying was well-balanced XML.

In any case, converting "ab" to "x<sup>y</sup>z" seems to be the thorniest
remaining issue, especially if it has to be used recursively (that is, if the
same input string has to be passed through several such mappings), since after
the first match the output is no longer just a string, and therefore can't be
scanned the same way as the original pure string input. Suggestions welcome,
of course, and thanks again to those who responded!
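
One possible way around the "no longer just a string" problem, offered as a
sketch rather than a tested solution: keep the working content as a sequence
of items, where plain strings are still scannable and elements are finished
markup, and let each mapping pass look only at the strings. The illustration
below is in Python (the scripting language mentioned earlier); the function
names and the sample mappings are hypothetical. It assumes every replacement
is well-balanced XML, and it deliberately does not rescan text inside
generated elements or match across item boundaries.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def fragment(markup):
    """Parse a well-balanced fragment into a list of str and Element items."""
    wrapper = ET.fromstring(f"<wrap>{markup}</wrap>")
    items = [wrapper.text] if wrapper.text else []
    for child in wrapper:
        tail, child.tail = child.tail, None  # detach trailing text
        items.append(child)
        if tail:
            items.append(tail)
    return items


def apply_mapping(items, needle, replacement):
    """Replace needle in string items; Element items pass through untouched."""
    out = []
    for item in items:
        if not isinstance(item, str):
            out.append(item)
            continue
        for i, part in enumerate(item.split(needle)):
            if i:  # every gap between parts marks one match of needle
                out.extend(fragment(replacement))
            if part:
                out.append(part)
    return out


def convert(text, mappings):
    """Run the (needle, replacement) mappings in order over a plain string."""
    items = [text]
    for needle, replacement in mappings:
        items = apply_mapping(items, needle, replacement)
    return items


def serialize(items):
    """Join items into markup, escaping the plain-string items."""
    return "".join(
        escape(i) if isinstance(i, str) else ET.tostring(i, encoding="unicode")
        for i in items
    )
```

For example, serialize(convert("abd", [("ab", "x<sup>y</sup>z"), ("d",
"<b>D</b>")])) runs two passes over the same input. Because the "x" and "z"
produced by the first mapping remain plain strings, later mappings can still
match inside them; whether that is desirable depends on the project.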

Cheers,

David
djbpitt@xxxxxxxx
