Subject: Re: [xsl] gibberish-to-unicode conversation
From: Brandon Ibach <brandon.ibach@xxxxxxxxxxxxxxxxxxx>
Date: Sat, 23 Apr 2011 23:20:24 -0400

Since you appear to be using XSLT 2.0, it seems like character maps would be the best solution XSLT has to offer. For your examples, something like this might work (untested... YMMV):

  <xsl:output method="xml" encoding="UTF-8" use-character-maps="PUAtoBMP"/>
  <xsl:character-map name="PUAtoBMP">
    <xsl:output-character character="a" string="x"/>
    <xsl:output-character character="b" string="yz"/>
    <xsl:output-character character="p" string="q&lt;sup&gt;r&lt;/sup&gt;"/>
  </xsl:character-map>

An XSLT to transform your mapping file into a suitable character map should be relatively straightforward.

-Brandon :)

On Sat, Apr 23, 2011 at 10:27 PM, Birnbaum, David J <djbpitt@xxxxxxxx> wrote:
> Dear XSLT list,
>
> I would be grateful for some advice about how to conceptualize a project that involves remapping the textual characters in an XML document using XSLT. Here are the details:
>
> Input: XML with text nodes that are encoded using (or, rather, abusing) the Unicode Private Use Area (PUA). The original content creators ignored the entire existing Unicode inventory and mapped every text character to something in the PUA. (They had their reasons, but they were misguided. Damage done.) In most cases their individual PUA characters have individual counterparts in the Unicode Basic Multilingual Plane (BMP). In some cases, though, what they encoded as an individual PUA character needs to be replaced by more than one BMP character, and in other cases the replacement also has to incorporate markup. See below for details.
>
> Desired output: XML with the PUA text remapped to appropriate Unicode BMP values, with any necessary markup inserted.
>
> Mappings: There are at least three types of relationships (mappings) between the PUA text in the original and the Unicode BMP needed in the output:
>
> 1. One to one. A single PUA character should be replaced by a single Unicode BMP character.
>
> 2. One to many. A single PUA character should be replaced by two or more Unicode BMP characters.
> No additional markup is inserted.
>
> 3. Markup mapping. One PUA character is remapped to one or more Unicode BMP characters, but with inserted markup (see example below).
>
> The mapping file that specifies what needs to be replaced by what looks like the following:
>
> <mappings>
>   <mapping>
>     <original>a</original>
>     <unicode>x</unicode>
>   </mapping>
>   <!-- more one-to-one mappings -->
>   <many>
>     <mapping>
>       <original>b</original>
>       <unicode>yz</unicode>
>     </mapping>
>     <!-- more one-to-many mappings -->
>   </many>
>   <markup>
>     <mapping>
>       <original>p</original>
>       <unicode>q<sup>r</sup></unicode>
>     </mapping>
>     <!-- more markup mappings -->
>   </markup>
> </mappings>
>
> Individual <mapping> elements directly under the root <mappings> element are one-to-one. The one-to-many <mapping> elements are grouped under <many>, which is under <mappings>. The mappings that insert markup are grouped under <markup>, which is also under <mappings>.
>
> Possible strategies:
>
> 1. One to one. Concatenate the values into strings and use them in translate(), e.g.:
>
> <xsl:variable name="originals" select="string-join(doc('mappings.xml')/mappings/mapping/original, '')"/>
> <xsl:variable name="replacements" select="string-join(doc('mappings.xml')/mappings/mapping/unicode, '')"/>
>
> and then, later, after doing the more complicated type-2 and type-3 replacements, pass the output of the last of those replacements to:
>
> translate($text, $originals, $replacements)
>
> 2. One to many. Use replace() recursively, iterating over the one-to-many mapping pairs, and feeding the output of the final replace() operation into the translate() function above as the value of $text.
>
> These two pieces play well together, but the markup replacements (type 3) complicate the picture. The first strategy that occurred to me was to start the conversion with these, tokenize the text() node as individual characters, look each character up in the markup/mapping/original elements, and use <xsl:copy-of> to effect the replacement.
> That is, pass the initial input text() node to:
>
> <xsl:variable name="characters" select="for $i in string-to-codepoints(.) return codepoints-to-string($i)"/>
>
> This gives me a sequence of individual PUA characters. For each one I then do the following:
>
> <xsl:for-each select="$characters">
>   <xsl:choose>
>     <xsl:when test=". = document('mappings.xml')//markup/mapping/original">
>       <xsl:copy-of select="document('mappings.xml')//markup/mapping[original eq current()]/unicode/node()"/>
>     </xsl:when>
>     <xsl:otherwise>
>       <xsl:value-of select="."/>
>     </xsl:otherwise>
>   </xsl:choose>
> </xsl:for-each>
>
> This is the first time I've ever seen <xsl:copy-of> used to copy something other than the context node (or its children) in the document being transformed; in this case it's copying the well-balanced XML from inside the <unicode> element in mappings.xml, a different document. Is this as unusual as I think, or have I just led a sheltered life? Or is it unusual because it's wrong-headed?
>
> In any case, once I seized on <xsl:copy-of> as a possible solution to introducing markup as part of the replacement, I realized that I could also have used it for the one-to-many mappings, since <xsl:copy-of select="unicode/node()"/> returns the same result as <xsl:value-of select="unicode"/> when <unicode> happens to contain only a single text node, as it does in the one-to-many mappings. And the same would have worked for the one-to-one mappings, as well, of course.
>
> This raises another question about another possible complication. A more general and robust solution would (should) also support many-to-many mappings, possibly with inserted markup. In that case I can't just tokenize the string into characters because sometimes a sequence of two or more characters will be needed as the input value for the mapping pair. Is there a good way to cater to that eventuality?
> <xsl:analyze-string> is unappealing because I'm not sure how I would use it recursively, since once I've done a replacement that inserts markup, I don't have a string any more, and I can't just pass the result to another iteration of <xsl:analyze-string> without having it converted to a string, with the loss of the markup I inserted.
>
> My question, then, after this long-winded exposition, is: How should I have conceptualized this task? I broke it down into three types of replacements and adopted a different strategy for each, and I started with the easiest (the one-to-one replacements). I then realized that the problem was more general (there are other possible types of mappings), and also that there were multiple ways to deal with some of the types of mapping. Finally, the problem begins with a text() node, but once a replacement inserts some markup, it's no longer just a text() node, so a recursive strategy that requires a pristine text() node as input may become inapplicable as the replacements accrue.
>
> On the one hand, this is a one-off transformation for a particular project, and once it's done I'll never have to run it again, so efficiency of execution isn't a high priority. On the other hand, these kinds of gibberish-to-unicode remappings are very common in my world (legacy documents in unusual writing systems), and I really should think about the general problem type, instead of cobbling together a new ad hoc solution every time a new project crosses my desk. I'd be grateful for any advice.
>
> Cheers,
>
> David
> djbpitt@xxxxxxxx
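
The reply above notes that an XSLT to turn the mapping file into a character map should be relatively straightforward. One possible, untested sketch of such a generator follows. It assumes the mapping file has the structure shown in the quoted message, that the replacement markup carries no attributes, and that the replacement text contains no literal "<" or "&" of its own. The "out" prefix, its namespace URI, and the "as-text" mode name are arbitrary choices made for illustration only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:out="http://example.org/xsl-alias">

  <!-- Elements written with the (hypothetical) out: prefix are emitted
       as xsl: elements in the generated stylesheet. -->
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl"/>
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>

  <!-- Input is the mapping file; output is a stylesheet that copies its
       own input unchanged and applies the character map on serialization. -->
  <xsl:template match="/mappings">
    <out:stylesheet version="2.0">
      <out:output method="xml" encoding="UTF-8"
                  use-character-maps="PUAtoBMP"/>
      <out:character-map name="PUAtoBMP">
        <!-- One-to-one, one-to-many and markup mappings are all handled
             the same way, wherever they sit under <mappings>. -->
        <xsl:apply-templates select=".//mapping"/>
      </out:character-map>
      <!-- Identity template for the generated stylesheet. -->
      <out:template match="@* | node()">
        <out:copy>
          <out:apply-templates select="@* | node()"/>
        </out:copy>
      </out:template>
    </out:stylesheet>
  </xsl:template>

  <!-- Each <mapping> becomes one xsl:output-character entry. -->
  <xsl:template match="mapping">
    <xsl:variable name="replacement">
      <xsl:apply-templates select="unicode/node()" mode="as-text"/>
    </xsl:variable>
    <out:output-character character="{original}" string="{$replacement}"/>
  </xsl:template>

  <!-- Flatten replacement markup to a string: start tag, content, end tag.
       Assumption: attributes on replacement elements are not needed. -->
  <xsl:template match="*" mode="as-text">
    <xsl:value-of select="concat('&lt;', name(), '&gt;')"/>
    <xsl:apply-templates mode="as-text"/>
    <xsl:value-of select="concat('&lt;/', name(), '&gt;')"/>
  </xsl:template>

  <xsl:template match="text()" mode="as-text">
    <xsl:value-of select="."/>
  </xsl:template>

</xsl:stylesheet>
```

Because character maps are applied only when the result is serialized, the inserted markup (such as <sup>) exists as elements only in the serialized output file, not as nodes in the result tree. Each character attribute must also hold exactly one character, so this approach would not cover the many-to-many case raised in the quoted message.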