Subject: [xsl] gibberish-to-unicode conversion
From: "Birnbaum, David J" <djbpitt@xxxxxxxx>
Date: Mon, 25 Apr 2011 01:51:24 -0400
Dear XSL List,

Thanks for the quick responses to my inquiry about Unicode conversion. A few thoughts:

> The main thing that comes to mind is: Did this need to be done in XSLT?

I had thought of doing the job using a general-purpose scripting language, such as Python, and preferred an XSLT approach for the following reasons, the first actual and the second more theoretical:

1. The PUA values in the input could be serialized as raw characters or as numerical character references, the latter in decimal or hex. Matching on the lexical (string) value with a general-purpose scripting language seems as if it might be more complicated than matching with XSLT and XPath, where the different lexical representations would all be recognized as equivalent when the input was parsed prior to transformation.

2. In this project the conversion of the PUA values is unambiguous, which is to say that wherever they occur, they should always be converted to the same Unicode BMP values. Assuming issue #1 above could be resolved, I wouldn't need access to the XML tree to perform the conversion, which means that a general-purpose scripting language would do the job. With a more general solution in mind, though, I was thinking of similar conversion projects where, for example, instead of PUA characters the input XML might use 7-bit ASCII to represent both real 7-bit ASCII values (letters of the Latin script) and, say, Cyrillic, so that <span writing="latin">a</span> would represent a Latin Small Letter A (U+0061) and <span writing="cyrillic">a</span> would also contain a lexical U+0061, but in this context it would be intended to represent (and would need to be converted to) a Cyrillic Small Letter A (U+0430). An XSLT approach lets me use XPath to maintain the state of the writing system, converting text nodes inside an element differently depending on the value of the @writing attribute on the parent element.
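The context-dependent conversion described in point 2 might be sketched roughly as follows. This is only an illustration under stated assumptions: the source confirms just the mapping of "a" to U+0430, so the "b" and "v" entries are hypothetical, and translate() handles only one-character-to-one-character replacements.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Identity template: copy everything that is not handled below -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Text whose parent is marked writing="cyrillic": map ASCII letters to
       Cyrillic ones. Only a -> U+0430 comes from the example in the message;
       b -> U+0431 and v -> U+0432 are hypothetical additions. -->
  <xsl:template match="*[@writing = 'cyrillic']/text()" priority="1">
    <xsl:value-of select="translate(., 'abv', '&#x430;&#x431;&#x432;')"/>
  </xsl:template>

</xsl:stylesheet>
```

With this stylesheet, text inside <span writing="latin"> falls through to the identity template and is left alone, while the same lexical "a" inside <span writing="cyrillic"> is rewritten as U+0430.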
> Since you appear to be using XSLT 2.0, it seems like character maps would be the best solution XSLT has to offer ...

Yes, I'm using XSLT 2.0, and I had never thought of character maps (which I've never used at all before, so I'm especially grateful for being reminded of their existence). A quick look in Michael Kay's book confirms that a character map would let me write out the markup easily, but as far as I can tell, the value of the @character attribute in an <xsl:output-character> element must be a single character, so in a scenario where I may need to convert, say, "ab" to "x<sup>y</sup>z", I can't specify "ab" as the value of the @character attribute. (This wasn't part of my original spec, but it was one of the additional considerations I introduced at the end, when I was mulling over how to make the solution more generalizable.)

I also wonder about the philosophical implications of using a character map (forgive me, but as an academic, I can't think about getting the job done without reflecting on whether I'm doing it The Right Way). Character maps are intended, it appears, to control serialization and as a replacement for output escaping, which may not be properly the business of the XML parser and the XSLT engine, but using them, especially to generate tags, creates an opportunity to produce output that is not well-formed XML. I can be scrupulous about not doing that, of course, but it feels a bit non-XSLTistic. That's not an argument against using a character map when it gets the job done, of course, but I think this may be why I never thought of trying to write out angle brackets and the like directly, and was drawn instead to the <xsl:copy-of> strategy, where what I was copying was well-balanced XML.
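For reference, a character map of the kind under discussion could look something like the sketch below. The PUA code points U+E000 and U+E001 and their replacements are hypothetical placeholders, not values from the actual project.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- The map is applied only when the result tree is serialized -->
  <xsl:output method="xml" use-character-maps="pua"/>

  <xsl:character-map name="pua">
    <!-- @character must be exactly one character; @string may be longer.
         U+E000 and U+E001 are hypothetical PUA code points. -->
    <xsl:output-character character="&#xE000;" string="&#x430;"/>
    <!-- The string is written out unescaped, so this emits real markup.
         Nothing checks that the result is well-formed. -->
    <xsl:output-character character="&#xE001;"
                          string="x&lt;sup&gt;y&lt;/sup&gt;z"/>
  </xsl:character-map>

  <!-- Identity template: the tree itself is copied unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because the input is parsed before the transformation runs, the map applies whether a PUA character was written raw or as a decimal or hex reference. The single-character restriction on @character is what rules this approach out for a source string such as "ab".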
In any case, converting "ab" to "x<sup>y</sup>z" seems to be the thorniest remaining issue, especially if it has to be used recursively (that is, if the same input string has to be passed through several such mappings), since after the first match the output is no longer just a string, and therefore can't be scanned the same way as the original pure string input.

Suggestions welcome, of course, and thanks again to those who responded!

Cheers,

David
djbpitt@xxxxxxxx
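One possible way to approach the multi-pass problem, offered only as a sketch: keep each intermediate result as a temporary tree and let every pass rewrite just the text nodes of the previous pass's output, using <xsl:analyze-string> to find matches and <xsl:copy-of> to emit well-balanced replacements. The mapping table, the my: namespace, and the second mapping ("z" to "Z") are hypothetical. The sketch also assumes that each @from value is non-empty and contains no regular-expression metacharacters.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:my="http://example.com/my"
    exclude-result-prefixes="xs my">

  <!-- Hypothetical mapping table; each entry replaces a literal string
       with well-balanced XML. Entries are applied in document order. -->
  <xsl:variable name="mappings">
    <map from="ab">x<sup>y</sup>z</map>
    <map from="z">Z</map>
  </xsl:variable>

  <!-- Apply mapping number $i to every text node in $nodes, then recurse
       on the resulting temporary tree with mapping $i + 1 -->
  <xsl:function name="my:convert" as="node()*">
    <xsl:param name="nodes" as="node()*"/>
    <xsl:param name="i" as="xs:integer"/>
    <xsl:variable name="map" select="$mappings/map[$i]"/>
    <xsl:choose>
      <xsl:when test="empty($map)">
        <xsl:sequence select="$nodes"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:variable name="pass">
          <xsl:apply-templates select="$nodes" mode="pass">
            <xsl:with-param name="map" select="$map" tunnel="yes"/>
          </xsl:apply-templates>
        </xsl:variable>
        <xsl:sequence select="my:convert($pass/node(), $i + 1)"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:function>

  <!-- Within a pass, markup produced by earlier passes is copied as is -->
  <xsl:template match="@*|node()" mode="pass">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()" mode="pass"/>
    </xsl:copy>
  </xsl:template>

  <!-- Within a pass, only text nodes are scanned for the current mapping -->
  <xsl:template match="text()" mode="pass" priority="1">
    <xsl:param name="map" tunnel="yes"/>
    <xsl:analyze-string select="." regex="{$map/@from}">
      <xsl:matching-substring>
        <xsl:copy-of select="$map/node()"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

  <!-- Default mode: identity, except that text nodes go through all passes -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="text()" priority="1">
    <xsl:sequence select="my:convert(., 1)"/>
  </xsl:template>

</xsl:stylesheet>
```

With these two mappings, the text "cab" becomes c, x, <sup>y</sup>, z after the first pass, and the second pass then rewrites the trailing "z", so the mappings cascade. A limitation of this design is that a later mapping cannot match a string that straddles markup introduced by an earlier pass, because each text node is scanned separately.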