Dear XSLTians,
For a troff-to-XML/Unicode conversion I've implemented a strategy that 
produces the desired result but performs the conversion to Unicode 
slowly, and I would be grateful for advice about improving its efficiency.
I handle the conversion of the structural markup to XML first, so I 
wind up with all of my XML tagging in place, but the text strings still 
use troff escape sequences rather than Unicode. The text is almost all 
medieval Cyrillic, and most of the Cyrillic characters are represented 
in the troff by sequences of several ASCII characters. The strategy I 
adopted to convert the troff character encoding to Unicode was to create 
a mapping file for the troff-to-Unicode character correspondences. 
Here's a snippet (a single mapping correspondence):
<mapping>
  <troff>\(qb</troff>
  <unicode>P1</unicode>
</mapping>
I then wrote an XSLT script that reads the file of mappings and 
generates another XSLT script that will do the actual remapping. Here's 
a snippet of the generated XSLT script; this snippet is taken from 
within a template rule for text() nodes (the named template that gets 
called follows the snippet):
<!-- Each variable applies one mapping to the result of the previous one -->
<xsl:variable name="temp52">
  <xsl:call-template name="replacement">
    <xsl:with-param name="text">
      <xsl:value-of select="$temp51"/>
    </xsl:with-param>
    <xsl:with-param name="troff">\\\(\?s</xsl:with-param>
    <xsl:with-param name="unicode">Q	</xsl:with-param>
  </xsl:call-template>
</xsl:variable>
<xsl:variable name="temp53">
  <xsl:call-template name="replacement">
    <xsl:with-param name="text">
      <xsl:value-of select="$temp52"/>
    </xsl:with-param>
    <xsl:with-param name="troff">\\\(\?c</xsl:with-param>
    <xsl:with-param name="unicode">R</xsl:with-param>
  </xsl:call-template>
</xsl:variable>
. . .
<!-- Replaces every match of the regex $troff in $text with $unicode -->
<xsl:template name="replacement">
  <xsl:param name="text"/>
  <xsl:param name="troff"/>
  <xsl:param name="unicode"/>
  <xsl:value-of select="replace($text, $troff, $unicode)"/>
</xsl:template>
The program logic is that for each text node, the template rule passes 
the textual contents to a replace() function that replaces a troff 
encoding with the corresponding Unicode value. The replace() function is 
then called again with the next mapping. The textual content is passed 
along through repeated remappings, and when it emerges on the other end, 
all multi-character troff sequences have been replaced with Unicode 
characters. There are 64 such mappings. I use replace() only for places 
where a multi-character troff string has to be replaced by a single 
Unicode character; at the end of the series of calls to replace() I use 
translate() to do the remaining one-to-one mappings (there are 
approximately 50 of them) in a single function call. The order of the 
mappings is (obviously) important; I need to remap longer strings before 
shorter ones, since the shorter ones may be subcomponents of the longer 
ones. In particular, I can remap individual characters (the one-to-one 
mappings) only after I've taken care of all of the many-to-one ones.
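To make the generation step and the ordering constraint concrete, here 
is a minimal sketch of what a generator of this kind could look like. 
It is a simplified illustration under stated assumptions, not the 
actual script: the "out" prefix and its namespace URI are placeholders, 
the <mapping> elements are assumed to be children of the root element 
of the mappings file, the regex-escaping expression is an assumption, 
and the unicode values are assumed to contain no "$" or "\" (which 
replace() would treat specially in a replacement string). Only the 
text() rule and the named template are generated; the final 
translate() step for the one-to-one mappings is omitted.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:out="http://example.org/xslt-alias">

  <!-- "out" is a placeholder namespace (hypothetical, local to this sketch);
       elements in it are written to the result as XSLT elements. -->
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl"/>
  <xsl:output method="xml" indent="no"/>

  <!-- Assumption: <mapping> elements are children of the root element. -->
  <xsl:template match="/*">
    <!-- Only multi-character troff sequences are handled with replace() -->
    <xsl:variable name="multi" select="mapping[string-length(troff) gt 1]"/>
    <out:stylesheet version="2.0">

      <out:template match="text()">
        <out:variable name="temp0" select="string(.)"/>
        <xsl:for-each select="$multi">
          <!-- Longest sequences first, so that a shorter sequence cannot
               match inside a longer one -->
          <xsl:sort select="string-length(troff)" order="descending"/>
          <out:variable name="temp{position()}">
            <out:call-template name="replacement">
              <out:with-param name="text" select="$temp{position() - 1}"/>
              <out:with-param name="troff">
                <!-- Backslash-escape regex metacharacters in the troff string -->
                <xsl:value-of select="replace(troff, '([\\\|\.\?\*\+\(\)\[\]\{\}\^\$])', '\\$1')"/>
              </out:with-param>
              <out:with-param name="unicode">
                <xsl:value-of select="unicode"/>
              </out:with-param>
            </out:call-template>
          </out:variable>
        </xsl:for-each>
        <!-- The last variable in the chain holds the fully remapped text -->
        <out:value-of select="$temp{count($multi)}"/>
      </out:template>

      <out:template name="replacement">
        <out:param name="text"/>
        <out:param name="troff"/>
        <out:param name="unicode"/>
        <out:value-of select="replace($text, $troff, $unicode)"/>
      </out:template>

    </out:stylesheet>
  </xsl:template>
</xsl:stylesheet>
```

With this arrangement the order of the entries in the mappings file 
would not matter, since the sort puts the longest troff sequences at 
the head of the chain and entries of equal length keep their document 
order.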
The input file (XML with troff character coding instead of the desired 
Unicode) is 6.7MB and the Unicode output is 7.8MB. The transformation 
takes approximately five minutes to run, which feels like an eternity, 
but I'm not sure to what extent the execution time reflects the size of 
the input file and the number of replacements that need to be 
performed, and to what extent it reflects inefficient program design. 
Can anyone suggest a revision that would provide a considerable 
improvement in efficiency (bearing in mind that the XSLT script that 
does the actual character remapping must be generated by XSLT from the 
mappings file)?
Thanks,
David
djbpitt+xml@xxxxxxxx