[xsl] efficiency and replace()

Subject: [xsl] efficiency and replace()
From: David J Birnbaum <djbpitt+xml@xxxxxxxx>
Date: Sun, 10 Sep 2006 11:19:47 -0400

Dear XSLTians,

For a troff-to-XML/Unicode conversion, I've implemented a strategy that produces the desired result but performs the conversion to Unicode slowly, and I would be grateful for advice on improving its efficiency.

I handle the structural conversion to XML first, so all of my XML tagging is in place, but the text strings still use troff escape sequences rather than Unicode. The text is almost all medieval Cyrillic, and most of the Cyrillic characters are represented in the troff by sequences of several ASCII characters. To convert the troff character encoding to Unicode, I created a mapping file of troff-to-Unicode character correspondences. Here's a snippet (a single mapping):

<mapping>
  <troff>\(qb</troff>
  <unicode>б</unicode>
</mapping>

I then wrote an XSLT script that reads the mapping file and generates a second XSLT script, which does the actual remapping. Here's a snippet of the generated script, taken from within a template rule for text() nodes (the named template that it calls follows the snippet):

<xsl:variable name="temp52">
  <xsl:call-template name="replacement">
    <xsl:with-param name="text">
      <xsl:value-of select="$temp51"/>
    </xsl:with-param>
    <xsl:with-param name="troff">\\\(\?s</xsl:with-param>
    <xsl:with-param name="unicode">щ</xsl:with-param>
  </xsl:call-template>
</xsl:variable>
<xsl:variable name="temp53">
  <xsl:call-template name="replacement">
    <xsl:with-param name="text">
      <xsl:value-of select="$temp52"/>
    </xsl:with-param>
    <xsl:with-param name="troff">\\\(\?c</xsl:with-param>
    <xsl:with-param name="unicode">R</xsl:with-param>
  </xsl:call-template>
</xsl:variable>
. . .
<xsl:template name="replacement">
  <xsl:param name="text"/>
  <xsl:param name="troff"/>
  <xsl:param name="unicode"/>
  <xsl:value-of select="replace($text, $troff, $unicode)"/>
</xsl:template>
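
For readers who want to see the shape of the generator step, here is a minimal sketch of a stylesheet that could produce a chain like the one above. It is not the actual generator script. It assumes the mapping file has a root element named mappings (a hypothetical name) wrapping the mapping elements, that the mappings are already listed in the order in which they should be applied, and that no unicode value contains $ or \, which are special in a replace() replacement string. The out prefix and its namespace URI are also arbitrary choices.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:out="http://example.org/xsl-alias">

  <!-- Elements written with the out: prefix are emitted as xsl: elements
       in the generated stylesheet. -->
  <xsl:namespace-alias stylesheet-prefix="out" result-prefix="xsl"/>
  <xsl:output method="xml" indent="yes"/>

  <!-- Assumption: the mapping file's root element is called "mappings". -->
  <xsl:template match="/mappings">
    <out:stylesheet version="2.0">
      <out:template match="text()">
        <!-- temp0 holds the untouched text; tempN holds the text after
             the Nth mapping has been applied. -->
        <out:variable name="temp0" select="string(.)"/>
        <xsl:for-each select="mapping">
          <out:variable name="temp{position()}">
            <out:call-template name="replacement">
              <out:with-param name="text" select="$temp{position() - 1}"/>
              <out:with-param name="troff">
                <!-- Backslash-escape regex metacharacters so that the troff
                     sequence is matched literally by replace(). -->
                <xsl:value-of
                    select="replace(troff, '([\\\[\](){}.*+?^$|-])', '\\$1')"/>
              </out:with-param>
              <out:with-param name="unicode">
                <xsl:value-of select="unicode"/>
              </out:with-param>
            </out:call-template>
          </out:variable>
        </xsl:for-each>
        <!-- Emit the result of the last replacement in the chain. -->
        <out:value-of select="$temp{count(mapping)}"/>
      </out:template>

      <out:template name="replacement">
        <out:param name="text"/>
        <out:param name="troff"/>
        <out:param name="unicode"/>
        <out:value-of select="replace($text, $troff, $unicode)"/>
      </out:template>
    </out:stylesheet>
  </xsl:template>
</xsl:stylesheet>
```

The sketch passes the previous variable through a select attribute rather than a nested xsl:value-of, and it leaves out the final translate() step and any templates for copying elements, so the generated stylesheet would need those added before it could stand in for the real one.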

The program logic is as follows. For each text node, the template rule passes the textual content to replace(), which replaces one troff encoding with the corresponding Unicode value. replace() is then called again with the next mapping, and so on; the text is passed through the whole chain of remappings, and when it emerges at the other end, all multi-character troff sequences have been replaced with Unicode characters. There are 64 such mappings. I use replace() only where a multi-character troff string has to be replaced by a single Unicode character; at the end of the series of calls to replace(), I use translate() to do the remaining one-to-one mappings (approximately 50 of them) in a single function call.

The order of the mappings matters: I need to remap longer strings before shorter ones, since a shorter one may be a substring of a longer one. In particular, I can remap individual characters (the one-to-one mappings) only after I've taken care of all of the many-to-one ones.
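
The ordering constraint can be illustrated with a small, self-contained XSLT 2.0 sketch. Everything in it is hypothetical and chosen only for the demonstration: the two many-to-one mappings (the second troff sequence is a prefix of the first), the Unicode values, the three one-to-one pairs handed to translate(), and the function names f:escape and f:apply. It sorts the mappings by the length of the troff sequence, longest first, applies them one after another with replace(), and only then calls translate().

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:f="http://example.org/functions"
    exclude-result-prefixes="xs f">

  <xsl:output method="text"/>

  <!-- Hypothetical mappings, deliberately listed shortest first. -->
  <xsl:variable name="mappings">
    <mapping><troff>\(q</troff><unicode>X</unicode></mapping>
    <mapping><troff>\(qb</troff><unicode>б</unicode></mapping>
  </xsl:variable>

  <!-- Escape regex metacharacters so a troff sequence matches literally. -->
  <xsl:function name="f:escape" as="xs:string">
    <xsl:param name="s" as="xs:string"/>
    <xsl:sequence select="replace($s, '([\\\[\](){}.*+?^$|-])', '\\$1')"/>
  </xsl:function>

  <!-- Apply the mappings in the order given, one replace() per mapping. -->
  <xsl:function name="f:apply" as="xs:string">
    <xsl:param name="text" as="xs:string"/>
    <xsl:param name="maps" as="element(mapping)*"/>
    <xsl:sequence select="
      if (empty($maps)) then $text
      else f:apply(
             replace($text, f:escape($maps[1]/troff), $maps[1]/unicode),
             subsequence($maps, 2))"/>
  </xsl:function>

  <xsl:template name="main">
    <!-- Longest troff sequence first, so \(qb is handled before \(q. -->
    <xsl:variable name="sorted" as="element(mapping)*">
      <xsl:perform-sort select="$mappings/mapping">
        <xsl:sort select="string-length(troff)" order="descending"/>
      </xsl:perform-sort>
    </xsl:variable>
    <xsl:variable name="multi" select="f:apply('\(qbabc', $sorted)"/>
    <!-- One-to-one mappings last, in a single translate() call. -->
    <xsl:value-of select="translate($multi, 'abc', 'абц')"/>
  </xsl:template>
</xsl:stylesheet>
```

Run with an XSLT 2.0 processor started at the named template main (with Saxon, for example, via the -it:main option), this should print бабц. If the mappings were applied in their listed order instead, \(q would match first and the result would begin with X; if translate() ran first, the b inside \(qb would already have been remapped and the longer sequence would never match.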

The input file (XML with troff character coding instead of the desired Unicode) is 6.7 MB and the Unicode output is 7.8 MB. The transformation takes approximately five minutes to run, which feels like an eternity, but I'm not sure to what extent the execution time reflects the size of the input file and the number of replacements that need to be performed, and to what extent it reflects inefficient program design. Can anyone suggest a revision that would considerably improve efficiency (bearing in mind that the XSLT script that does the actual character remapping must be generated by XSLT from the mappings file)?

Thanks,

David
djbpitt+xml@xxxxxxxx
