[xsl] Re: Efficiency and replace()

Subject: [xsl] Re: Efficiency and replace()
From: "Dimitre Novatchev" <dnovatchev@xxxxxxxxx>
Date: Sun, 10 Sep 2006 12:08:20 -0700
Cyrillic characters in the quoted message have been replaced by spaces,
because they caused Gmail to use base64 encoding, which the xsl-list
server rejected.

Hi David,

If you can send me the actual troff file and the definition of the
mappings, I would be interested in looking for a better solution.

It seems to me that the str-map template of FXSL 1.x should be more
efficient, as it performs all the replacements in a single pass over
the string, instead of rescanning the whole text once per mapping.
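
To make the single-pass idea concrete, here is a rough sketch. Note that
this is not the FXSL str-map template itself: it swaps in
xsl:analyze-string from XSLT 2.0, which should be available since the
quoted code already uses replace(). The file name mappings.xml is an
assumption; the <mapping>/<troff>/<unicode> structure is taken from the
quoted message. All troff sequences are joined into one regular
expression, longest first, so each text node is scanned only once.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Assumed file name; each entry is <mapping><troff/><unicode/></mapping> -->
  <xsl:variable name="mappings" select="document('mappings.xml')//mapping"/>

  <!-- One alternation of all troff sequences, longest first,
       with regex metacharacters escaped by a backslash -->
  <xsl:variable name="regex" as="xs:string">
    <xsl:variable name="alternatives" as="xs:string*">
      <xsl:for-each select="$mappings">
        <xsl:sort select="string-length(troff)" order="descending"/>
        <xsl:sequence select="replace(troff, '[\\\[\](){}.*+?^$|\-]', '\\$0')"/>
      </xsl:for-each>
    </xsl:variable>
    <xsl:sequence select="string-join($alternatives, '|')"/>
  </xsl:variable>

  <!-- Identity template: copy the XML tagging unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Scan each text node once, replacing every troff sequence found -->
  <xsl:template match="text()">
    <xsl:analyze-string select="." regex="{$regex}">
      <xsl:matching-substring>
        <xsl:value-of select="$mappings[troff = current()][1]/unicode"/>
      </xsl:matching-substring>
      <xsl:non-matching-substring>
        <xsl:value-of select="."/>
      </xsl:non-matching-substring>
    </xsl:analyze-string>
  </xsl:template>

</xsl:stylesheet>
```

When two alternatives match at the same position, the first one in the
pattern wins, which is why the sequences are sorted longest first. The
sketch assumes the mappings file is not empty (an empty regex is an
error) and I have not timed it against the real data.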





On 9/10/06, David J Birnbaum <djbpitt+xml@xxxxxxxx> wrote:
> Dear XSLTians,
>
> For a troff-to-XML/Unicode conversion I've implemented a strategy that
> produces the desired result, but that does the conversion to Unicode
> slowly, and I would be grateful for advice about improving the efficiency.
>
> I handle the conversion of the structural marked up XML first, and I
> wind up with all of my XML tagging in place, but the text strings use
> troff escape sequences, rather than Unicode. The text is almost all
> medieval Cyrillic, and most of the Cyrillic characters are represented
> in the troff with sequences of several ASCII characters. The strategy I
> adopted to convert the troff character encoding to Unicode was to create
> a mapping file for the troff-to-Unicode character correspondences.
> Here's a snippet (a single mapping correspondence):
>
> <mapping>
> <troff>\(qb</troff>
> <unicode> </unicode>
> </mapping>
>
> I then wrote an XSLT script that reads the file of mappings and
> generates another XSLT script that will do the actual remapping. Here's
> a snippet of the generated XSLT script; this snippet is taken from
> within a template rule for text() nodes (the named template that gets
> called follows the snippet):
>
> <xsl:variable name="temp52">
> <xsl:call-template name="replacement">
> <xsl:with-param name="text">
> <xsl:value-of select="$temp51"/>
> </xsl:with-param>
> <xsl:with-param name="troff">\\\(\?s</xsl:with-param>
> <xsl:with-param name="unicode"> </xsl:with-param>
> </xsl:call-template>
> </xsl:variable>
> <xsl:variable name="temp53">
> <xsl:call-template name="replacement">
> <xsl:with-param name="text">
> <xsl:value-of select="$temp52"/>
> </xsl:with-param>
> <xsl:with-param name="troff">\\\(\?c</xsl:with-param>
> <xsl:with-param name="unicode"> </xsl:with-param>
> </xsl:call-template>
> </xsl:variable>
> . . .
> <xsl:template name="replacement">
> <xsl:param name="text"/>
> <xsl:param name="troff"/>
> <xsl:param name="unicode"/>
> <xsl:value-of select="replace($text, $troff, $unicode)"/>
> </xsl:template>
>
> The program logic is that for each text node, the template rule passes
> the textual contents to a replace() function that replaces a troff
> encoding with the corresponding Unicode value. The replace() function is
> then called again with the next mapping. The textual content is passed
> along through repeated remappings, and when it emerges on the other end,
> all multi-character troff sequences have been replaced with Unicode
> characters. There are 64 such mappings. I use replace() only for places
> where a multi-character troff string has to be replaced by a single
> Unicode character; at the end of the series of calls to replace() I use
> translate() to do the remaining one-to-one mappings (there are
> approximately 50 of them) in a single function call. The order of the
> mappings is (obviously) important; I need to remap longer strings before
> shorter ones, since the shorter ones may be subcomponents of the longer
> ones. In particular, I can remap individual characters (the one-to-one
> mappings) only after I've taken care of all of the many-to-one ones.
>
> The input file (XML with troff character coding instead of the desired
> Unicode) is 6.7MB and the Unicode output is 7.8MB. The transformation
> takes approximately five minutes to run, which feels like an eternity,
> but I'm not sure to what extent the execution time reflects the size of
> the input file and the number of replacements that need to be
> performed, and to what extent it reflects inefficient program design.
> Can anyone suggest a revision that would provide a considerable
> improvement in efficiency (bearing in mind that the XSLT script that
> does the actual character remapping must be generated by XSLT from the
> mappings file)?
>
> Thanks,
>
> David
> djbpitt+xml@xxxxxxxx
>
>




--
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
