Re: [xsl] how to extract text, translate and re-insert it in XHTML

Subject: Re: [xsl] how to extract text, translate and re-insert it in XHTML
From: Evan Lenz <evan@xxxxxxxxxxxx>
Date: Wed, 07 Jan 2009 12:48:41 -0800
That sounds like an interesting problem. If the English->Klingon translator would leave a trace of what translated to what, then it might be feasible (though difficult) to reconstruct the inline markup. Failing that, it seems nigh impossible.

But that's assuming the document uses inline markup (which you didn't explicitly specify). If it's a matter of just getting different sections back in place, then you'd probably make multiple calls out to the translator, one for each blob of text. Of course, I suppose you could try the same for inline markup. It just might come out reading a bit funny and disconnected (but I suppose that's to be expected from an automatic translator anyway...).

For example, take this document:

<p>Hello this is <strong>bold</strong>. This is <em>italic</em>.</p>

You could call the translator for each non-whitespace-only text node in the document.

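<xsl:template match="/">
  <translate-me>
    <xsl:apply-templates/>
  </translate-me>
</xsl:template>

<!-- Ignore whitespace-only text -->
<xsl:template match="text()"/>

<!-- Wrap every other text node in its own element for the translator -->
<xsl:template match="text()[normalize-space()]">
  <to-translator>
    <xsl:value-of select="."/>
  </to-translator>
</xsl:template>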

For the above document, that would yield:

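 <to-translator>Hello this is </to-translator>
 <to-translator>bold</to-translator>
 <to-translator>. This is </to-translator>
 <to-translator>italic</to-translator>
 <to-translator>.</to-translator>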

This reveals a further requirement: strip out and reconstruct punctuation that lies at the edges of a text blob (and that the translator would likely ignore anyway). You could do this using regular expressions. I'm not going to trouble myself with that right now, but the result might look like this:

 <to-translator>Hello this is </to-translator>
 <to-translator>bold</to-translator>
 <to-translator sentence-boundary="yes">This is </to-translator>
 <to-translator>italic</to-translator>
 <to-translator sentence-boundary="yes"/>
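
As a rough sketch of the regular-expression step, here is one way the extraction stylesheet might look. It assumes an XSLT 2.0 processor (for matches() and replace()), and it assumes that a "sentence boundary" means a run of '.', '!' or '?' at the very start of a text node. Both are illustrative choices, not requirements of the approach.

```xslt
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <translate-me>
      <xsl:apply-templates/>
    </translate-me>
  </xsl:template>

  <!-- Ignore whitespace-only text -->
  <xsl:template match="text()"/>

  <xsl:template match="text()[normalize-space()]">
    <!-- Assumption: leading sentence punctuation plus surrounding whitespace.
         The pattern needs at least one punctuation character, so it can
         never match a zero-length string (which replace() rejects). -->
    <xsl:variable name="boundary" select="'^\s*[.!?]+\s*'"/>
    <to-translator>
      <!-- Record that punctuation was stripped so it can be restored later -->
      <xsl:if test="matches(., $boundary)">
        <xsl:attribute name="sentence-boundary">yes</xsl:attribute>
      </xsl:if>
      <xsl:value-of select="replace(., $boundary, '')"/>
    </to-translator>
  </xsl:template>

</xsl:stylesheet>
```

A text node consisting only of "." ends up as an empty element carrying the attribute, which keeps its place in the sequence so that position-based lookup still lines up.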

I wouldn't worry about commas so much, or even periods in the middle of a blob of text. Theoretically, the translator will take care of those. It's only when we chop up text near the sentence boundaries (due to inline markup, e.g., a <b> tag) that we'd have to worry about that.

Then you'd hope to construct a result like this with help from the translator:

 <from-translator>Olleh siht si </from-translator>
 <from-translator>Dlob</from-translator>
 <from-translator sentence-boundary="yes">Siht si </from-translator>
 <from-translator>Cilati</from-translator>
 <from-translator sentence-boundary="yes"/>

Reconstructing the document, you'd run another transformation against the original document, changing only the non-whitespace-only text nodes:

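<!-- By default, copy everything unchanged. -->
<xsl:template match="@* | node()">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

<!-- But replace non-whitespace-only text nodes with their translated counterparts. -->
<xsl:template match="text()[normalize-space()]">
  <!-- Position of this text node among all non-whitespace-only text nodes -->
  <xsl:variable name="text-node-position">
    <xsl:number level="any" count="text()[normalize-space()]"/>
  </xsl:variable>
  <!-- Assumption: the translator's output was saved as translated.xml
       (a hypothetical file name), with the from-translator elements
       as children of a single root element. -->
  <xsl:variable name="result"
                select="document('translated.xml')/*/from-translator[position() = number($text-node-position)]"/>
  <!-- Restore the punctuation that was stripped before translation -->
  <xsl:if test="$result/@sentence-boundary='yes'">. </xsl:if>
  <xsl:value-of select="$result"/>
</xsl:template>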

I'll leave it up to you to determine whether the results would be acceptable or not. I think it largely depends on just how much inline markup is being used. Perhaps you'd care less about preserving bold, italics, and other inline markup and care only about paragraph boundaries. That would be much easier, using an approach similar to the one above. In that case, a text blob would be passed to the translator for each paragraph rather than for every last text node. Either way, we can identify each blob of text by its position in document order.
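
For the paragraph-level variant, a minimal extraction sketch might look like the following. It assumes the elements are in the XHTML namespace, bound here to an arbitrary prefix h. For an un-namespaced sample like the one above, match plain p instead. It works with XSLT 1.0.

```xslt
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/1999/xhtml"
    exclude-result-prefixes="h">

  <xsl:output indent="yes"/>

  <xsl:template match="/">
    <translate-me>
      <!-- One blob per paragraph, in document order -->
      <xsl:apply-templates select="//h:p"/>
    </translate-me>
  </xsl:template>

  <!-- The string value of a paragraph is the concatenation of all its
       descendant text nodes, so inline markup is dropped. -->
  <xsl:template match="h:p">
    <to-translator>
      <xsl:value-of select="normalize-space(.)"/>
    </to-translator>
  </xsl:template>

</xsl:stylesheet>
```

Reassembly would then count paragraphs rather than text nodes (xsl:number with level="any" and count="h:p") and replace each paragraph's content with the translated blob at the same position. The inline markup inside those paragraphs would be lost, which is the trade-off this variant accepts.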


Robert P. J. Day wrote:
  it's been a while since i've written anything in XSLT so i'm going
to try to explain what a colleague is trying to do, assuming *i*
understand it.

  1) start with an involved XHTML document
  2) "extract" just those (english) parts that involve translatable
     text, and hand it to a translator
  3) translator translates english to, say, klingon
  4) rebuild original document with klingon content instead of english

as i understand it, the point of the extraction is that no one wants
to burden the translator with all of the XHTML tagging -- the
translator wants to get the text stripped of all the "clutter", at
which point, after translation, someone needs to be able to put the
document back together.

  is this even a reasonable thing to ask?  in order to reassemble the
document, i'm assuming one is going to have to ID every single bit of
text to have a reference to build backwards.

  thoughts on this?  has anyone done something like this?  or are you
all too busy laughing hysterically by now?

