[xsl] Re: replace() and translate() second try

Subject: [xsl] Re: replace() and translate() second try
From: Kenneth Reid Beesley <krbeesley@xxxxxxxxx>
Date: Sat, 4 Jun 2011 11:35:56 -0600

Thanks to Michael Kay and David Carlisle for their responses.

In an example like
	translate(string, "wxyz", "ABCD")

where w, x, y and z are supplementary characters, I find (using saxonhe9-3)
that it works if the supplementary characters are written in the hex-escape
&#xHHHHHHHH; notation, but _not_ if they are simply typed in using a
Unicode-savvy text editor that handles supplementary characters.  In case the
hex-escape sequence I just typed got garbled by email filters, it consists of
an ampersand, a pound/hash sign, an 'x', and a sequence of hex digits,
terminated with a semicolon.
Thanks to David Carlisle for suggesting the hex-escape notation.

My original XML file (containing supplementary Unicode characters from
the Deseret Alphabet block) and my XSLT script are both in UTF-8 encoding.

So something like this works:

	translate(string, '&#x10428;&#x10429;&#x1042A;&#x1042B;' , 'ABCD')

In case things get garbled again by email filters, the second argument to
translate() contains (without the spaces shown here)
four supplementary characters indicated by the hex code point values:

		& #x 10428 ;
		& #x 10429 ;
		& #x 1042A ;
		& #x 1042B ;
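
For reference, a minimal stylesheet sketch of the working form. Only the translate() call comes from the example above; the text() template and the text output method are assumptions made for illustration and are not taken from the original script.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumption: plain-text output is enough to inspect the result. -->
  <xsl:output method="text" encoding="UTF-8"/>

  <!-- The built-in templates walk the tree; this template then maps
       U+10428..U+1042B to A..D in every text node it reaches. -->
  <xsl:template match="text()">
    <xsl:value-of
        select="translate(., '&#x10428;&#x10429;&#x1042A;&#x1042B;', 'ABCD')"/>
  </xsl:template>

</xsl:stylesheet>
```

Because the XML parser resolves the character references before the XPath expression is evaluated, translate() receives the four supplementary characters themselves, while the stylesheet file stays pure ASCII.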

If, using a Unicode-savvy text editor with UTF-8 encoding for the file, I
simply type the four supplementary characters into the second string
argument, the script does not work.  This is a shame, because the script with
the real characters is far more readable (if you have a Unicode editor that
can display the supplementary character glyphs).
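
One way to check what the processor actually receives, offered as a sketch rather than a confirmed diagnosis: the XPath 2.0 function string-to-codepoints() lists the code points of a string. If the directly typed literal and the hex-escaped literal print different numbers, the characters were altered before translate() ever saw them (for instance by the editor, or by how the stylesheet file was decoded). The stylesheet below is hypothetical and runs against any XML input.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <!-- Literal typed directly (intended to be U+10428..U+1042B). -->
    <xsl:value-of select="string-to-codepoints('𐐨𐐩𐐪𐐫')" separator=" "/>
    <xsl:text>&#10;</xsl:text>
    <!-- The same four characters written as character references. -->
    <xsl:value-of
        select="string-to-codepoints('&#x10428;&#x10429;&#x1042A;&#x1042B;')"
        separator=" "/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

The second line of output should read 66600 66601 66602 66603, the decimal values of 10428 through 1042B hex. If the first line shows anything else, the problem lies upstream of translate().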

The same holds for the replace(string, 'orig', 'repl') function.  If the
second argument contains supplementary characters, they need to be written in
the hex-escape notation, or I get results that are at least inconsistent.
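
A minimal sketch of the replace() case with the hex-escape notation. The single-character pattern and the replacement 'A' are made up for illustration, as is the surrounding template; they are not taken from the original script.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text" encoding="UTF-8"/>

  <!-- The regex (second argument) is the single character U+10428,
       written as a character reference; every occurrence becomes 'A'. -->
  <xsl:template match="text()">
    <xsl:value-of select="replace(., '&#x10428;', 'A')"/>
  </xsl:template>

</xsl:stylesheet>
```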

I have a little example that I would gladly forward to anyone who is
interested.

Thanks,

Ken




>
> ----------------------------------------------------------------------
> Date: Fri, 3 Jun 2011 00:00:47 -0600
> To: xslt <xsl-list@xxxxxxxxxxxxxxxxxxxxxx>
> From: Kenneth Reid Beesley <krbeesley@xxxxxxxxx>
> Subject: replace() and translate() second try
> Message-Id: <ACE8B20D-B21E-4551-8852-4EEE0EF398A7@xxxxxxxxx>
>
> I see that my previous message got rather garbled.  Here's a simplified
> version of the question.
> Assume we have an XSLT transform with something like
>
> 	translate(string, 'abcd', 'ABCD')
>
> Obviously 'a' gets replaced with 'A', 'b' with 'B', etc.
>
> Should this still work if the 'abcd' is replaced by a string of 4
> Unicode _supplementary_ characters?
> That is, does translate() work with Characters (including supplementary
> characters) or just chars?
>
> Thanks,
>
> Ken
>
> ******************************
> Kenneth R. Beesley, D.Phil.
> P.O. Box 540475
> North Salt Lake, UT
> 84054  USA
>
> ------------------------------
>
> Date: Fri, 03 Jun 2011 08:05:08 +0100
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> From: Michael Kay <mike@xxxxxxxxxxxx>
> Subject: Re: [xsl] replace() and translate() second try
> Message-ID: <4DE887A4.8000905@xxxxxxxxxxxx>
>
> On 03/06/2011 07:00, Kenneth Reid Beesley wrote:
>> I see that my previous message got rather garbled.  Here's a simplified
>> version of the question.
>> Assume we have an XSLT transform with something like
>>
>> 	translate(string, 'abcd', 'ABCD')
>>
>> Obviously 'a' gets replaced with 'A', 'b' with 'B', etc.
>>
>> Should this still work if the 'abcd' is replaced by a string of 4 Unicode
>> _supplementary_ characters?
>> That is, does translate() work with Characters (including supplementary
>> characters) or just chars?
>>
>
> Yes, it should work correctly, and I have tests to show that it does, so
> please raise a bug report with a reproducible test case.
>
> The replace() function should also work with all Unicode characters,
> though there may be question marks here about which version of Unicode
> the characters are defined in, especially if you are trying to match
> them against Unicode character categories such as \p{Ll}.
>
> Michael Kay
> Saxonica
>
>
> ------------------------------
>
> Date: Fri, 03 Jun 2011 09:09:40 +0100
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> From: David Carlisle <davidc@xxxxxxxxx>
> Cc: Kenneth Reid Beesley <krbeesley@xxxxxxxxx>
> Subject: Re: [xsl] replace(), translate() and Unicode supplementary characters
> Message-ID: <4DE896C4.6060309@xxxxxxxxx>
>
> On 03/06/2011 05:00, Kenneth Reid Beesley wrote:
>> Questions:  Are translate() and replace() supposed to work with Unicode
>> supplementary characters?
>
> yes
>
>> If so, what am I doing wrong?
>
> hard to say as there is rather a large chance that the input you showed
> has been through several aggressive mail filters.
>
> Chances are it's an encoding error and one way to avoid those is to code
> your stylesheet in ascii.
>
> translate(
> .,
> '& #x10428;& #x10429;& #x1042A;& #x1042B;',
> '& #x0069;& #x0065;& #x0251;& #x0254;'
> )
>
> without the spaces after the &
> should work for the translation you cited.
>
> David
