Re: [xsl] How to copy attribute value to text? (Suspected bug involving supplementary characters)

Subject: Re: [xsl] How to copy attribute value to text? (Suspected bug involving supplementary characters)
From: "Kenneth Reid Beesley krbeesley@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Thu, 7 Jul 2016 18:54:30 -0000
From: Kenneth Reid Beesley <krbeesley@xxxxxxxxx>
Subject: Re: [XSL-List: The Open Forum on XSL] Digest for 2016-07-06
Date: July 7, 2016 at 12:43:54 PM EDT
To: "XSL-List: The Open Forum on XSL"
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>


Many thanks to Martin Honnen for his response below.  I add more comments
below (suspected bug in Saxon).


> On 7Jul2016, at 05:28, XSL-List: The Open Forum on XSL <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
> From: Martin Honnen <martin.honnen@xxxxxx>
> Subject: Re: [xsl] How to copy attribute value to text?
> Date: 7 July 2016 at 00:43:37 MDT
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
>
>
> On 07.07.2016 07:22, Kenneth Reid Beesley krbeesley@xxxxxxxxx wrote:
>> If I start with an input XML document that contains mixed text with <word> elements like this:
>>
>> 	… this is just <word correction="too">to</word> funny
>>
>> I'd like to write an XSLT stylesheet that yields as output
>>
>> 	… this is just <word origerror="to">too</word> funny
>>
>> So in the output I effectively want (in the same <word> element) to
>>
>> 	1.  Set the value of a new attribute to the original text() value, and
>> 	2.  Reset the text() value to be the value of the original @correction attribute
>>
>> I've tried many variants of the following, so far without success.  I'm using SaxonHE9-7-0-6J;
>> it runs, but the results are not as expected/hoped.
>
>> I've tried matching the text() in a separate template, but I can't seem to reference the attribute values of the parent node (i.e., <word>) of the text() and the parent node's attributes.  E.g., the following doesn't work for me, failing somehow in the
>> select="../@correction"  reference.
>>
>> <xsl:template match="word[@correction]/text()">
>> 	<xsl:value-of select="../@correction"/>
>> </xsl:template>
>
>
> You can use
>
> 	<xsl:template match="@* | node()">
> 		<xsl:copy>
> 			<xsl:apply-templates select="@* | node()"/>
> 		</xsl:copy>
> 	</xsl:template>
>
> 	<xsl:template match="word[@correction]/text()">
> 		<xsl:value-of select="../@correction"/>
> 	</xsl:template>
>
> 	<xsl:template match="word/@correction">
> 		<xsl:attribute name="origerror" select=".."/>
> 	</xsl:template>

Your solution looks right and works as intended for ASCII-based XML input
examples like the following:

<?xml version="1.0" encoding="UTF-8"?>

<foo>
  <bar>this is just <word correction="too">to</word> funny</bar>
</foo>

yielding the correct/desired output

<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <bar>this is just <word origerror="to">too</word> funny</bar>
</foo>


I now see that some of my own attempts also worked, on the same ASCII-based
example.

*****  Suspected bug involving supplementary characters *****

But my real task involves an input XML document, in UTF-8 encoding, that
consists of Deseret Alphabet characters, which are encoded in the
Supplementary Multilingual Plane (U+10400 to U+1044F).  In such a case, the
text content of the output <word> element, copied from the original
attribute value, is corrupted.  I saw such corruption in my own attempts,
and couldn't understand what was happening.

Using the following input document (the Deseret Alphabet characters may not
display correctly for you)

<?xml version="1.0" encoding="UTF-8"?>

<foo>
  <bar>pp.p p.p p>p2pp; <word
correction="p;p-">pp/p	p.</word> pp2pp.</bar>
</foo>

the output, using your script, is corrupted.  The text() value in the output
is not the same as the original @correction value.  Extra characters (just one
in this case) are inserted.  The longer the original attribute value, the more
extra characters are inserted.

<?xml version="1.0" encoding="UTF-8"?>
<foo>
  <bar>pp.p p.p p>p2pp; <word
origerror="pp/p	p.">p;p;p-</word> pp2pp.</bar>
</foo>

This kind of corruption is exactly what I was seeing using my own scripts,
leading me to bother the group.
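
To pin down exactly which characters are being inserted, a small diagnostic
stylesheet along the following lines might help.  This is only a sketch,
assuming an XSLT 2.0 processor (it relies on string-to-codepoints() and on the
separator attribute of xsl:value-of).  For every <word> element it prints the
decimal code points of each attribute value and of the text content.  Running
it once over the input and once over the corrupted output should show whether
the extra characters are repeated Deseret code points (66560 to 66639, i.e.
U+10400 to U+1044F) or something else.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Plain-text report, so nothing depends on how a terminal or mail
       client renders the Deseret letters themselves. -->
  <xsl:output method="text" encoding="UTF-8"/>

  <xsl:template match="/">
    <xsl:for-each select="//word">
      <!-- One line per attribute: name, then space-separated code points. -->
      <xsl:for-each select="@*">
        <xsl:value-of select="concat('@', name(), ': ')"/>
        <xsl:value-of select="string-to-codepoints(.)" separator=" "/>
        <xsl:text>&#10;</xsl:text>
      </xsl:for-each>
      <!-- One line for the string value of the <word> element. -->
      <xsl:text>text: </xsl:text>
      <xsl:value-of select="string-to-codepoints(string(.))" separator=" "/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

If the code points reported for the output text differ from those reported for
the input @correction attribute, that difference would be a compact thing to
include in a bug report.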

I suspect a bug in the XSLT engine involving supplementary characters.  Again,
I'm using SaxonHE9-7-0-6J.
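
Since the Deseret letters may get mangled in transit, a test input written
entirely in ASCII, using numeric character references, may be easier for
others to reproduce with.  The sketch below is illustrative only: the code
points are arbitrary picks from the Deseret block (U+10400 to U+1044F), not my
real data.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative test input: every non-ASCII character is written as a
     numeric character reference to a Deseret code point, so the file
     survives mail clients and archives unchanged. -->
<foo>
  <bar>&#x10414;&#x10428; <word correction="&#x1043B;&#x1042D;">&#x1043B;&#x10433;</word> &#x10429;&#x1042F;</bar>
</foo>
```

With an input like this, the expected text content of the output <word>
element is exactly the two characters of the @correction value.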

What's my next step?

Thanks,

Ken

********************************
Kenneth R. Beesley, D.Phil.
PO Box 540475
North Salt Lake UT 84054
USA









