Re: [xsl] Trouble with special characters

Subject: Re: [xsl] Trouble with special characters
From: "Eliot Kimber ekimber@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 25 Jan 2016 19:35:04 -0000
For a situation like this you have to look closely at the chain of custody
of the data as it comes in and out of different tools--any component that
touches it has the opportunity to mess things up.

As others have pointed out, if the data coming in is correct then the data
going out as produced directly by Saxon should be correct as well. That
is, the mapping from Unicode characters to ISO-8859-1 should be handled
correctly by the serializer Saxon is using.
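
To check this in isolation, here is a minimal sketch that serializes a
small document containing a degree symbol as ISO-8859-1 and looks at the
result. It rests on several assumptions: it uses whatever JAXP
TransformerFactory is on the classpath (the JDK's built-in one unless
Saxon is installed), it runs an identity transform rather than your
stylesheet, and the class and method names are made up for illustration.
The point is that the result goes to an OutputStream, so the serializer
rather than a Writer decides how characters become bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class SerializeLatin1 {
    // Run an identity transform and let the serializer encode the output.
    static byte[] serialize(String xml, String encoding) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            // Writing to a byte stream (not a Writer) keeps the encoding
            // decision with the serializer.
            t.transform(new StreamSource(new StringReader(xml)),
                        new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // \u00B0 is the degree symbol; the escape avoids depending on the
        // encoding of this source file.
        byte[] bytes = serialize("<t>90\u00B0</t>", "ISO-8859-1");
        System.out.println(new String(bytes, StandardCharsets.ISO_8859_1));
    }
}
```

A serializer may write the degree symbol either as the single byte 0xB0
or as a numeric character reference; both are correct. What should never
appear in this output is the three-character sequence described below.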

The "gibberish" you're showing is the three bytes of the UTF-8 encoded
"REPLACEMENT CHARACTER" interpreted as individual characters. The UTF-8
encoding of this character, Unicode code point U+FFFD, is 0xEF 0xBF 0xBD.
In ISO-8859-1, 0xEF (239) is i-umlaut, 0xBF (191) is the inverted
question mark, and 0xBD (189) is the 1/2 fraction. Thus your gibberish.
(http://www.fileformat.info/info/unicode/char/0fffd/index.htm)
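
The byte arithmetic is easy to confirm with a few lines of Java. This is
only a sketch; the class and method names are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharBytes {
    // Encode a string as UTF-8 and format each byte as two hex digits.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Misread UTF-8 bytes as ISO-8859-1: each byte becomes one character.
    static String misreadAsLatin1(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String replacement = "\uFFFD";
        System.out.println(utf8Hex(replacement)); // prints "EF BF BD"
        // Prints U+00EF, U+00BF, U+00BD: i-umlaut, inverted question
        // mark, 1/2 fraction.
        for (char c : misreadAsLatin1(replacement).toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```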

So the following is happening somewhere in your tool chain:

1. Something is not recognizing the character you think should be a degree
symbol as a known Unicode character and is replacing it with the Unicode
replacement character (U+FFFD), which is then written out as UTF-8.

2. Something is then reading the bytes resulting from (1) as a single-byte
encoding such as ISO-8859-1 rather than UTF-8 and treating each byte of
the replacement character sequence as an individual character.

3. The remaining stages don't know any better and continue to treat the
characters as characters, resulting in the three characters i-umlaut,
inverted question mark, 1/2 fraction in the output.

I think the most likely thing is that something is reading the incoming
ISO-8859-1 data as UTF-8, not recognizing the byte 0xB0 (the degree symbol
in ISO-8859-1) as a character (because a lone 0xB0 is not a valid UTF-8
sequence), and replacing it with the Unicode replacement character.

Something then reads this byte sequence as ISO-8859-1, not UTF-8, but then
generates UTF-8 output (otherwise the byte sequence would be the same on
input and output), resulting in the gibberish.
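
The chain of misreadings I am guessing at can be reproduced in a few
lines. This is a sketch of the hypothesis, not a diagnosis of your actual
tools; the class and method names are invented, and the input is assumed
to be the text "90" followed by a degree symbol stored as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

class DegreeSignChain {
    // Simulate: ISO-8859-1 bytes wrongly decoded as UTF-8, written back
    // out as UTF-8, then wrongly decoded as ISO-8859-1.
    static String simulate(byte[] latin1Input) {
        // Wrong decode: a lone 0xB0 is malformed UTF-8, so Java
        // substitutes U+FFFD.
        String afterBadRead = new String(latin1Input, StandardCharsets.UTF_8);
        // U+FFFD is written as the three bytes EF BF BD.
        byte[] written = afterBadRead.getBytes(StandardCharsets.UTF_8);
        // Second wrong decode: each byte becomes its own character.
        return new String(written, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] input = { '9', '0', (byte) 0xB0 };
        String result = simulate(input);
        // Four characters are lost for one: "90" followed by U+00EF,
        // U+00BF, U+00BD.
        for (char c : result.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Once the replacement character has been substituted, the original degree
symbol is gone; nothing downstream can recover it, so the fix has to be
made at the first bad read.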

Some tools write XML in one encoding but put in a different encoding
declaration, e.g., a file is written as ISO-8859-1 but with a UTF-8
encoding declaration. This would lead to the behavior we're seeing here,
where the degree symbol should be encoded as two UTF-8 bytes (0xC2 0xB0)
but is output as a single ISO-8859-1 byte (0xB0).
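
One way to catch such a mismatch is to decode the raw bytes strictly and
see whether they are valid UTF-8 at all. The following is a sketch under
the assumption that you can get at the bytes of the file before any
parser touches them; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class Utf8Check {
    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                // Report errors instead of silently inserting U+FFFD.
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String text = "90\u00B0"; // "90" plus the degree symbol
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.UTF_8)));
        System.out.println(isValidUtf8(text.getBytes(StandardCharsets.ISO_8859_1)));
    }
}
```

A file that claims UTF-8 in its declaration but fails this check was
almost certainly written in some other encoding. The reverse does not
hold: pure ASCII content passes the check whatever the writer intended.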

Using Java it's easy to forget to specify the encoding when writing
characters through a Writer or when constructing a String instance from
bytes. The result is that the default encoding for the system running the
application is used, which is often *not* a Unicode encoding
(Windows-1252 on many Windows systems, for example). Other languages have
similar pitfalls.
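
As an illustration, here is a small sketch that always names the charset
when turning characters into bytes. The class and method names are made
up for the example; OutputStreamWriter and StandardCharsets are standard
JDK classes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ExplicitCharset {
    // Write text through a Writer whose encoding is stated explicitly,
    // so the platform default is never consulted.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, charset)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String degree = "\u00B0";
        // Two bytes (C2 B0) in UTF-8, one byte (B0) in ISO-8859-1.
        System.out.println(write(degree, StandardCharsets.UTF_8).length);
        System.out.println(write(degree, StandardCharsets.ISO_8859_1).length);
    }
}
```

The same applies in the other direction: prefer new String(bytes,
charset) and string.getBytes(charset) over the variants that take no
charset argument.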

I find the free Windows tool Unipad to be invaluable when trying to track
down this type of encoding problem--it does a good job of guessing the
real encoding and also has tools for converting between many encodings,
inspecting files in uncommon encodings, and so on. However, oXygenXML has
a lot of good tools for this now, so I depend on Unipad less than I used
to 10 years ago. (http://www.unipad.org/main/)

Good luck.

Cheers,

Eliot

----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com




On 1/25/16, 12:36 PM, "a kusa akusa8@xxxxxxxxx"
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

>The transformed XML itself has the gibberish value for the degree
>symbol. So it displays as question marks in IE.
>
>There is a java program that uses the transformation factory to
>convert the XML. I view the results in XML Spy.
>
>On Mon, Jan 25, 2016 at 12:17 PM, Martin Honnen martin.honnen@xxxxxx
><xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>> a kusa akusa8@xxxxxxxxx wrote:
>>>
>>> And you have <xsl:output omit-xml-declaration="no"/> as well? Does the
>>> result have an XML declaration? -Yes, there is an XML declaration.
>>>
>>> Does XML Spy indicate the encoding used to display the file?- Not sure
>>> where to see this. The transformed XML has the encoding set to
>>> ISO-8859-1.
>>
>>
>> What happens when you load the XML result into a browser like IE or
>> Firefox? Are the characters displayed as you want them?
>>
>> As for using Saxon, how do you use it: do you run it from the command
>> line yourself, with the -o:result.xml output option? Or is XML Spy
>> running Saxon and maybe not doing it right?
