Re: [xsl] Trouble with special characters

Subject: Re: [xsl] Trouble with special characters
From: "Eliot Kimber ekimber@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 25 Jan 2016 22:46:19 -0000
Yes, I used ASCII as a generic term for ISO-8859, but Peter is correct, I
should be more precise. Perhaps I'm showing my age.

Cheers,

E.
----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com




On 1/25/16, 3:57 PM, "Peter West lists@xxxxxxxxx"
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

>Replace "ASCII" in the following with "ISO-8859-1"?
>
>Peter West
>
>as they were delivered to us by those who from the beginning were
>eyewitnesses

>
>> On 26 Jan 2016, at 5:36 am, Eliot Kimber ekimber@xxxxxxxxxxxx
>><xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> For a situation like this you have to look closely at the chain of
>>custody
>> of the data as it comes in and out of different tools--any component
>>that
>> touches it has the opportunity to mess things up.
>>
>> As others have pointed out, if the data coming in is correct then the
>>data
>> going out as produced directly by Saxon should be correct as well. That
>> is, the mapping from Unicode characters to ISO-8859 should be handled
>> correctly by the serializer Saxon is using.
>>
>> The "gibberish" you're showing is the three bytes of the UTF-8 encoded
>> "REPLACEMENT CHARACTER" interpreted as individual Unicode characters. The
>> UTF-8 encoding of this character, Unicode code point U+FFFD, is 0xEF 0xBF
>> 0xBD. Character 0xEF (239) is i-umlaut in ISO-8859-1, 0xBF (191) is
>> inverted question mark, and 0xBD (189) is the 1/2 fraction. Thus your
>> gibberish.
>> (http://www.fileformat.info/info/unicode/char/0fffd/index.htm)
>>
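The byte arithmetic above can be checked in a few lines. This is only a minimal Python sketch (Python is my choice here, not something from the original poster's tool chain), and the variable names are illustrative.

```python
# U+FFFD REPLACEMENT CHARACTER and its UTF-8 encoding.
replacement = "\ufffd"
utf8_bytes = replacement.encode("utf-8")
print(" ".join(f"{b:02X}" for b in utf8_bytes))  # EF BF BD

# Misread each of those bytes as one ISO-8859-1 character.
misread = utf8_bytes.decode("iso-8859-1")
print([f"U+{ord(c):04X}" for c in misread])  # ['U+00EF', 'U+00BF', 'U+00BD']
```

U+00EF, U+00BF and U+00BD are i-umlaut, inverted question mark and the 1/2 fraction, which matches the three characters in the reported output.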
>> So the following is happening somewhere in your tool chain:
>>
>> 1. Something is not recognizing the character you think should be a degree
>> symbol as a known Unicode character and is replacing it with the Unicode
>> replacement character.
>>
>> 2. Something is then reading the bytes resulting from (1) as ASCII
>>rather
>> than UTF-8 and treating each byte of the replacement character sequence
>>as
>> individual ASCII characters.
>>
>> 3. The remaining stages don't know any better and continue to treat the
>> characters as characters, resulting in the three characters i-umlaut,
>> inverted question mark, 1/2 fraction in the output.
>>
>> I think the most likely thing is that something is reading the incoming
>> ASCII as Unicode, not recognizing the ASCII byte "0xB0" (degree symbol) as
>> a Unicode character (because a lone 0xB0 byte is not a valid UTF-8
>> sequence), and replacing it with the Unicode replacement character.
>>
>> Something then reads this byte sequence as ASCII, not UTF-8, but then
>> generates UTF-8 output (otherwise the byte sequence would be the same on
>> input and output), resulting in the gibberish.
>>
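The suspected chain can be reproduced end to end. The sketch below is a guess at the shape of the problem, not a description of any particular tool: the function name `mangle` and the sample string are hypothetical, and it assumes the first stage decodes with a "replace invalid bytes" policy, as many decoders do.

```python
def mangle(raw: bytes) -> bytes:
    """Reproduce the suspected chain for ISO-8859-1 input bytes."""
    # Stage 1: the ISO-8859-1 bytes are wrongly decoded as UTF-8. The lone
    # byte 0xB0 is invalid there, so the decoder substitutes U+FFFD.
    stage1 = raw.decode("utf-8", errors="replace")
    # Stage 1 writes its result as UTF-8: U+FFFD becomes EF BF BD.
    on_disk = stage1.encode("utf-8")
    # Stage 2: those bytes are read as ISO-8859-1, one character per byte.
    stage2 = on_disk.decode("iso-8859-1")
    # Stage 3: the three bogus characters are written out as UTF-8.
    return stage2.encode("utf-8")


source = "20\u00b0C".encode("iso-8859-1")  # degree sign is the single byte 0xB0
result = mangle(source)
print(source)  # b'20\xb0C'
print(result)  # b'20\xc3\xaf\xc2\xbf\xc2\xbdC'
```

Decoding `result` as UTF-8 gives "20" followed by i-umlaut, inverted question mark, 1/2 fraction, then "C", so the original degree sign is unrecoverable after stage 1.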
>> Some tools write XML in one encoding but put in a different encoding
>> declaration, e.g., a file is written as ISO-8859 but with a UTF-8
>>encoding
>> declaration. This would lead to the behavior we're seeing here, where
>>the
>> degree symbol should be encoded as two UTF-8 bytes but is output as a
>> single ASCII byte.
>>
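One way to catch a mislabeled file early is to check whether its bytes actually decode in the encoding it declares. The following Python sketch assumes the declaration sits at the very start of the file and is ASCII-compatible; the helper names `declared_encoding` and `decodes_as_declared` are hypothetical, and a real XML parser performs a stricter version of this check.

```python
import re


def declared_encoding(data: bytes) -> str:
    """Return the encoding named in the XML declaration, else 'utf-8'."""
    match = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    return match.group(1).decode("ascii") if match else "utf-8"


def decodes_as_declared(data: bytes) -> bool:
    """True if the bytes are at least decodable in the declared encoding."""
    try:
        data.decode(declared_encoding(data))
        return True
    except (UnicodeDecodeError, LookupError):
        return False


text = '<?xml version="1.0" encoding="UTF-8"?><temp>20\u00b0C</temp>'
mislabeled = text.encode("iso-8859-1")  # degree sign written as 0xB0
honest = text.encode("utf-8")           # degree sign written as 0xC2 0xB0

print(decodes_as_declared(mislabeled))  # False
print(decodes_as_declared(honest))      # True
```

The check only works in one direction: a UTF-8 file mislabeled as ISO-8859-1 would pass, because every byte sequence decodes as ISO-8859-1.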
>> Using Java it's easy to forget to specify the encoding when writing a byte
>> sequence using a Writer or when constructing a String instance. This will
>> result in the bytes being written in the default encoding for the system
>> running the application, which is almost always *not* a Unicode encoding.
>> Other languages have similar pitfalls.
>>
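As an illustration of the same pitfall in another language, here is a small Python sketch rather than Java. In Python, calling `open()` without an `encoding` argument has historically fallen back to a locale-dependent default, so the safe habit is to name the encoding on every read and write. The file name is a placeholder.

```python
import os
import tempfile

text = "20\u00b0C"
path = os.path.join(tempfile.mkdtemp(), "reading.txt")

# Name the encoding explicitly instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as out:
    out.write(text)

# Inspect the raw bytes: the degree sign is the two-byte sequence C2 B0.
with open(path, "rb") as raw:
    data = raw.read()
print(data)  # b'20\xc2\xb0C'

# Read it back with the same explicit encoding.
with open(path, "r", encoding="utf-8") as inp:
    round_tripped = inp.read()
print(round_tripped == text)  # True
```

Looking at the raw bytes, as in the middle step, is often the quickest way to find out which stage of a pipeline wrote the wrong encoding.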
>> I find the free Windows tool Unipad to be invaluable when trying to
>>track
>> down this type of encoding problem--it does a good job of guessing the
>> real encoding and also has tools for converting between many encodings,
>> inspecting files in uncommon encodings, and so on. However, oXygenXML
>>has
>> a lot of good tools for this now, so I depend on Unipad less than I used
>> to 10 years ago. (http://www.unipad.org/main/)
