Subject: Re: [xsl] Trouble with special characters
From: "Eliot Kimber ekimber@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Mon, 25 Jan 2016 22:46:19 -0000
Yes, I used ASCII as a generic term for ISO-8859, but Peter is correct, I
should be more precise. Perhaps I'm showing my age.

Cheers,

E.
----
Eliot Kimber, Owner
Contrext, LLC
http://contrext.com

On 1/25/16, 3:57 PM, "Peter West lists@xxxxxxxxx"
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

>Replace 'ASCII' in the following with 'ISO-8859-1'?
>
>Peter West
>
>as they were delivered to us by those who from the beginning were
>eyewitnesses
>
>> On 26 Jan 2016, at 5:36 am, Eliot Kimber ekimber@xxxxxxxxxxxx
>> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>> For a situation like this you have to look closely at the chain of
>> custody of the data as it comes in and out of different tools--any
>> component that touches it has the opportunity to mess things up.
>>
>> As others have pointed out, if the data coming in is correct then the
>> data going out as produced directly by Saxon should be correct as well.
>> That is, the mapping from Unicode characters to ISO-8859 should be
>> handled correctly by the serializer Saxon is using.
>>
>> The "gibberish" you're showing is the three bytes of the UTF-8 encoded
>> "REPLACEMENT CHARACTER" interpreted as individual Unicode characters.
>> The UTF-8 encoding of this character, Unicode code point FFFD, is
>> 0xEF 0xBF 0xBD. Character 0xEF (239) is i-umlaut in ISO-8859, 0xBF (191)
>> is inverted question mark, and 0xBD (189) is the 1/2 fraction. Thus your
>> gibberish.
>> (http://www.fileformat.info/info/unicode/char/0fffd/index.htm)
>>
>> So the following is happening somewhere in your tool chain:
>>
>> 1. Something is not recognizing the character you think should be a
>> degree symbol as a known Unicode character and is replacing it with the
>> Unicode replacement character, written out as UTF-8.
>>
>> 2. Something is then reading the bytes resulting from (1) as ASCII
>> rather than UTF-8 and treating each byte of the replacement character
>> sequence as an individual ASCII character.
>>
>> 3. The remaining stages don't know any better and continue to treat the
>> characters as characters, resulting in the three characters i-umlaut,
>> inverted question mark, 1/2 fraction in the output.
>>
>> I think the most likely thing is that something is reading the incoming
>> ASCII as Unicode, not recognizing the ASCII byte "0xB0" (degree symbol)
>> as a Unicode character (because it's not one in any Unicode-defined
>> encoding), and replacing it with the Unicode replacement character.
>>
>> Something then reads this byte sequence as ASCII, not UTF-8, but then
>> generates UTF-8 output (otherwise the byte sequence would be the same on
>> input and output), resulting in the gibberish.
>>
>> Some tools write XML in one encoding but put in a different encoding
>> declaration, e.g., a file is written as ISO-8859 but with a UTF-8
>> encoding declaration. This would lead to the behavior we're seeing here,
>> where the degree symbol should be encoded as two UTF-8 bytes but is
>> output as a single ASCII byte.
>>
>> Using Java it's easy to forget to specify the encoding when writing a
>> byte sequence using a Writer or when constructing a String instance.
>> This will result in the bytes being written in the default encoding for
>> the system running the application, which is almost always *not* a
>> Unicode encoding. Other languages have similar pitfalls.
>>
>> I find the free Windows tool Unipad to be invaluable when trying to
>> track down this type of encoding problem--it does a good job of guessing
>> the real encoding and also has tools for converting between many
>> encodings, inspecting files in uncommon encodings, and so on. However,
>> oXygenXML has a lot of good tools for this now, so I depend on Unipad
>> less than I did 10 years ago. (http://www.unipad.org/main/)
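
The byte-level chain described in the quoted message can be reproduced with a short sketch. This is only an illustration under stated assumptions: it uses Python rather than the Saxon/Java tool chain discussed in the thread, it assumes the incoming degree sign is the single ISO-8859-1 byte 0xB0, and it assumes the first misbehaving tool decodes leniently (substituting U+FFFD instead of raising an error). The variable names are hypothetical.

```python
# Sketch of the suspected failure chain; not the actual tool chain from the thread.

# Incoming data: the degree sign stored as one ISO-8859-1 byte.
latin1_bytes = "\u00b0".encode("iso-8859-1")            # b'\xb0'

# Step 1: a tool decodes the byte as UTF-8. A lone 0xB0 is not a valid
# UTF-8 sequence, so a lenient decoder substitutes U+FFFD.
misread = latin1_bytes.decode("utf-8", errors="replace")

# That tool writes its result as UTF-8: U+FFFD becomes EF BF BD.
replacement_utf8 = misread.encode("utf-8")

# Step 2: a later tool reads those three bytes as ISO-8859-1,
# producing one character per byte.
mojibake = replacement_utf8.decode("iso-8859-1")

# Step 3: the three characters are written out as UTF-8 again,
# so the byte sequence differs from the one that came in.
final_bytes = mojibake.encode("utf-8")

# For comparison: decoding with the correct encoding yields the
# expected two-byte UTF-8 form of the degree sign.
correct_utf8 = latin1_bytes.decode("iso-8859-1").encode("utf-8")

print(replacement_utf8.hex(" "))   # bytes of the UTF-8 replacement character
print(mojibake)                    # i-umlaut, inverted question mark, 1/2
print(final_bytes.hex(" "))
print(correct_utf8.hex(" "))
```

Comparing `final_bytes` with `correct_utf8` at each hand-off point in a real pipeline is one way to narrow down which component performed the wrong decode.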