Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8

Subject: Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8
From: "Bridger Dyson-Smith bdysonsmith@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 11 Oct 2016 19:51:14 -0000
Hi Gerrit,

On Tue, Oct 11, 2016 at 3:29 PM, Imsieke, Gerrit, le-tex
gerrit.imsieke@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> But do we know that the characters are just bytes?
>
> Sometimes UTF-8 is being read as if it were ISO-8859-1 or CP-1252 (which
> is more likely on Windows) and then saved as UTF-8. Then â€™ are 3
> (multibyte) UTF-8 characters.
>

This is very similar to some of the advice that Liam shared with me; i.e.
something from a Windows server (I'm fairly sure that's the OS for the
application generating the $input.xml files) is reading UTF-8 and outputting
it as ISO-8859-1.
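
For illustration, the two scenarios under discussion can be told apart at the byte level. The following is a minimal Python 3 sketch, not anything taken from the actual pipeline; the variable names and the sample string are assumptions made up for the example.

```python
# Sample text containing U+2019 RIGHT SINGLE QUOTATION MARK (hypothetical sample).
original = "1930\u2019s"

# Proper UTF-8: the quotation mark becomes the three bytes E2 80 99.
utf8_bytes = original.encode("utf-8")

# Scenario 1: the UTF-8 bytes are merely labelled ISO-8859-1.
# A parser then sees U+00E2 followed by the C1 controls U+0080 and U+0099.
as_latin1 = utf8_bytes.decode("iso-8859-1")

# Scenario 2: the bytes were read as CP-1252 and saved again as UTF-8,
# so the file really contains three separate multibyte characters.
as_cp1252 = utf8_bytes.decode("cp1252")
double_encoded = as_cp1252.encode("utf-8")

print(utf8_bytes.hex())                        # bytes of the correct file
print([hex(ord(c)) for c in as_latin1[4:7]])   # code points seen in scenario 1
print(double_encoded.hex())                    # bytes of a double-encoded file
```

In scenario 1 only the encoding declaration is wrong; in scenario 2 the content itself has to be converted back.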


> If this is the case, you can correct it with
>
> iconv -t WINDOWS-1252 -f UTF-8 input.xml |
>   sed -e 's/ encoding="iso-8859-1"/ encoding="UTF-8"/' > output.xml
>

:) now *this* is different. This replaces the ISO/CP-1252/... characters with
U+FFFD, which is arguably an improvement.
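
One possible explanation for the U+FFFD, sketched in Python 3 under the assumption that input.xml holds the plain bytes E2 80 99 and only its declaration is wrong: the conversion then turns the quotation mark into the single CP-1252 byte 0x92, which is not valid UTF-8 once the declaration is changed. The function name `undo_double_encoding` and the sample strings are hypothetical.

```python
def undo_double_encoding(text: str) -> str:
    """Reverse 'UTF-8 read as CP-1252, then saved as UTF-8'.

    Mirrors what `iconv -f UTF-8 -t WINDOWS-1252` followed by relabelling
    the result as UTF-8 does, applied to an already decoded string.
    """
    return text.encode("cp1252").decode("utf-8")


# Case A: genuinely double-encoded text is repaired.
garbled = "1930\u00e2\u20ac\u2122s"
repaired = undo_double_encoding(garbled)

# Case B: the file already holds correct UTF-8 bytes (assumed here).
plain_utf8 = b"1930\xe2\x80\x99s"
# The conversion maps U+2019 to the single CP-1252 byte 0x92.
converted = plain_utf8.decode("utf-8").encode("cp1252")
# A lone 0x92 is not valid UTF-8; a lenient reader substitutes U+FFFD.
relabelled = converted.decode("utf-8", errors="replace")

print(repaired == "1930\u2019s")
print(converted)
print(relabelled == "1930\ufffds")
```

If case B applies, changing only the encoding declaration, as Wolfgang suggested, would be enough.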


> Gerrit


Bridger


>
>
> On 11.10.2016 21:23, Wolfgang Laun wolfgang.laun@xxxxxxxxx wrote:
>
>> The characters E2 80 99 are the UTF-8 encoding of the Unicode character
>> RIGHT SINGLE QUOTATION MARK.
>>
>> Simply changing the ISO-8859-1 in your XML file to UTF-8 should fix this.
>>
>>
>> On 11 October 2016 at 21:00, Bridger Dyson-Smith bdysonsmith@xxxxxxxxx
>> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>>     Hi all,
>>
>>     I'm struggling with a character encoding issue (or a character
>>     representation issue maybe?): I have input XML that looks like this
>>
>>     input.xml
>>     <?xml version="1.0" encoding="iso-8859-1"?>
>>     <documents>
>>     <document>The reality of the effect of natural ventilation in a
>>     residential attic cavity has been the topic of many debates and
>>     scholarly reports since the 1930â€™s.</document>
>>     </documents>
>>
>>     and I would like to get it to a point where the characters are
>>     represented properly, i.e.
>>
>>     output.xml
>>     <?xml version="1.0" encoding="UTF-8"?>
>>     <documents>
>>     <document>The reality of the effect of natural ventilation in a
>>     residential attic cavity has been the topic of many debates and
>>     scholarly reports since the 1930’s.</document>
>>     </documents>
>>
>>     Thanks to Liam's help on irc and reading through the list archives,
>>     it seems like an identity transform should be the right step towards
>>     getting the representation corrected, but something isn't working
>>     (or I have a misunderstanding somewhere).
>>
>>     If I apply the following identity transform with Saxon HE 9.6.0.7 in
>>     oXygen 18:
>>     <?xml version="1.0" encoding="UTF-8"?>
>>     <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>>     version="2.0">
>>     <xsl:output encoding="UTF-8" indent="yes"/>
>>     <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
>>     </xsl:stylesheet>
>>
>>     I get the following result:
>>     <?xml version="1.0" encoding="UTF-8"?>
>>     <documents>
>>      <document>The reality of the effect of natural ventilation in a
>>     residential attic cavity has been the topic of many debates and
>>     scholarly reports since the 1930â&#x80;&#x99;s.</document>
>>     </documents>
>>
>>     Could someone provide some insight into what I've done wrong here?
>>     Any help would be greatly appreciated.
>>
>>     Best,
>>     Bridger
>>
>>     XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
>>
>>
>
> --
> Gerrit Imsieke
> Geschäftsführer / Managing Director
> le-tex publishing services GmbH
> Weissenfelser Str. 84, 04229 Leipzig, Germany
> Phone +49 341 355356 110, Fax +49 341 355356 510
> gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de
>
> Registergericht / Commercial Register: Amtsgericht Leipzig
> Registernummer / Registration Number: HRB 24930
>
> Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
> Thomas Schmidt, Dr. Reinhard Vöckler
> ------------------------------------------------------------
> Meet us at Frankfurt Book Fair:
> Hall 4.2, Stand L68.
> More info at http://www.le-tex.de/en/buchmesse.html
