Subject: Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8
From: "Bridger Dyson-Smith bdysonsmith@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 11 Oct 2016 19:51:14 -0000

Hi Gerrit,

On Tue, Oct 11, 2016 at 3:29 PM, Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> But do we know that the characters are just bytes?
>
> Sometimes UTF-8 is being read as if it were ISO-8859-1 or CP-1252 (which
> is more likely on Windows) and then saved as UTF-8. Then â€™ are 3
> (multibyte) UTF-8 characters.

This is very similar to some of the advice that Liam shared with me; i.e. something from a Windows server (I'm fairly sure that's the OS for the application generating the $input.xml files) is reading UTF-8 and outputting it as ISO-8859-1.

> If this is the case, you can correct it with
>
> iconv -t WINDOWS-1252 -f UTF-8 input.xml | sed -e 's/ encoding="iso-8859-1"/ encoding="UTF-8"/' > output.xml

:) now *this* is different. This replaces the ISO/CP-1252/... with U+FFFD, which is arguably an improvement.

> Gerrit

Bridger

> On 11.10.2016 21:23, Wolfgang Laun wolfgang.laun@xxxxxxxxx wrote:
>
>> The characters E2 80 99 are the UTF-8 encoding of the Unicode character
>> RIGHT SINGLE QUOTATION MARK.
>>
>> Simply changing the ISO-8859-1 in your XML file to UTF-8 should fix this.
>>
>> On 11 October 2016 at 21:00, Bridger Dyson-Smith bdysonsmith@xxxxxxxxx
>> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>>
>>> Hi all,
>>>
>>> I'm struggling with a character encoding issue (or a character
>>> representation issue maybe?): I have input XML that looks like this
>>>
>>> input.xml
>>> <?xml version="1.0" encoding="iso-8859-1"?>
>>> <documents>
>>> <document>The reality of the effect of natural ventilation in a
>>> residential attic cavity has been the topic of many debates and
>>> scholarly reports since the 1930â€™s.</document>
>>> </documents>
>>>
>>> and I would like to get it to a point where the characters are
>>> represented properly, i.e.
>>>
>>> output.xml
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <documents>
>>> <document>The reality of the effect of natural ventilation in a
>>> residential attic cavity has been the topic of many debates and
>>> scholarly reports since the 1930’s.</document>
>>> </documents>
>>>
>>> Thanks to Liam's help on irc and reading through the list archives,
>>> it seems like an identity transform should be the right step towards
>>> getting the representation corrected, but something isn't working
>>> (or I have a misunderstanding somewhere).
>>>
>>> If I apply the following identity transform with Saxon HE 9.6.0.7 in
>>> oXygen 18:
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>>> version="2.0">
>>> <xsl:output encoding="UTF-8" indent="yes"/>
>>> <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
>>> </xsl:stylesheet>
>>>
>>> I get the following result:
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <documents>
>>> <document>The reality of the effect of natural ventilation in a
>>> residential attic cavity has been the topic of many debates and
>>> scholarly reports since the 1930â€™s.</document>
>>> </documents>
>>>
>>> Could someone provide some insight into what I've done wrong here?
>>> Any help would be greatly appreciated.
>>>
>>> Best,
>>> Bridger
>>>
>>> XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
>>> EasyUnsubscribe <-list/528976> (by email)
>
> --
> Gerrit Imsieke
> Geschäftsführer / Managing Director
> le-tex publishing services GmbH
> Weissenfelser Str. 84, 04229 Leipzig, Germany
> Phone +49 341 355356 110, Fax +49 341 355356 510
> gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de
>
> Registergericht / Commercial Register: Amtsgericht Leipzig
> Registernummer / Registration Number: HRB 24930
>
> Geschäftsführer: Gerrit Imsieke, Svea Jelonek,
> Thomas Schmidt, Dr. Reinhard Vöckler
> ------------------------------------------------------------------------------
> Meet us at Frankfurt Book Fair:
> Hall 4.2, Stand L68.
> More info at http://www.le-tex.de/en/buchmesse.html
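
The two readings of the problem discussed in this thread (raw UTF-8 bytes sitting under a wrong encoding declaration, versus text that was misread and then saved as UTF-8 again) can be illustrated with a minimal Python sketch. It is not taken from the thread: the variable names and the `repair` helper are hypothetical, and CP-1252 is assumed as the misreading code page because the visible symptom contains € and ™, which strict ISO-8859-1 does not assign to bytes 0x80 and 0x99.

```python
# U+2019 RIGHT SINGLE QUOTATION MARK, the character at issue in the thread.
quote = "\u2019"

# Its UTF-8 encoding is the three bytes E2 80 99 mentioned by Wolfgang.
utf8_bytes = quote.encode("utf-8")

# Reading 1: those three bytes are decoded as CP-1252 (assumed here),
# so each byte becomes its own character: a-circumflex, euro sign, trademark sign.
misread = utf8_bytes.decode("cp1252")

# Reading 2: the misread text is then saved as UTF-8 again, so every one
# of the three characters is now itself a multibyte UTF-8 sequence.
double_encoded = misread.encode("utf-8")


def repair(data: bytes) -> bytes:
    """Undo one round of double encoding (hypothetical helper).

    Same idea as `iconv -f UTF-8 -t WINDOWS-1252`: decode the UTF-8 layer,
    then write the characters back out as single CP-1252 bytes, which
    restores the original UTF-8 byte sequence.
    """
    return data.decode("utf-8").encode("cp1252")


repaired = repair(double_encoded)

print(utf8_bytes.hex(" "))       # bytes of the correct character
print(double_encoded.hex(" "))   # bytes after the misread-and-resave round trip
print(repaired.hex(" "))         # bytes after the repair
```

If the file really contains only the bytes E2 80 99 (reading 1), correcting the encoding declaration is enough, as Wolfgang suggests. The repair step is only appropriate for reading 2, and it will raise an error on characters that have no CP-1252 representation, so it should be tried on a copy of the data.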