Subject: Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8
From: "Lizzi, Vincent vincent.lizzi@xxxxxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 11 Oct 2016 19:55:01 -0000
Hi Bridger,

You may be able to use xsl:character-map to map characters that are not transforming correctly to their proper Unicode code points. I've seen plenty of instances where the input files declare one character encoding but actually contain characters in a different encoding. If this is what you're facing, it can be helpful to start by analyzing character occurrences across the set of input files. You can eliminate characters in the ISO646-US range straight off, then eliminate other character codes that transform correctly, and then focus on creating a mapping for the remaining character codes or character code sequences.

Some Perl modules that can be helpful when dealing with unexpected character encodings are Encoding::FixLatin, Encode::Guess, and Text::FixEOL.

Cheers,
Vincent

From: Bridger Dyson-Smith bdysonsmith@xxxxxxxxx [mailto:xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx]
Sent: Tuesday, October 11, 2016 3:09 PM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8

Hi all,

I'm struggling with a character encoding issue (or a character representation issue maybe?): I have input XML that looks like this

input.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<documents>
  <document>The reality of the effect of natural ventilation in a residential attic cavity has been the topic of many debates and scholarly reports since the 1930C"b,b"s.</document>
</documents>

and I would like to get it to a point where the characters are represented properly, i.e.
output.xml
<?xml version="1.0" encoding="UTF-8"?>
<documents>
  <document>The reality of the effect of natural ventilation in a residential attic cavity has been the topic of many debates and scholarly reports since the 1930’s.</document>
</documents>

Thanks to Liam's help on irc and reading through the list archives, it seems like an identity transform should be the right step towards getting the representation corrected, but something isn't working (or I have a misunderstanding somewhere).

If I apply the following identity transform with Saxon HE 9.6.0.7 in oXygen 18:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">
  <xsl:output encoding="UTF-8" indent="yes"/>
  <xsl:template match="/"><xsl:copy-of select="/"/></xsl:template>
</xsl:stylesheet>

I get the following result:

<?xml version="1.0" encoding="UTF-8"?>
<documents>
  <document>The reality of the effect of natural ventilation in a residential attic cavity has been the topic of many debates and scholarly reports since the 1930C"b,b"s.</document>
</documents>

Could someone provide some insight into what I've done wrong here? Any help would be greatly appreciated.

Best,
Bridger
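
The character-occurrence analysis Vincent suggests could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the function names `tally_non_ascii` and `describe` are hypothetical, and the sample string is constructed here (the UTF-8 bytes of a right single quotation mark read as ISO-8859-1); it is not the actual byte content of Bridger's file. In practice each input file would be read using its declared encoding and passed in as text.

```python
from collections import Counter
import unicodedata


def tally_non_ascii(texts):
    """Count every character outside the ASCII (ISO646-US) range."""
    counts = Counter()
    for text in texts:
        counts.update(ch for ch in text if ord(ch) > 0x7F)
    return counts


def describe(counts):
    """Return (code point, Unicode name, count) rows, most frequent first."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), n)
        for ch, n in counts.most_common()
    ]


# Hypothetical sample (an assumption, not taken from the real input):
# the UTF-8 bytes E2 80 99 of U+2019 decoded as ISO-8859-1.
sample = "1930\u2019s".encode("utf-8").decode("iso-8859-1")

rows = describe(tally_non_ascii([sample]))
for row in rows:
    print(row)
```

Characters that turn up in the tally but still transform correctly can be crossed off; whatever remains is the candidate list for a mapping. Code points in the U+0080 to U+009F control range, as in this sample, are often a hint that the declared encoding does not match the bytes actually stored in the file.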