Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8

Subject: Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8
From: "Lizzi, Vincent vincent.lizzi@xxxxxxxxxxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 11 Oct 2016 19:55:01 -0000

Hi Bridger,

You may be able to use xsl:character-map to map characters that are not
transforming correctly into their proper Unicode code points.
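
As a minimal sketch of what that could look like (XSLT 2.0, so it should work
with Saxon HE): the stylesheet below assumes, purely for illustration, that the
problem characters are Windows-1252 curly quotes that were read as ISO-8859-1
and so arrive as the control characters U+0091 to U+0094. The map name
"fix-cp1252" and the choice of characters are my assumptions, not something
taken from your files.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            version="2.0">
            <!-- "fix-cp1252" is a hypothetical name for the map declared below -->
            <xsl:output encoding="UTF-8" indent="yes"
                        use-character-maps="fix-cp1252"/>
            <!-- Assumed mapping: C1 control characters to the Windows-1252
                 punctuation they most likely stand for -->
            <xsl:character-map name="fix-cp1252">
                        <xsl:output-character character="&#x91;" string="&#x2018;"/>
                        <xsl:output-character character="&#x92;" string="&#x2019;"/>
                        <xsl:output-character character="&#x93;" string="&#x201C;"/>
                        <xsl:output-character character="&#x94;" string="&#x201D;"/>
            </xsl:character-map>
            <!-- Copy the input unchanged; the map is applied at serialization -->
            <xsl:template match="/">
                        <xsl:copy-of select="/"/>
            </xsl:template>
</xsl:stylesheet>
```

A character map is applied only when the result is serialized, and each
xsl:output-character entry matches a single character, so multi-character
sequences would need something else, such as replace() on the text nodes.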

I've seen plenty of instances where the input files declare one character
encoding but actually contain characters in a different encoding. If this is
what you're facing, it can be helpful to start by analyzing the character
occurrences in the set of input files. You can eliminate characters in the
ISO646-US (ASCII) range straight off, then eliminate other character codes that
transform correctly, and then focus on creating a mapping for the remaining
character codes or character code sequences.
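
One way to do that kind of analysis without leaving XSLT is sketched below. It
assumes XSLT 2.0 and a single input document, and prints one line per distinct
non-ASCII character found in text and attribute nodes: the decimal code point,
the character itself, and the number of occurrences, separated by tabs. The
report layout is only an illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            version="2.0">
            <xsl:output method="text" encoding="UTF-8"/>
            <xsl:template match="/">
                        <!-- Gather the code points of all text and attribute nodes,
                             dropping everything in the ASCII range (0 to 127) -->
                        <xsl:for-each-group
                                    select="for $n in (//text() | //@*)
                                            return string-to-codepoints($n)[. gt 127]"
                                    group-by=".">
                                    <!-- Sort groups by code point (values are integers) -->
                                    <xsl:sort select="."/>
                                    <!-- Decimal code point -->
                                    <xsl:value-of select="."/>
                                    <xsl:text>&#9;</xsl:text>
                                    <!-- The character itself -->
                                    <xsl:value-of select="codepoints-to-string(.)"/>
                                    <xsl:text>&#9;</xsl:text>
                                    <!-- Number of occurrences in this document -->
                                    <xsl:value-of select="count(current-group())"/>
                                    <xsl:text>&#10;</xsl:text>
                        </xsl:for-each-group>
            </xsl:template>
</xsl:stylesheet>
```

The report shows the characters as the XML parser decoded them under the
declared encoding, which is what the transformation itself sees. Running it
over each input file and comparing the reports should show which code points
need a mapping.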

Some Perl modules that can be helpful when dealing with unexpected character
encodings are Encoding::FixLatin, Encode::Guess, and Text::FixEOL.

Cheers,
Vincent


From: Bridger Dyson-Smith bdysonsmith@xxxxxxxxx
[mailto:xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx]
Sent: Tuesday, October 11, 2016 3:09 PM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8

Hi all,

I'm struggling with a character encoding issue (or a character representation
issue maybe?): I have input XML that looks like this

input.xml
<?xml version="1.0" encoding="iso-8859-1"?>
<documents>
            <document>The reality of the effect of natural ventilation in a
residential attic cavity has been the topic of many debates and scholarly
reports since the 1930C"b,b"s.</document>
</documents>

and I would like to get it to a point where the characters are represented
properly, i.e.

output.xml
<?xml version="1.0" encoding="UTF-8"?>
<documents>
            <document>The reality of the effect of natural ventilation in a
residential attic cavity has been the topic of many debates and scholarly
reports since the 1930’s.</document>
</documents>

Thanks to Liam's help on irc and reading through the list archives, it seems
like an identity transform should be the right step towards getting the
representation corrected, but something isn't working (or I have a
misunderstanding somewhere).

If I apply the following identity transform with Saxon HE 9.6.0.7 in oXygen
18:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
            version="2.0">
            <xsl:output encoding="UTF-8" indent="yes"/>
            <xsl:template match="/">
                        <xsl:copy-of select="/"/>
            </xsl:template>
</xsl:stylesheet>

I get the following result:
<?xml version="1.0" encoding="UTF-8"?>
<documents>
             <document>The reality of the effect of natural ventilation in a
residential attic cavity has been the topic of many debates and scholarly
reports since the 1930C"b,b"s.</document>
</documents>

Could someone provide some insight into what I've done wrong here? Any help
would be greatly appreciated.

Best,
Bridger
