Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8

Subject: Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8
From: "Eliot Kimber ekimber@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 11 Oct 2016 20:19:41 -0000
Yes, then you need a more general solution--basically all your data has been
corrupted by reading UTF-8 files as though they were in a single-byte
encoding such as Windows-1252 (not plain ASCII) and then saving the result as
UTF-8, as Wolfgang surmised.

There must be a general way to undo this corruption, but I don't know of an
existing tool that does it. Basically you would need to scan the document's
text nodes for sequences of characters that, when interpreted as single
bytes, would represent the UTF-8 encoding of a Unicode character. I suspect
that's actually not that hard, but it's not a puzzle I can attempt at the
moment. We know, for example, that "â" (\u00E2) corresponds to the first byte
of a three-byte UTF-8 sequence, so searching for that and then doing
something with the two characters following would do it. There is probably a
simple mathematical relation between the mangled Unicode characters and the
bytes in the UTF-8 encoding of the original character.

Looking at the bytes of the UTF-8 encoding, the bytes for \u2019 are xE2 x80
x99

The corresponding mangled characters are: \u00E2 \u20AC \u2122

I don't see an obvious mathematical transform there but I'm also recovering
from jet lag and not at my sharpest just now.
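
For what it's worth, the three mangled characters match what the Windows-1252
code page assigns to bytes xE2, x80 and x99, so the relation may be a code
page lookup rather than arithmetic. A minimal Python sketch of that reading
follows; the choice of Windows-1252 (Python codec name "cp1252") is an
assumption about how the files were misread, not something confirmed in this
thread.

```python
# Assumption: the UTF-8 bytes were misread as Windows-1252 (cp1252).
original = "\u2019"  # RIGHT SINGLE QUOTATION MARK

utf8_bytes = original.encode("utf-8")                  # the real bytes on disk
mangled = utf8_bytes.decode("cp1252")                  # what the misreading yields
restored = mangled.encode("cp1252").decode("utf-8")    # reversing the misreading

print(" ".join(f"x{b:02X}" for b in utf8_bytes))       # xE2 x80 x99
print(" ".join(f"\\u{ord(c):04X}" for c in mangled))   # \u00E2 \u20AC \u2122
print(restored == original)                            # True
```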

Maybe somebody else sees a way to do this generally?
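
One possible general approach, sketched in Python rather than XSLT: walk the
text, map each character back to the single byte it would have come from, and
replace any run that forms a valid UTF-8 sequence. The function names
(to_byte, seq_len, repair) are made up for illustration, and the sketch
assumes the misreading used Windows-1252. Runs involving the five bytes that
Windows-1252 leaves undefined (x81, x8D, x8F, x90, x9D) are left untouched.

```python
def to_byte(ch):
    """Return the Windows-1252 byte for ch, or None if it has none."""
    try:
        return ch.encode("cp1252")[0]
    except UnicodeEncodeError:
        return None


def seq_len(lead):
    """Length of the UTF-8 sequence that starts with this lead byte, else 0."""
    if lead is None:
        return 0
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def repair(text):
    """Replace runs of characters that decode as UTF-8 when read as bytes."""
    out = []
    i = 0
    while i < len(text):
        n = seq_len(to_byte(text[i]))
        chunk = text[i:i + n]
        if n and len(chunk) == n:
            raw = [to_byte(c) for c in chunk]
            if None not in raw:
                try:
                    # Strict decoding rejects anything that is not real UTF-8.
                    out.append(bytes(raw).decode("utf-8"))
                    i += n
                    continue
                except UnicodeDecodeError:
                    pass
        # Not a mangled sequence: keep the character as it is.
        out.append(text[i])
        i += 1
    return "".join(out)


print(repair("the 1930\u00e2\u20ac\u2122s"))
```

Because each run is checked with a strict UTF-8 decode, a legitimate "â" that
is not followed by two continuation-like characters is kept unchanged.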

One thing to try would be to simply list out all the bad character sequences
to see what there is--more than one example may suggest a pattern.
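
A rough way to produce such a list, again as a Python sketch under the
assumption that the misreading used Windows-1252: build a pattern for "lead
character followed by the right number of continuation characters", count the
matches, and show the character each one would decode to. The names chars_for,
SUSPECT and survey are hypothetical.

```python
import re
from collections import Counter


def chars_for(lo, hi):
    """Characters that Windows-1252 assigns to bytes lo..hi (inclusive)."""
    found = []
    for b in range(lo, hi + 1):
        try:
            found.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            pass  # byte is undefined in Windows-1252
    return re.escape("".join(found))


CONT = "[" + chars_for(0x80, 0xBF) + "]"   # UTF-8 continuation bytes
LEAD2 = "[" + chars_for(0xC2, 0xDF) + "]"  # lead byte of a 2-byte sequence
LEAD3 = "[" + chars_for(0xE0, 0xEF) + "]"  # lead byte of a 3-byte sequence
LEAD4 = "[" + chars_for(0xF0, 0xF4) + "]"  # lead byte of a 4-byte sequence
SUSPECT = re.compile(f"{LEAD2}{CONT}|{LEAD3}{CONT}{{2}}|{LEAD4}{CONT}{{3}}")


def survey(text):
    """Return (sequence, probable original or None, count), most common first."""
    report = []
    for seq, count in Counter(SUSPECT.findall(text)).most_common():
        try:
            guess = seq.encode("cp1252").decode("utf-8")
        except UnicodeError:
            guess = None  # looks suspicious but is not valid UTF-8
        report.append((seq, guess, count))
    return report


sample = "the 1930\u00e2\u20ac\u2122s and 1940\u00e2\u20ac\u2122s, caf\u00c3\u00a9"
for seq, guess, count in survey(sample):
    print(repr(seq), "->", repr(guess), "x", count)
```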

You may find that there are few enough you can just make a brute force
replacement transform.
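
As an illustration of the brute-force idea, here is a small Python sketch. The
same table could equally drive a chain of replace() calls in XSLT. Only the
first entry comes from this thread; the other three are illustrative
Windows-1252 manglings, and the names REPLACEMENTS and brute_force_fix are
made up.

```python
# Mangled sequence -> intended character. Only the first entry is taken from
# the thread; the rest are illustrative and assume a Windows-1252 misreading.
REPLACEMENTS = {
    "\u00e2\u20ac\u2122": "\u2019",  # right single quotation mark
    "\u00e2\u20ac\u0153": "\u201c",  # left double quotation mark
    "\u00e2\u20ac\u201c": "\u2013",  # en dash
    "\u00c3\u00a9": "\u00e9",        # e with acute accent
}


def brute_force_fix(text):
    """Apply every known replacement, longest mangled sequence first."""
    for bad in sorted(REPLACEMENTS, key=len, reverse=True):
        text = text.replace(bad, REPLACEMENTS[bad])
    return text


print(brute_force_fix("the 1930\u00e2\u20ac\u2122s"))
```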

Cheers,

E.
--
Eliot Kimber
http://contrext.com



From:  "Bridger Dyson-Smith bdysonsmith@xxxxxxxxx"
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Reply-To:  <xsl-list@xxxxxxxxxxxxxxxxxxxxxx>
Date:  Tuesday, October 11, 2016 at 3:55 PM
To:  <xsl-list@xxxxxxxxxxxxxxxxxxxxxx>
Subject:  Re: [xsl] Character encoding/representation from ISO-8859-1 to
UTF-8

Hi Eliot

On Tue, Oct 11, 2016 at 3:36 PM, Eliot Kimber ekimber@xxxxxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> The characters are not just the ASCII bytes.
>
> I think you will need to match on the characters in question and replace
> them with the desired character, e.g.:
>
> <xsl:template match="text()[contains(., 'â€™')]">
>   <xsl:value-of select="replace(., 'â€™', '’')"/>
> </xsl:template>
>
> And then use a more complete identity transform that handles the text
> nodes:
>
Thank you for the response. I'm afraid I'm guilty of providing an incomplete
picture of my issue: I'm not sure which malformed(?) characters are in the
input documents. My apologies for leaving that detail out, but it seems like
it would present a fairly significant problem for doing a replace().

> Cheers,
>
> Eliot
>
Again, thanks for your time and trouble.
Bridger
>
> --
> Eliot Kimber
> http://contrext.com
>
>
>
> From:  "Bridger Dyson-Smith bdysonsmith@xxxxxxxxx"
> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
> Reply-To:  <xsl-list@xxxxxxxxxxxxxxxxxxxxxx>
> Date:  Tuesday, October 11, 2016 at 2:59 PM
> To:  <xsl-list@xxxxxxxxxxxxxxxxxxxxxx>
> Subject:  [xsl] Character encoding/representation from ISO-8859-1 to UTF-8
>
> <?xml version="1.0" encoding="iso-8859-1"?>
> <documents>
> <document>The reality of the effect of natural ventilation in a residential
> attic cavity has been the topic of many debates and scholarly reports since
> the 1930â€™s.</document>
> </documents>
> XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
> EasyUnsubscribe <-list/1230532> (by email)
