Subject: Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8
From: "Eliot Kimber ekimber@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 11 Oct 2016 20:19:41 -0000

Yes, then you need a more general solution. Basically all your data has been corrupted by reading UTF-8 files as though they were in a single-byte encoding such as ISO-8859-1 or Windows-1252 and then saving the result as UTF-8, as Wolfgang surmised.

There must be a general way to undo this corruption, but I don't know of an existing tool that would do it.

Basically you would need to scan the document's text nodes for sequences of characters that, when interpreted as single bytes, would represent the UTF-8 encoding of a Unicode character. I suspect that's actually not that hard, but it's not a puzzle I can attempt at the moment.

We know, for example, that "â" corresponds to the first byte of a three-byte UTF-8 sequence, so searching for that and then doing something with the two characters following would do it. There is probably a simple mathematical relation between the mangled Unicode characters and the bytes in the UTF-8 encoding of the original character.

Looking at the bytes of the UTF-8 encoding, the bytes for \u2019 are:

xE2 x80 x99

The corresponding mangled characters are:

\u00E2 \u20AC \u2122

I don't see an obvious mathematical transform there, but I'm also recovering from jet lag and not at my sharpest just now. Maybe somebody else sees a way to do this generally?

One thing to try would be to simply list out all the bad character sequences to see what there is; more than one example may suggest a pattern. You may find that there are few enough that you can just make a brute-force replacement transform.

Cheers,

E.
--
Eliot Kimber
http://contrext.com

From: "Bridger Dyson-Smith bdysonsmith@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Reply-To: <xsl-list@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tuesday, October 11, 2016 at 3:55 PM
To: <xsl-list@xxxxxxxxxxxxxxxxxxxxxx>
Subject: Re: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8

Hi Eliot

On Tue, Oct 11, 2016 at 3:36 PM, Eliot Kimber ekimber@xxxxxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
> The characters are not just the ASCII bytes.
>
> I think you will need to match on the characters in question and replace them
> with the desired character, e.g.:
>
> <xsl:template match="text()[contains(., 'â€™')]">
>   <xsl:value-of select="replace(., 'â€™', '’')"/>
> </xsl:template>
>
> And then use a more complete identity transform that handles the text nodes:
>

Thank you for the response. I'm afraid I'm guilty of providing an incomplete picture of my issue: I'm not sure what malformed(?) characters are in the input documents. My apologies for leaving that detail out, but it seems like that would present a fairly significant problem for doing a replace().

> Cheers,
>
> Eliot
>

Again, thank you for your time and trouble.

Bridger

>
> --
> Eliot Kimber
> http://contrext.com
>
>
>
> From: "Bridger Dyson-Smith bdysonsmith@xxxxxxxxx"
> <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
> Reply-To: <xsl-list@xxxxxxxxxxxxxxxxxxxxxx>
> Date: Tuesday, October 11, 2016 at 2:59 PM
> To: <xsl-list@xxxxxxxxxxxxxxxxxxxxxx>
> Subject: [xsl] Character encoding/representation from ISO-8859-1 to UTF-8
>
> <?xml version="1.0" encoding="iso-8859-1"?>
> <documents>
> <document>The reality of the effect of natural ventilation in a residential
> attic cavity has been the topic of many debates and scholarly reports since
> the 1930â€™s.</document>
> </documents>
> XSL-List info and archive <http://www.mulberrytech.com/xsl/xsl-list>
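
The general repair asked about in this thread can be sketched outside XSLT. The following is a minimal Python sketch, not something proposed by anyone in the thread. It assumes the intermediate single-byte encoding was Windows-1252, which is an inference from the mangled characters \u20AC and \u2122 (the Windows-1252 characters for bytes x80 and x99). The function name repair_mojibake is hypothetical. For each run of non-ASCII characters it maps the characters back to single bytes and tries to decode those bytes as UTF-8, leaving the run untouched when that fails.

```python
import re

# Runs of non-ASCII characters are the only candidates for repair.
_NON_ASCII_RUN = re.compile(r"[^\x00-\x7f]+")


def _repair_run(match):
    """Try to reinterpret one run of characters as misdecoded UTF-8 bytes."""
    run = match.group(0)
    try:
        # Assumption: the text was misread as Windows-1252 (cp1252).
        # Map each character back to its single byte, then decode as UTF-8.
        return run.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not a valid round trip, so treat the run as legitimate text.
        return run


def repair_mojibake(text):
    """Repair UTF-8 text that was decoded as cp1252 and saved again as UTF-8."""
    return _NON_ASCII_RUN.sub(_repair_run, text)


if __name__ == "__main__":
    sample = "the 1930\u00e2\u20ac\u2122s."
    # ascii() keeps the output printable on any console.
    print(ascii(repair_mojibake(sample)))  # prints 'the 1930\u2019s.'
```

The relation between the original and mangled characters is therefore a table lookup rather than arithmetic: bytes x80 through x9F map to scattered code points in Windows-1252, which is why \u20AC and \u2122 show no obvious numeric pattern. One limitation of this sketch is that a run mixing legitimate non-ASCII characters with mangled ones is left unrepaired as a whole.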