Subject: RE: [xsl] encoding woes: ISO-8859-1 vs. UTF-8
From: Tony Graham <Tony.Graham@xxxxxxx>
Date: Wed, 24 Jul 2002 11:10:59 +0100
Michael Kay wrote at 24 Jul 2002 09:05:31 +0100:
> > > ISO-8859-1 can only encode the characters in the
> > > range 0-255.
> >
> > That's what I thought as well. How did saxon
> > converted those two control chars into the proper
> > encoding for “ and ” even though the input
> > XML was marked as encoding in ISO-8859-1? I was fully
> > expecting the import would fail, but somehow it was successful.
>
> I have no idea. This isn't done by Saxon, it's done by the XML parser.
> If you were using the default parser (AElfred), I think that it actually
> accepts bytes x80-x9F with encoding="iso-8859-1", converting them into
> characters x80-x9F.

Windows code pages, e.g. CP 1252, typically encode #x201C, LEFT DOUBLE
QUOTATION MARK, and #x201D, RIGHT DOUBLE QUOTATION MARK, as 0x93 and
0x94, respectively. The Windows 2000 "Character Map" utility, for
example, shows the characters with those byte values for their
encoding when the "Character set" is "Windows: Western" or "Windows:
Central Europe", etc.

#x201C and #x201D aren't part of ISO 8859-1, so when the encoding
really is ISO 8859-1 and not CP 1252 (or similar), then the only way
to represent #x201C and #x201D is as numeric character references:
&#x201C; (or &#8220;) and &#x201D; (or &#8221;).

It appears that AElfred is accommodating the extras in the Windows
code page even when the input is labelled ISO-8859-1. Since it used to
be said (and may still be true) that some Microsoft software labelled
CP 1252 text as ISO 8859-1 (although I thought that Outlook was the
main culprit) and since "real" ISO 8859-1 isn't going to use the byte
values for the CP 1252 extras (until we get NEL, that is), then it's
forgiving of AElfred to accept the extras. It's just that this
"principle of least surprise" action surprised several of us.

> > Good point. For export output, I changed encoding to
> > UTF-8, that seems to have resolved the problem, now
> > export is successful. Open the exported CSV in Hex
> > editor, those two chars are shown as Hex 93/94,
> > respectively.
>
> Now I really am puzzled.

I'm puzzled too. #x201C is not 0x93 in UTF-8.

Regards,

Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin                mailto:tony.graham@xxxxxxx
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
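
The byte values discussed in this message can be checked with a short
script. What follows is only a minimal sketch, assuming Python 3 and
its standard "cp1252", "iso-8859-1" and "utf-8" codecs; it says nothing
about what AElfred or Saxon actually do internally, and the variable
names are made up for illustration.

```python
# Bytes 0x93 and 0x94, the values reported from the hex editor.
raw = b"\x93\x94"

# Read as CP 1252, they are the curly double quotation marks.
as_cp1252 = raw.decode("cp1252")
cp1252_points = [f"U+{ord(c):04X}" for c in as_cp1252]
print(cp1252_points)        # ['U+201C', 'U+201D']

# Read as real ISO 8859-1, they are C1 control characters.
as_latin1 = raw.decode("iso-8859-1")
latin1_points = [f"U+{ord(c):04X}" for c in as_latin1]
print(latin1_points)        # ['U+0093', 'U+0094']

# In UTF-8, #x201C and #x201D take three bytes each, none of them 0x93 or 0x94.
utf8_hex = "\u201c\u201d".encode("utf-8").hex()
print(utf8_hex)             # e2809ce2809d

# ISO 8859-1 cannot encode the two characters directly; Python's
# "xmlcharrefreplace" error handler falls back to decimal character references.
as_refs = "\u201c\u201d".encode("iso-8859-1", errors="xmlcharrefreplace")
print(as_refs)              # b'&#8220;&#8221;'
```

A file that shows 93/94 in a hex editor is therefore consistent with
CP 1252 (or with ISO 8859-1 control characters), not with a UTF-8
encoding of the two quotation marks.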