Subject: RE: [xsl] encoding woes: ISO-8859-1 vs. UTF-8
From: Xiaocun Xu <xiaocunxu@xxxxxxxxx>
Date: Wed, 24 Jul 2002 07:23:13 -0700 (PDT)

--- Tony Graham <Tony.Graham@xxxxxxx> wrote:
> Michael Kay wrote at 24 Jul 2002 09:05:31 +0100:
> > > > ISO-8859-1 can only encode the characters in the
> > > > range 0-255.
> > >
> > > That's what I thought as well. How did Saxon
> > > convert those two control chars into the proper
> > > encoding for “ and ” even though the input
> > > XML was marked as encoded in ISO-8859-1? I was fully
> > > expecting the import would fail, but somehow it was
> > > successful.
> >
> > I have no idea. This isn't done by Saxon, it's done by the
> > XML parser. If you were using the default parser (AElfred),
> > I think that it actually accepts bytes x80-x9F with
> > encoding="iso-8859-1", converting them into characters
> > x80-x9F.
>
> Windows code pages, e.g. CP 1252, typically encode #x201C,
> LEFT DOUBLE QUOTATION MARK, and #x201D, RIGHT DOUBLE
> QUOTATION MARK, as 0x93 and 0x94, respectively.
>
> The Windows 2000 "Character Map" utility, for example, shows
> the characters with those byte values for their encoding when
> the "Character set" is "Windows: Western" or "Windows:
> Central Europe", etc.
>
> #x201C and #x201D aren't part of ISO 8859-1, so when the
> encoding really is ISO 8859-1 and not CP 1252 (or similar),
> then the only way to represent #x201C and #x201D is as
> numeric character references: &#x201C; (or &#8220;) and
> &#x201D; (or &#8221;).
>
> It appears that AElfred is accommodating the extras in the
> Windows code page even when the input is labelled ISO-8859-1.
> Since it used to be said (and may still be true) that some
> Microsoft software labelled CP 1252 text as ISO 8859-1
> (although I thought that Outlook was the main culprit) and
> since "real" ISO 8859-1 isn't going to use the byte values
> for the CP 1252 extras (until we get NEL, that is), then it's
> forgiving of AElfred to accept the extras. It's just that
> this "principle of least surprise" action surprised several
> of us.
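
The difference between the two encodings described above can be checked with a few lines of code. This is only a minimal sketch in Python, which nobody in this thread is using; the sample bytes and variable names are made up for illustration, and it says nothing about what AElfred does internally. It uses only the standard codecs.

```python
# Bytes 0x93 and 0x94 around a word, as a Windows application might save them.
raw = b"\x93quoted\x94"

# Read as CP 1252, the two bytes are the curly quotation marks.
as_cp1252 = raw.decode("cp1252")
cp1252_points = [hex(ord(as_cp1252[0])), hex(ord(as_cp1252[-1]))]

# Read as strict ISO 8859-1, each byte becomes the code point of the
# same value, i.e. the C1 control characters U+0093 and U+0094.
as_latin1 = raw.decode("iso-8859-1")
latin1_points = [hex(ord(as_latin1[0])), hex(ord(as_latin1[-1]))]

# Going the other way, U+201C has no byte in ISO 8859-1 at all.
try:
    "\u201c".encode("iso-8859-1")
    strict_encode_ok = True
except UnicodeEncodeError:
    strict_encode_ok = False

# The only lossless fallback is a numeric character reference.
with_ncrs = "\u201cquoted\u201d".encode("iso-8859-1", errors="xmlcharrefreplace")

print(cp1252_points)     # ['0x201c', '0x201d']
print(latin1_points)     # ['0x93', '0x94']
print(strict_encode_ok)  # False
print(with_ncrs)         # b'&#8220;quoted&#8221;'
```

Python's `xmlcharrefreplace` handler emits the decimal form of the reference; the hexadecimal form is equivalent to an XML parser.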

Thanks for the explanation, that makes a lot of sense. It sounds
like the entire MS Office suite is the culprit, if not more. If
this is only allowed by AElfred, I guess I really will have to
resolve this problem when I upgrade to Saxon7.x and XercesJ2.

> > > Good point. For export output, I changed encoding to
> > > UTF-8, that seems to have resolved the problem, now
> > > export is successful. Open the exported CSV in Hex
> > > editor, those two chars are shown as Hex 93/94,
> > > respectively.
> >
> > Now I really am puzzled.
>
> I'm puzzled too. #x201C is not 0x93 in UTF-8.

Very strange indeed. I checked the hex values stored in SQLServer
after import: both chars are stored as , the quotation mark in
ISO-8859-1. How did it transpose these characters to ] and ^ on
export? Even though I marked the exported proprietary XML as UTF-8,
Saxon/AElfred had no problem processing it.

To use UTF-8 consistently, I guess I need to run native2ascii on the
imported Excel CSV before I start the XSLT transformation. But what
happens on export? When I open the CSV in a hex editor it uses one
byte per char, so how could the export generate a CSV with “ and ”
chars?

Thanks,
Xiaocun

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
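
For anyone following the UTF-8 part of the puzzle, here is a small Python sketch (again, Python and the helper name `is_valid_utf8` are illustrative assumptions, not part of the original setup). It shows what the two quotation marks look like as real UTF-8 bytes, and that a file holding single bytes 0x93/0x94 cannot be well-formed UTF-8, which suggests, without proving it, that the exported file was written in a Windows code page despite the label.

```python
# The two curly quotation marks as Unicode characters.
left, right = "\u201c", "\u201d"

# In UTF-8 each of them takes three bytes, none of which is 0x93 or 0x94.
utf8_left = left.encode("utf-8")
utf8_right = right.encode("utf-8")


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A one-byte-per-character field like the one seen in the hex editor.
single_byte_field = b"\x93text\x94"

print(utf8_left.hex(), utf8_right.hex())         # e2809c e2809d
print(is_valid_utf8(single_byte_field))          # False
print(is_valid_utf8(utf8_left + b"text" + utf8_right))  # True
```

A strict UTF-8 reader should reject the single-byte field, so a tool that reads it without complaint is either lenient or not really decoding it as UTF-8.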