RE: [xsl] encoding woes: ISO-8859-1 vs. UTF-8

Subject: RE: [xsl] encoding woes: ISO-8859-1 vs. UTF-8
From: Tony Graham <Tony.Graham@xxxxxxx>
Date: Wed, 24 Jul 2002 11:10:59 +0100
Michael Kay wrote at 24 Jul 2002 09:05:31 +0100:
 > > > ISO-8859-1 can only encode the characters in the
 > > > range 0-255.
 > > 
 > > That's what I thought as well.  How did Saxon
 > > convert those two control chars into the proper
 > > encoding for &#8220; and &#8221; even though the input
 > > XML was marked as encoded in ISO-8859-1?  I was fully
 > > expecting the import would fail, but somehow it was successful.
 > 
 > I have no idea. This isn't done by Saxon, it's done by the XML parser.
 > If you were using the default parser (AElfred), I think that it actually
 > accepts bytes x80-x9F with encoding="iso-8859-1", converting them into
 > characters x80-x9F.

Windows code pages, e.g. CP 1252, typically encode #x201C, LEFT DOUBLE
QUOTATION MARK, and #x201D, RIGHT DOUBLE QUOTATION MARK, as 0x93 and
0x94, respectively.
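
Those byte values are easy to check.  A minimal sketch, assuming Python 3
and its built-in "cp1252" codec (my choice for illustration, not something
used in this thread):

```python
# Encode the two curly quotes with Python's built-in Windows-1252 codec.
left = "\u201c"   # LEFT DOUBLE QUOTATION MARK
right = "\u201d"  # RIGHT DOUBLE QUOTATION MARK

cp1252_left = left.encode("cp1252")
cp1252_right = right.encode("cp1252")

print(cp1252_left.hex(), cp1252_right.hex())  # 93 94
```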

The Windows 2000 "Character Map" utility, for example, shows the
characters with those byte values for their encoding when the
"Character set" is "Windows: Western" or "Windows: Central Europe",
etc.

#x201C and #x201D aren't part of ISO 8859-1, so when the encoding
really is ISO 8859-1 and not CP 1252 (or similar), then the only way
to represent #x201C and #x201D is as numeric character references:
&#x201C; (or &#8220;) and &#x201D; (or &#8221;).
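
To illustrate the point, here is a small sketch, again assuming Python 3
(not a tool from this thread).  Strict encoding to ISO 8859-1 fails for
the two quotes, while the standard "xmlcharrefreplace" error handler falls
back to decimal numeric character references:

```python
text = "\u201cquoted\u201d"

# Strict ISO 8859-1 encoding cannot represent U+201C or U+201D.
try:
    text.encode("iso-8859-1")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# The xmlcharrefreplace handler substitutes decimal character references.
escaped = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(strict_ok)  # False
print(escaped)    # b'&#8220;quoted&#8221;'
```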

It appears that AElfred is accommodating the extras in the Windows
code page even when the input is labelled ISO-8859-1.  It used to be
said (and may still be true) that some Microsoft software labelled
CP 1252 text as ISO 8859-1 (although I thought that Outlook was the
main culprit), and "real" ISO 8859-1 isn't going to use the byte
values for the CP 1252 extras (until we get NEL, that is), so it's
forgiving of AElfred to accept the extras.  It's just that this
"principle of least surprise" action surprised several of us.
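
The difference between the two readings of the same bytes can be sketched
as follows.  This assumes Python 3's built-in codecs and says nothing
about what AElfred itself does internally; it only shows what each
encoding label means for bytes 0x93 and 0x94:

```python
raw = bytes([0x93, 0x94])  # the two byte values discussed above

# Strict ISO 8859-1 maps every byte to the code point of the same value,
# so these become the C1 control characters U+0093 and U+0094.
as_latin1 = raw.decode("iso-8859-1")

# Windows-1252 assigns the same bytes to the curly quotes.
as_cp1252 = raw.decode("cp1252")

print([f"U+{ord(c):04X}" for c in as_latin1])  # ['U+0093', 'U+0094']
print([f"U+{ord(c):04X}" for c in as_cp1252])  # ['U+201C', 'U+201D']
```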

 > > Good point.  For export output, I changed encoding to
 > > UTF-8, that seems to have resolved the problem, now
 > > export is successful.  Open the exported CSV in Hex
 > > editor, those two chars are shown as Hex 93/94,
 > > respectively.
 > > 
 > Now I really am puzzled.

I'm puzzled too. #x201C is not 0x93 in UTF-8; it is the three-byte
sequence 0xE2 0x80 0x9C.
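
For reference, a short sketch (assuming Python 3.8 or later, not a tool
from this thread) of what the UTF-8 bytes for the two quotes look like,
and of the fact that a lone 0x93 byte is not well-formed UTF-8 at all:

```python
# UTF-8 encodes each curly quote as a three-byte sequence.
utf8_left = "\u201c".encode("utf-8")
utf8_right = "\u201d".encode("utf-8")
print(utf8_left.hex(" "), utf8_right.hex(" "))  # e2 80 9c e2 80 9d

# A bare 0x93 is a continuation byte with no lead byte, so it cannot
# be decoded as UTF-8.
try:
    b"\x93".decode("utf-8")
    lone_93_is_valid = True
except UnicodeDecodeError:
    lone_93_is_valid = False
print(lone_93_is_valid)  # False
```

So a hex dump showing single 93/94 bytes suggests the file was not
actually written as UTF-8, whatever the output encoding was set to.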

Regards,


Tony Graham
------------------------------------------------------------------------
XML Technology Center - Dublin                mailto:tony.graham@xxxxxxx
Sun Microsystems Ireland Ltd                       Phone: +353 1 8199708
Hamilton House, East Point Business Park, Dublin 3            x(70)19708

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

