Re: [xsl] 8bit ascii encoding

From: Mike Brown <mike@xxxxxxxx>
Date: Fri, 23 Aug 2002 09:43:31 -0600 (MDT)
David Carlisle wrote:
> > If my chars are two bytes each then Im using utf-16, but utf-8 can
> > consist of 1-5bytes per char... I think I need to read some more.
> grrr.. Your char(acters) don't have any bytes they are just characters
> (aka unicode code points in the range hex 0 - 10FFFF).

(I'll just supplement what you said, for Andrew's benefit)

i.e. a character is abstract: it is "the idea of the Latin script's letter e
with acute accent", not an actual glyph of an e with an accent that looks like
a short forward slash, and not a particular byte (or series of bytes). Unicode
just gives these ideas names and numbers, e.g. (hex) E9 = LATIN SMALL LETTER E
WITH ACUTE. Encodings like us-ascii, iso-8859-1, utf-8, and utf-16 are what
give you representations of these numbers (and hence the unambiguous idea of
the characters) as bits/bytes in specific sequences. Most of your problems
show up when those bytes are decoded and interpreted to produce a visual
representation of the characters on your computer's display device.
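
To make the "one number, many byte sequences" point concrete, here is a
minimal Python sketch (Python is just my choice for illustration; the variable
names are made up). It takes the single code point E9 and shows the bytes that
a few encodings produce for it:

```python
import unicodedata

ch = "\u00e9"  # the abstract character, written by code point

# Unicode gives the idea a number and a name.
code_point = ord(ch)
name = unicodedata.name(ch)

# Each encoding turns that one number into a different byte sequence.
encodings = ["iso-8859-1", "utf-8", "utf-16-be", "utf-16-le"]
as_bytes = {enc: ch.encode(enc).hex() for enc in encodings}

print(hex(code_point), name)
for enc, hexbytes in as_bytes.items():
    print(enc, hexbytes)

# us-ascii has no representation for this character at all.
try:
    ch.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The character never changes; only its serialized form does.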

Also, I'd point out that in Java the 'char' datatype in theory represents a
Unicode character by code point (just a number in the above range), but in
fact it is a 16-bit utf-16 code value, so a code point above FFFF takes two
of them (a surrogate pair). In C, 'char' is typically a single byte, so it
usually holds one byte of whatever encoding is in use rather than a whole
character (someone C-literate can clarify) ... so saying 'char' can be
misleading sometimes.

> utf8
>   encode the character as a sequence of 1-5 bytes, using a simple and
>   funky bit slicing mechanism that has the following properties.

This little conversion chart can help you visualize it without getting into
the nitty-gritty of bit slicing:
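
As a rough sketch of the same idea, the lead-byte and continuation-byte
patterns can be written out by hand in Python and checked against the built-in
encoder. This assumes the 1-4 byte forms, which are enough for every code
point up to 10FFFF; utf8_encode is a hypothetical helper name, not a library
function:

```python
def utf8_encode(cp: int) -> bytes:
    """Hand-rolled UTF-8 encoder for one code point (illustrative helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode code point")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare against Python's own encoder for one sample per length.
for cp in (0x41, 0xE9, 0x20AC, 0x10000):
    assert utf8_encode(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {utf8_encode(cp).hex()}")
```

Note that anything up to 7F comes out as the same single byte it has in ascii,
which is why ascii text is already valid UTF-8.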

A very common problem is that people look at UTF-8 encoded HTML through an
editor, browser, or terminal window that is unaware that the encoding is
UTF-8. The software assumes that the bytes it is interpreting are iso-8859-1
or windows-1252 encoded (or some such; in any case it's just wrong). The user
sees all the ASCII characters just fine, but an "accented" character, or
anything else in Unicode above code point 127, shows up as two or more
characters in the editor/browser/terminal.

In the case of an editor/terminal you just need to get a smarter editor, or
learn to live with the fact that é (Unicode xE9) is going to look like Ã©,
because its UTF-8 bytes are C3 A9, which iso-8859-1 reads as Ã and ©.
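
A small Python sketch of that misreading, assuming the unaware software
decodes as iso-8859-1 (the variable names are only for illustration):

```python
text = "\u00e9"                      # é, code point E9
raw = text.encode("utf-8")           # the bytes actually in the file
misread = raw.decode("iso-8859-1")   # what the unaware software displays
print(raw.hex(), misread)

# The damage is only cosmetic: as long as the bytes are untouched,
# undoing the wrong decoding gives the original character back.
repaired = misread.encode("iso-8859-1").decode("utf-8")
```

So the file itself is fine; only the viewer's assumption is wrong.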

For a browser looking at HTML you need to make sure there is an appropriate
META tag in the document head, or else make sure it is being served with the
right Content-Type: text/html;charset=utf-8. And then you also have to make
sure the browser has been configured to honor this info; so many pages have
misdeclared encodings that the browser makers have had to let the user force
the assumed encoding.
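
For what it's worth, here is a hedged Python sketch of what "honoring" the
declared encoding amounts to on the receiving side. It borrows the standard
library's email header parser to pull the charset out of a Content-Type
value; declared_charset is a hypothetical helper, and the iso-8859-1 fallback
is my assumption about what a client might guess when nothing is declared:

```python
from email.message import Message

def declared_charset(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or a fallback."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

body = "caf\u00e9".encode("utf-8")   # bytes as they arrive on the wire

charset = declared_charset("text/html;charset=utf-8")
decoded = body.decode(charset)       # decoded with the declared encoding

fallback = declared_charset("text/html")   # nothing declared, so we guess
```

With the declaration the text decodes cleanly; with the fallback guess the
same bytes would be misread.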

> > At the moment, Im using an xml output method with ascii encoding, and
> > telling IE the encoding is utf-8 (in the meta),
> So you are sitting on a time bomb

Nah. In theory that shouldn't present a problem. If the output really is ascii
(one 8-bit byte per character, and the high bit is always 0), the browser can
safely (though wrongly) assume any encoding that's an ascii superset, which is
pretty much anything except utf-16. Technically, though, it is a
misdeclaration, and XML 1.0 says that's a fatal error.
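
A quick Python check of the superset claim, as a sketch with a made-up sample
string: pure ascii bytes decode to the same text under several ascii-superset
encodings, but not under utf-16.

```python
sample = "<p>plain ascii</p>"
ascii_bytes = sample.encode("ascii")

# Every byte has its high bit clear.
high_bit_clear = all(b < 0x80 for b in ascii_bytes)

# Any ascii superset reads these bytes as the same characters.
supersets = ["utf-8", "iso-8859-1", "windows-1252"]
decoded = {enc: ascii_bytes.decode(enc) for enc in supersets}
same_everywhere = all(s == sample for s in decoded.values())

# utf-16 is not an ascii superset: it pairs the bytes up into other
# characters entirely.
as_utf16 = ascii_bytes.decode("utf-16-le", errors="replace")
```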

I suspect his real problem is that his output is not really ascii. Either it's
utf-16, which is the recurring FAQ about MSXML when you buffer the output in a
string rather than a (document?) object, or it's utf-8 and his browser is
ignoring the meta.

