Subject: Re: [xsl] International Characters in attributes
From: Mike Brown <mike@xxxxxxxx>
Date: Sat, 10 Feb 2001 01:22:06 -0700 (MST)
> we have had problems with the encoding set in our
> documents [affecting] what was displayed in the browser

This might have to do with whether the browser has been properly
configured to detect and use the encoding that you declared. You also
do not want to just make up an encoding: the document has exactly one
actual encoding, and if you declare it to be something that it is
not, it is only natural that the browser will misinterpret it.

Since you said you are fuzzy on encoding...

--------------------------------------------------------------------------

The encoding that we are talking about here is the mapping of characters
(which are abstract) to sequences of bits (which are... less abstract).
Strictly speaking, this is a "character encoding scheme".

Take, for example, the non-breaking space character, which in HTML we
often write as "&nbsp;", a predefined (in HTML, not XML) entity reference
defined as equivalent to "&#160;", which in turn is interpreted as the
single non-breaking-space character. Different encoding schemes will
represent this character as different bit sequences.

For example, in the "iso-8859-1" encoding, the non-breaking space
character maps to the bit sequence 10100000, an 8-bit byte representing a
value that we can also easily express as decimal 160 or hex A0. But in
"utf-8", the non-breaking space maps to the bit sequence 11000010
10100000. If we interpret this as a pair of 8-bit bytes, we could say they
represent the values hex C2 followed by hex A0 (decimal 194 and 160).
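
If you want to watch this happen yourself, here is a quick sketch in
Python (just a convenient way to poke at the bytes; any modern
Python 3 will do):

    import html

    # &nbsp; and &#160; both resolve to the one character U+00A0
    nbsp = html.unescape("&nbsp;")
    print(nbsp == "\u00a0")                  # True

    print(nbsp.encode("iso-8859-1").hex())   # a0   -> one byte
    print(nbsp.encode("utf-8").hex())        # c2a0 -> two bytes

Same character, two different bit sequences, depending on the scheme.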

Now imagine you are the web browser, receiving an HTTP message containing
an HTML document. All you see in the message is a stream of bits. How do
you know what 1100001010100000 means?

If you think the document is encoded using utf-8, you'll correctly
interpret this sequence as one single NO-BREAK SPACE character (that's its
Unicode name).

If you think the document is encoded using iso-8859-1, you will
incorrectly interpret it as *two* characters: (0xC2) LATIN CAPITAL LETTER
A WITH CIRCUMFLEX followed by (0xA0) NO-BREAK SPACE.
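
You can play that browser yourself by decoding the same two bytes
under each assumption (a Python sketch again):

    data = b"\xc2\xa0"   # the utf-8 byte sequence for NO-BREAK SPACE

    print(len(data.decode("utf-8")))       # 1 -- just NO-BREAK SPACE
    print(len(data.decode("iso-8859-1")))  # 2 -- "Â" then NO-BREAK SPACE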

Where do you get info about the document's encoding? Well, there are
three places to get it (examples of the first two follow the list):

  - from the transport (e.g., one of the HTTP message headers)

  - from within the document itself (e.g., assume the document is
     us-ascii encoded, read until you find a META tag that is
     intended to mean the same thing as the HTTP header from the
     first option, then reprocess the document using whatever
     encoding was declared there); or

  - by analyzing the bit sequences in the document and making an
     educated guess
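
In concrete terms, the first two typically look like the following
(iso-8859-1 here is just an example; the charset parameter should
name the document's actual encoding):

    Content-Type: text/html; charset=iso-8859-1

    <meta http-equiv="Content-Type"
          content="text/html; charset=iso-8859-1">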

The first option is supposed to take highest precedence. In the case
of HTML documents, though, I believe the second option takes higher
precedence in practice, even though doing so violates the relevant
specs.

The last option is difficult, but browsers will make a stab at it if
properly configured. XML makes this option much more feasible for XML
parsers than HTML does for HTML user agents: an XML document that
declares its encoding always begins with the bits for "<?xml ",
possibly preceded by a byte order mark, and if it doesn't begin that
way, the parser is required to assume UTF-8 or UTF-16 (and it is an
error if the document is neither!).
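
The beginning of that autodetection logic might look like the sketch
below (simplified Python; a real XML parser follows the fuller
byte-pattern table in Appendix F of the XML 1.0 spec):

    import re

    def sniff_xml_encoding(data: bytes) -> str:
        # A byte order mark settles it immediately
        if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
            return "utf-16"
        if data[:3] == b"\xef\xbb\xbf":
            return "utf-8"
        # "<?xml" is readable enough to find an encoding declaration
        m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)', data)
        if m:
            return m.group(1).decode("ascii")
        # No BOM and no declaration: the parser must assume utf-8
        return "utf-8"

    print(sniff_xml_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?>'))
    # -> iso-8859-1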

If you use the first or second options, you have to be sure that the
encoding being declared is accurate. If you saved your document to disk
from a text editor, it exists on disk as (essentially) a sequence of bits,
so it must have been subjected to some encoding. Your editor might have
given you the option of choosing this encoding. If it didn't, then it
probably stored it using the encoding that is your operating system's
default, which can vary depending on the OS and locale (e.g.,
windows-1252, a.k.a. cp1252, on US versions of Windows). You must
declare this encoding, or an encoding that is a superset of it, to be
the encoding of the document.
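
Here is a sketch of what goes wrong when the declaration and the
reality disagree (Python again; "café" is just a handy string that
cp1252 and utf-8 encode differently):

    text = "café"
    data = text.encode("cp1252")   # how a US Windows editor might save it

    print(data.decode("cp1252"))   # café -- declaration matches reality
    data.decode("utf-8")           # UnicodeDecodeError: 0xE9 is not
                                   # valid utf-8 in this position

A strict decoder raises an error like that; a browser will usually
just render garbage instead.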

So let's say you have in your HTML document a declaration of the encoding
that was used, and that this declaration is accurate. Whether or not the
browser will actually honor this declaration and decode it appropriately
is an entirely separate matter!

You will find that in many cases, if you have not gone to the trouble of
configuring your browser to auto-detect the encoding, it will proceed
under the assumption that the document is in some default encoding that it
shipped with, or the one that your operating system uses by default.

Above and beyond this, most browsers give you the option of manually
resetting the encoding while you are viewing a page, which really means
you are choosing to *decode* the document's bit sequences according to
that particular scheme.

I saw a post on the Unicode list, I believe, from Microsoft explaining
that in one of their 3.x browsers they either had auto-detect on by
default, or they didn't allow users to override the encoding... and this
resulted in innumerable complaints from people who could no longer view
web pages with misdeclared encodings. Apparently there are a lot of
Shift-JIS documents out there claiming to be ISO-8859-1, and people needed
to be able to make the browser ignore these misdeclarations by default so
their surfing experience wouldn't be traumatic.

FWIW, http://www.hclrss.demon.co.uk/unicode/ contains some good info
that compares encoding support in various browsers.

   - Mike
____________________________________________________________________
Mike J. Brown, software engineer at            My XML/XSL resources: 
webb.net in Denver, Colorado, USA              http://skew.org/xml/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list