Re: [xsl] How to read the encoding of an XML document

Subject: Re: [xsl] How to read the encoding of an XML document
From: "Christopher R. Maden" <crism@xxxxxxxxx>
Date: Thu, 25 Oct 2001 14:46:54 -0700
At 14:18 25-10-2001, James Garriss wrote:
> I've been looking at a lot of European web pages, viewing source to see what charset they define in the HTML META tag. The majority use iso-8859-1, but a few don't. Most notably Turkey and Greece have character sets that are quite different. How do I determine if UTF-16 (or UTF-8) will work for those languages?

Time for the primer again.


A character is an abstract notion, like "Latin capital letter A".

A character repertoire is a collection of characters - like "Latin upper-case letters". Different languages require different character repertoires.

A character set is an ordered, numbered character repertoire. ISO 8859-1 is one such character set, assigning numbers 0-255 to 256 characters. Its repertoire covers nearly all of the characters needed for western European languages like French, Spanish, German, and Italian, as well as English, Icelandic, Swedish, Norwegian, and Dutch. There are other ISO 8859 character sets that cover characters needed by other languages like Turkish, Polish, Greek, Russian, Hebrew, and Arabic.
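To make "ordered, numbered" concrete, here is a minimal sketch in Python (an assumption; the thread does not name a language) that relies only on the codecs bundled with the standard library. It looks up the same number, 0xFD, in three ISO 8859 character sets. The helper name `decode_byte` is hypothetical.

```python
import unicodedata

def decode_byte(value, charset):
    """Return the character that `charset` assigns to the number `value`."""
    return bytes([value]).decode(charset)

# The same number names a different character in each character set.
for charset in ("iso-8859-1", "iso-8859-7", "iso-8859-9"):
    ch = decode_byte(0xFD, charset)
    print(charset, ch, unicodedata.name(ch))
```

This is why a page labelled iso-8859-1 but actually written in the Turkish or Greek set shows the wrong letters: the numbers are the same, the characters behind them are not.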

Unicode is also a character set. It assigns numbers in the range 0 to 0x10FFFF (about 1.1 million possible values) to a whole lot of characters. Its repertoire includes all of the characters covered in other national and international standards, including all of the ISO 8859 sets.
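As a small illustration, assuming Python 3 (whose strings are sequences of Unicode characters), the sketch below prints the Unicode number and official name for one English, two Turkish, and one Greek letter. The helper name `describe` is hypothetical.

```python
import unicodedata

def describe(ch):
    """Return (code point written as U+XXXX, official Unicode name)."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)

# Letters from English, Turkish, and Greek all live in the one character set.
for ch in "Aığα":
    print(ch, *describe(ch))
```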

An encoding is a mapping between bit patterns and the numbers of a character set. UTF-8 and UTF-16 are encodings of Unicode. In a sense, ISO 8859-1 and its kin are also encodings of Unicode, but ones that can represent only a small subset of the characters.
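A rough sketch of the difference, again assuming Python 3 and its standard codecs: the same two characters are turned into bytes under several encodings, and the encodings that lack one of the characters are reported instead. The helper name `can_encode` and the sample string are illustrative only.

```python
def can_encode(text, encoding):
    """True if every character of `text` is representable in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

sample = "ğα"  # Turkish g with breve, Greek alpha

for encoding in ("utf-8", "utf-16-be", "iso-8859-1", "iso-8859-7", "iso-8859-9"):
    if can_encode(sample, encoding):
        # Same characters, different bit patterns per encoding.
        print(encoding, sample.encode(encoding).hex())
    else:
        print(encoding, "cannot represent this text")
```

Each ISO 8859 set handles its own language (the Turkish set has the g with breve, the Greek set has alpha), but only the Unicode encodings handle both in one document.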

In short: Unless you are working in Klingon, Minbari, or Silvestri, Unicode covers the characters you need in its repertoire. UTF-8 and UTF-16 are both capable of representing all of the characters in Unicode. All XML parsers are required to read UTF-8 and UTF-16 data.
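One way to check this for yourself is to write a tiny document in each encoding and parse it back. The sketch below assumes Python 3 and its standard-library `xml.etree.ElementTree` parser; the helper name `roundtrip` and the sample text are hypothetical, and the helper does no escaping, so the text must not contain markup characters.

```python
import xml.etree.ElementTree as ET

def roundtrip(text, encoding):
    """Write <p>text</p> in `encoding`, parse it back, return the parsed text.

    Sketch only: `text` is inserted unescaped, so it must not contain
    '<' or '&'.
    """
    declaration = f'<?xml version="1.0" encoding="{encoding}"?>'
    data = (declaration + f"<p>{text}</p>").encode(encoding)
    return ET.fromstring(data).text

sample = "Türkçe ελληνικά"  # Turkish and Greek in one document
for encoding in ("UTF-8", "UTF-16"):
    print(encoding, roundtrip(sample, encoding) == sample)
```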

Use them. Know them. Love them.

-Chris
--
Christopher R. Maden, Principal Consultant, HMM Consulting Int'l, Inc.
DTDs/schemas - conversion - ebooks - publishing - Web - B2B - training
<URL: http://www.hmmci.com/ > <URL: http://crism.maden.org/consulting/ >
PGP Fingerprint: BBA6 4085 DED0 E176 D6D4  5DFC AC52 F825 AFEC 58DA


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


