At 14:18 25-10-2001, James Garriss wrote:
> I've been looking at a lot of European web pages, viewing source to see
> what charset they define in the HTML META tag. The majority use
> iso-8859-1, but a few don't. Most notably Turkey and Greece have
> character sets that are quite different. How do I determine if UTF-16 (or
> UTF-8) will work for those languages?
Time for the primer again.
A character is an abstract notion, like "Latin capital letter A".
A character repertoire is a collection of characters - like "Latin
upper-case letters". Different languages require different character
repertoires.
A character set is an ordered, numbered character repertoire. ISO 8859-1
is one such character set, assigning numbers 0-255 to 256 characters. Its
repertoire covers nearly all of the characters needed for western European
languages like French, Spanish, German, and Italian, as well as English,
Icelandic, Swedish, Norwegian, and Dutch. There are other ISO 8859
character sets that cover characters needed by other languages like
Turkish, Polish, Greek, Russian, Hebrew, and Arabic.
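As a rough illustration of why the charset label matters, the sketch below
decodes the same two byte values under three ISO 8859 character sets. It
relies on Python's built-in codecs; the sample bytes and variable names are
my own choice for the example.

```python
# The same byte values stand for different characters in different
# ISO 8859 character sets.
raw = bytes([0xE1, 0xFD])

western = raw.decode("iso-8859-1")  # Latin-1, western European
greek = raw.decode("iso-8859-7")    # Greek
turkish = raw.decode("iso-8859-9")  # Latin-5, Turkish

for name, text in [("iso-8859-1", western),
                   ("iso-8859-7", greek),
                   ("iso-8859-9", turkish)]:
    # Show each decoded character as its Unicode number.
    print(name, [f"U+{ord(ch):04X}" for ch in text])
```

A page served with the wrong label will therefore show the wrong letters
even though no bytes have changed.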
Unicode is also a character set. It assigns numbers in the range 0 through
0x10FFFF (a little over a million code points) to a whole lot of characters.
Its repertoire includes all of the characters covered in other national and
international standards, including all of the ISO 8859 sets.
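To make the numbering concrete, here is a small sketch that looks up the
Unicode number and official name of a few characters that come from
different ISO 8859 repertoires. The sample characters are my own picks.

```python
import unicodedata

# A, e-acute, Turkish dotless i, Greek alpha, Cyrillic Zhe
samples = ["A", "\u00e9", "\u0131", "\u03b1", "\u0416"]

for ch in samples:
    # ord() gives the Unicode number; unicodedata.name() the character name.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

code_points = [ord(ch) for ch in samples]
```

Each character has exactly one number in Unicode, no matter which older
character set it originally came from.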
An encoding is a mapping between bit patterns and the numbers of a character
set. UTF-8 and UTF-16 are encodings of Unicode. In a sense, ISO 8859-1 and
its kin are also encodings of Unicode, but ones that cannot represent all of
the characters.
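A minimal sketch of that distinction, using Python's built-in codecs: the
same two characters become different byte sequences under UTF-8 and UTF-16,
and cannot be written at all in ISO 8859-1. The sample text and variable
names are assumptions made for the example.

```python
text = "\u011f\u03b1"  # Turkish g-with-breve, Greek alpha

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-be")  # big-endian, no byte order mark

print(utf8_bytes.hex())
print(utf16_bytes.hex())

# ISO 8859-1 has no slot for either character, so encoding fails.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print("fits in iso-8859-1:", latin1_ok)
```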
In short: Unless you are working in Klingon, Minbari, or Silvestri, Unicode
covers the characters you need in its repertoire. UTF-8 and UTF-16 are
both capable of representing all of the characters in Unicode. All XML
parsers are required to read UTF-8 and UTF-16 data.
Use them. Know them. Love them.
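One hedged way to check this for yourself: encode some Turkish and Greek
text, decode it again, and see whether it survives. The sketch below does
that, then feeds a UTF-16 document to the XML parser in Python's standard
library. The helper name round_trips and the sample text are hypothetical,
chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Turkish and Greek place names: Istanbul (dotted capital I), Athina
sample = "\u0130stanbul, \u0391\u03b8\u03ae\u03bd\u03b1"

def round_trips(text, encoding):
    """Return True if text survives encode followed by decode unchanged."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeError:
        return False

results = {enc: round_trips(sample, enc)
           for enc in ("utf-8", "utf-16", "iso-8859-1")}
print(results)

# An XML document stored as UTF-16. Python's "utf-16" codec prepends a
# byte order mark, which the parser uses to detect the encoding.
doc = '<?xml version="1.0" encoding="UTF-16"?><city>' + sample + "</city>"
parsed = ET.fromstring(doc.encode("utf-16"))
print(parsed.text == sample)
```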
-Chris
--
Christopher R. Maden, Principal Consultant, HMM Consulting Int'l, Inc.
DTDs/schemas - conversion - ebooks - publishing - Web - B2B - training
<URL: http://www.hmmci.com/ > <URL: http://crism.maden.org/consulting/ >
PGP Fingerprint: BBA6 4085 DED0 E176 D6D4 5DFC AC52 F825 AFEC 58DA
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list