Subject: Re: A FAQ question about non-Latin characters in XT output
From: David Carlisle <davidc@xxxxxxxxx>
Date: Wed, 13 Oct 1999 10:07:19 +0100 (BST)
> You use the phrase 'which does not directly encode
> position #x0107 '

probably, I shouldn't have:-(

> Guessing: position hex 107 in the utf-8 list of ?characters?
> What do you mean by 'encode' please?

Stepping back a bit.

The xml/unicode character set consists of the numbered characters in the range 1 through to hex 10FFFF (with some slots disallowed, but ignore that for now). That is the `Universal Character Set (UCS)'.

utf8 is a particular encoding of that range (actually it can encode the full UCS4 range, up to hex 7FFFFFFF, although `only' the first 17 planes of 2^16 characters are currently in Unicode, and only the first 2^16 characters up to FFFF are in Unicode 2.x).

Note that utf8 is just an `encoding' of the character number into a sequence of one or more 8bit bytes; it does not re-order or subset the available characters.

Now `traditional' encodings like `latin1' or `latin2' or `windows ansi' or `microsoft code page 850' or the 8bit cyrillic encodings cover subsets of the available characters in UCS (if they are not subsets they can not be used in XML, as the underlying character set in XML is always unicode).

> The charset in the xml declaration I believed
> to be one of inclusion/exclusion rather than
> 'encoding'.

No, it's encoding (that's why the syntax is encoding= :-)

If you say

<?xml version="1.0" encoding="microsoft-weirdness" ?>

then the available characters and the way they are encoded as bytes (ie effectively their order) is whatever Bill Gates says it is. So the byte with value 255 may or may not be y-umlaut (which is what position 255 is in latin1 and unicode).

However the syntax &#255; (and equivalently &#xFF;) _always_ refers to the unicode numbering, not the current encoding used to decode bytes of character data.

So....
If an XML system wants to output the character hex 107 (which is c-acute), then whether the encoding is the default utf8 or anything else, it can _always_ output it as either &#x107; or &#263;. However, since that is 7 or 6 bytes, if the xml declaration specifies an encoding for character data that includes this slot then the system will probably just use that.

This is a latin-2 character, so if the encoding is specified as latin-2 then c-acute can be encoded in the single byte with value 230. If the encoding is utf8 then there will be a two byte representation of character position 263, as shown in the original poster's question.

Since the request in this case was to force the system to use the character reference form, the actual encoding for the character data did not matter, as long as this character was _not_ part of the encoding. If you pick latin-1 (or ascii, or presumably a cyrillic encoding) then in that encoding there is no encoding for c-acute, ie no encoding for unicode #x107, so with any of these encodings the only way to get a c-acute is to use &#x107; (actually you could use c followed by a combining acute character, but whether or not that is the same thing depends on who you are, and what you are doing...)

David

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
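
The byte counts quoted above can be checked with a short Python sketch. This is only an illustration, not what XT or any XSLT processor actually does: Python's built-in codecs stand in for the serializer, and the helper name serialize is hypothetical. The standard xmlcharrefreplace error handler is assumed to behave like the fallback described in the message, emitting a decimal character reference whenever the chosen encoding has no slot for the character.

```python
# Sketch only: Python's codecs stand in for an XML serializer's output step.
C_ACUTE = "\u0107"  # c-acute, Unicode position hex 107 (decimal 263)


def serialize(text: str, encoding: str) -> bytes:
    """Encode text; fall back to a decimal character reference
    for any character the target encoding cannot represent."""
    return text.encode(encoding, errors="xmlcharrefreplace")


if __name__ == "__main__":
    # The same character under encodings that do and do not contain it.
    for name in ("utf-8", "iso-8859-2", "iso-8859-1", "ascii", "iso-8859-5"):
        print(f"{name:12} {serialize(C_ACUTE, name)!r}")

    # The same byte value read under two different encodings.
    for name in ("iso-8859-1", "cp850"):
        char = bytes([255]).decode(name)
        print(f"byte 255 as {name:12} -> U+{ord(char):04X}")
```

Encodings that contain c-acute yield raw bytes, while the others yield the character reference, which always uses the unicode number regardless of the declared encoding.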