Re: A FAQ question about non-Latin characters in XT output

> You use the phrase 'which does not directly encode
> position #x0107 '

probably, I shouldn't have:-(

> Guessing: position hex 107 in the utf-8 list of ?characters?

> What do you mean by 'encode' please?

Stepping back a bit.
The xml/unicode character set consists of the numbered characters
in the range 1 through to hex 10FFFF (with some slots disallowed,
but ignore that for now).

That is the `Universal Character Set (UCS)' 

utf8 is a particular encoding of that range (actually it can encode
the full UCS4 range, up to hex 7FFFFFFF, although `only' the first
17 planes of 2^16 characters are currently in Unicode, and only the
first 2^16 characters, up to FFFF, are in Unicode 2.x).

Note that utf8 is just an `encoding' of the 32bit character number into
a sequence of 1 or more 8bit bytes; it does not re-order or subset the
available characters.
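
To make the "number in, bytes out" idea concrete, here is a minimal sketch of a hand-written utf8 encoder in Python. The function name utf8_encode is made up for illustration, it only covers the range up to hex 10FFFF, and it does not reject the disallowed slots; in practice you would just call the built-in str.encode.

```python
def utf8_encode(code_point: int) -> bytes:
    """Illustrative UTF-8 encoder for code points 0 .. 0x10FFFF (no surrogate checks)."""
    if code_point < 0x80:
        # 7 bits fit in a single byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:
        # 11 bits in two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 16 bits in three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:
        # 21 bits in four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point outside the range handled by this sketch")


# The character number is unchanged; only the byte sequence gets longer.
for cp in (0x41, 0xFF, 0x107, 0x20AC, 0x10FFFF):
    data = utf8_encode(cp)
    print(f"U+{cp:04X} -> {len(data)} byte(s): {[hex(b) for b in data]}")
```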

Now `traditional' encodings like `latin1' or `latin2' or `windows ansi'
or `microsoft code page 850' or the 8bit cyrillic encodings
are subsets of the available characters in UCS (if they are not subsets
they can not be used in XML as the underlying character set in XML is
always unicode). 
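
The "subset" point can be checked roughly with Python's codec tables. This is only a sketch: the helper name repertoire is invented here, and it assumes Python's codec names (iso-8859-1, iso-8859-2, cp850) correspond to the encodings mentioned above.

```python
def repertoire(encoding: str) -> set:
    """Characters reachable from the 256 possible byte values of an 8bit encoding.

    Byte values the codec leaves undefined are skipped via errors="ignore".
    """
    return set(bytes(range(256)).decode(encoding, errors="ignore"))


C_ACUTE = "\u0107"  # c-acute, unicode position hex 107

for name in ("ascii", "iso-8859-1", "iso-8859-2", "cp850"):
    chars = repertoire(name)
    # Every character of every encoding has a unicode number at or below 10FFFF.
    inside_ucs = all(ord(ch) <= 0x10FFFF for ch in chars)
    print(f"{name:11} {len(chars):3} characters, "
          f"inside unicode: {inside_ucs}, has c-acute: {C_ACUTE in chars}")
```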

> The charset in the xml declaration I believed
> to be one of inclusion/exclusion rather than
> 'encoding'.

No, it's encoding (that's why the syntax is encoding= -)

If you say

<?xml version="1.0" encoding="microsoft-weirdness" ?>

then the available characters and the way they are encoded as bytes
(ie effectively their order) is whatever Bill Gates says it is.
So the byte with value 255 may or may not be y-umlaut (which is what
position 255 is in latin1 and unicode). However the syntax &#255;
(and equivalently &#xFF;) _always_ refers to the unicode numbering,
not the current encoding used to decode bytes of character data.
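
A small experiment showing the difference, assuming Python's standard xml.etree.ElementTree parser and two real encodings (latin1 and latin2) in place of the imaginary "microsoft-weirdness". The helper name text_of and the element name t are made up for the example.

```python
import xml.etree.ElementTree as ET


def text_of(document: bytes) -> str:
    """Parse a tiny XML document and return the text of its root element."""
    return ET.fromstring(document).text


# Same body in both documents: the raw byte 255, then two character references.
body = b"<t>\xff|&#255;|&#xFF;</t>"
latin1_doc = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body
latin2_doc = b'<?xml version="1.0" encoding="iso-8859-2"?>' + body

for label, doc in (("latin1", latin1_doc), ("latin2", latin2_doc)):
    raw, decimal_ref, hex_ref = text_of(doc).split("|")
    # The raw byte depends on the declared encoding; the references do not.
    print(f"{label}: byte 255 -> U+{ord(raw):04X}, "
          f"&#255; -> U+{ord(decimal_ref):04X}, &#xFF; -> U+{ord(hex_ref):04X}")
```

Only the meaning of the raw byte should change between the two documents; both references should come out as position 255 either way.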


So....

Whatever the encoding is (including the default utf8 encoding), if an
XML system wants to output the character hex 107 (which is c-acute) then
it can _always_ output it as either
&#x107; or &#263;
however since that is 6 or 7 bytes, if the xml declaration specifies
an encoding for character data that includes this slot then probably
the system will just do that. This is a latin-2 character so if
the encoding is specified as latin-2 then c acute can be encoded in the
single byte with value 230. If the encoding is utf8 then there will
be a two byte representation of character position 263, as shown
in the original poster's question.
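
The byte counts above can be reproduced with a short Python sketch. The labels and the name candidates are just for illustration, and iso-8859-2 is assumed to be the codec meant by latin-2.

```python
C_ACUTE = "\u0107"  # unicode position hex 107, decimal 263

# Four ways the same character might appear in the output file.
candidates = {
    "latin-2 byte": C_ACUTE.encode("iso-8859-2"),
    "utf8 bytes": C_ACUTE.encode("utf-8"),
    "hex reference": f"&#x{ord(C_ACUTE):X};".encode("ascii"),
    "decimal reference": f"&#{ord(C_ACUTE)};".encode("ascii"),
}

for label, data in candidates.items():
    print(f"{label:18} {len(data)} byte(s): {list(data)}")
```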

Since the request in this case was to force the system to use the
character reference form, the actual encoding for the character data
did not matter, as long as this character was _not_ part of the
encoding.

If you pick latin-1 (or ascii, or presumably a cyrillic encoding) then
in that encoding there is no encoding for c-acute ie no encoding for
unicode #x107, so with any of these encodings the only way to get a c
acute is to use &#x107; (actually you could use c followed by a combining
acute character, but whether or not that is the same thing depends on
who you are, and what you are doing...)
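
As a rough illustration of that fallback (this is Python's codec machinery, not what XT itself does), the "xmlcharrefreplace" error handler writes a character reference for anything the target encoding cannot represent. The helper name serialize is invented for the example; the last lines show the combining-acute alternative.

```python
import unicodedata

C_ACUTE = "\u0107"


def serialize(text: str, encoding: str) -> bytes:
    """Encode text, falling back to &#NNN; for characters the encoding lacks."""
    return text.encode(encoding, errors="xmlcharrefreplace")


for name in ("ascii", "iso-8859-1", "iso-8859-2", "utf-8"):
    print(f"{name:11} {serialize(C_ACUTE, name)!r}")

# c followed by a combining acute is a different character sequence...
combining = "c\u0301"
print(combining == C_ACUTE)
# ...unless you normalize, in which case it composes to the single character.
print(unicodedata.normalize("NFC", combining) == C_ACUTE)
# In latin-1 the c survives as a byte but the combining acute still needs a reference.
print(serialize(combining, "iso-8859-1"))
```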

David


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
