Re: [xsl] Problems with characters

Subject: Re: [xsl] Problems with characters
From: Mike Brown <mike@xxxxxxxx>
Date: Wed, 20 Feb 2002 00:27:45 -0700 (MST)

Ragulf Pickaxe wrote:
> I have a problem with characters using characterset 8859-1.

You have more problems than that :)

> Rather than displaying the Danish characters of ?,? and ? the presentation 
> is the following:
> ? instead of ?
> ? instead of ?
> ? is depicted as ?

As you can see, my email software did not like the fact that your email
contained bytes outside the ASCII range (ASCII = 00-7F, and that's being
generous) and that your email failed to declare what character set to use when
interpreting these bytes.

Let's look at your email with a hex editor:

 72 73 65 74 20 38 38 35  39 2d 31 2e 0a 0a 52 61  |rset 8859-1...Ra|
 74 68 65 72 20 74 68 61  6e 20 64 69 73 70 6c 61  |ther than displa|
 79 69 6e 67 20 74 68 65  20 44 61 6e 69 73 68 20  |ying the Danish |
 63 68 61 72 61 63 74 65  72 73 20 6f 66 20 bf 2c  |characters of ¿,|
 b8 20 61 6e 64 20 e5 20  74 68 65 20 70 72 65 73  |¸ and å the pres|
 65 6e 74 61 74 69 6f 6e  20 0a 69 73 20 74 68 65  |entation .is the|
 20 66 6f 6c 6c 6f 77 69  6e 67 3a 0a e6 20 69 6e  | following:.æ in|
 73 74 65 61 64 20 6f 66  20 bf 0a f8 20 69 6e 73  |stead of ¿.ø ins|
 74 65 61 64 20 6f 66 20  b8 0a e5 20 69 73 20 64  |tead of ¸.å is d|
 65 70 69 63 74 65 64 20  61 73 20 e5 0a 0a 49 73  |epicted as å..Is|

OK, on the left are the hex notations for the bytes, and on the right is how
my terminal renders them. On the fourth line, the 2nd-to-last byte is BF,
which on my terminal looks like an upside-down question mark. That is what BF
means in iso-8859-1 (INVERTED QUESTION MARK; A1 is the inverted exclamation
mark), but nothing in your email says the bytes are iso-8859-1, and I doubt
you meant to type an upside-down question mark as a Danish letter. So we can
safely assume that what I see and what you see may very well be two completely
different things :)
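
As a rough illustration of why undeclared bytes are ambiguous, here is a
minimal Python sketch that decodes the non-ASCII bytes from the dump above
under two character sets. The second one (cp437) is only an assumed example of
"some other charset"; nothing in this thread says which charset any of the
software involved actually uses.

```python
# The non-ASCII bytes that appear in the hex dump above.
raw = bytes([0xBF, 0xB8, 0xE5, 0xE6, 0xF8])


def decode_each(data, charset):
    """Decode every byte on its own so the byte -> character mapping is visible."""
    return [bytes([b]).decode(charset) for b in data]


latin1 = decode_each(raw, "iso-8859-1")
cp437 = decode_each(raw, "cp437")  # assumed alternative, for illustration only

for b, a, c in zip(raw, latin1, cp437):
    print(f"{b:02X}  iso-8859-1: {a!r}  cp437: {c!r}")
```

The bytes are identical in both columns; only the charset used to interpret
them changes, and that is the piece of information the email did not carry.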

Therefore, I cannot even begin to answer your questions, because I have no
idea what characters you think you were typing in your email. If you go to
http://www.eki.ee/letter/chardata.cgi?ucode=00a0-00ff you will probably find
the info you seek, and you will also find the official Unicode names for these
characters (e.g. "LATIN CAPITAL LETTER A WITH RING ABOVE") and their Unicode
code points (e.g. "00C5", conventionally written "U+00C5"; the eight-digit
"U-000000C5" form is an older ISO 10646 notation),
either of which will help you effectively communicate what characters you are
talking about.
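
If looking characters up by hand is tedious, the same information can be
produced programmatically. The following is a small sketch using Python's
standard unicodedata module; the helper name describe is made up for this
example.

```python
import unicodedata


def describe(ch):
    """Return the code point and official Unicode name of one character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


# The three Danish letters, plus the capital used as an example above.
for ch in "\u00e6\u00f8\u00e5\u00c5":
    print(describe(ch))
```

Quoting either the name or the code point in an email survives any charset
confusion, because both are plain ASCII.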

> I get my data from an SQL-database and transform it twice, both with
> <?xml version="1.0" encoding="ISO-8859-1"?>

Well... that's not saying much. The encoding declaration in an XML document is
saying "the bytes in this document map to Unicode characters according to the
iso-8859-1 character map". It is expected to be a truthful assertion, and is
only for the XML parser's benefit. You need more info about exactly what bytes
are going into the database, what bytes are coming out, and then what you're
doing with them after that. There are many possible points of failure, and I
suspect you may be corrupting or losing encoding information for what goes
into your database in the first place...
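
To show what an untruthful declaration does, here is a hedged sketch using
Python's standard xml.etree.ElementTree parser. The element name d and the
helper parse_text are invented for the example, and the "wrong" case assumes
the bytes were really UTF-8; your actual data may be mislabeled in some other
way.

```python
import xml.etree.ElementTree as ET

DECLARATION = b'<?xml version="1.0" encoding="ISO-8859-1"?>'


def parse_text(payload):
    """Wrap raw bytes in a document that claims to be ISO-8859-1 and parse it."""
    document = DECLARATION + b"<d>" + payload + b"</d>"
    return ET.fromstring(document).text


danish = "\u00e6\u00f8\u00e5"  # ae, o with stroke, a with ring

# Truthful: the bytes really are ISO-8859-1, so the parser recovers the text.
honest = parse_text(danish.encode("iso-8859-1"))

# Untruthful: the bytes are UTF-8, but the declaration still says ISO-8859-1.
# The parser believes the declaration and yields two characters per letter.
mislabeled = parse_text(danish.encode("utf-8"))

print(repr(honest), repr(mislabeled))
```

The parser raises no error in the second case, which is why a wrong
declaration tends to go unnoticed until the output is displayed.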

> I suspect it is the conversion of data from SQL database characterset to 
> output/stylesheet characterset, but I don't know what to do about it.

You should educate yourself about encoding issues, and then trace your
character data through its entire lifetime from its creation to its storage to
its transmission and interpretation... every step of the way introduces the
possibility of confusion with respect to encoding.
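
One way to do that tracing is to compare the characters before and after each
step. The sketch below is purely hypothetical: the stage names and the bug in
the database stage are assumptions made up for illustration, not a diagnosis
of your setup.

```python
def first_corrupting_stage(text, stages):
    """Return the name of the first stage that changes the text, or None."""
    current = text
    for name, stage in stages:
        result = stage(current)
        if result != current:
            return name
        current = result
    return None


def read_form_input(s):
    # Hypothetical stage: a consistent ISO-8859-1 round trip, no damage.
    return s.encode("iso-8859-1").decode("iso-8859-1")


def store_in_database(s):
    # Hypothetical bug: written as UTF-8 but read back as ISO-8859-1.
    return s.encode("utf-8").decode("iso-8859-1")


def serialize_as_xml(s):
    # Hypothetical stage: a consistent UTF-8 round trip, no damage.
    return s.encode("utf-8").decode("utf-8")


stages = [
    ("form input", read_form_input),
    ("database", store_in_database),
    ("xml output", serialize_as_xml),
]

print(first_corrupting_stage("\u00e6\u00f8\u00e5", stages))
print(first_corrupting_stage("abc", stages))
```

Note that pure ASCII test data passes through the faulty stage unchanged, so
always trace with the characters that are actually causing trouble.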

Good luck.

http://skew.org/xml/tutorial/ explains encoding w.r.t. XML
http://skew.org/xml/links/ has many good encoding-related links

   - Mike
____________________________________________________________________________
  mike j. brown, fourthought.com  |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  personal: http://hyperreal.org/~mike/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

