Re: [xsl] url encoding gets wrong with åöä?

Subject: Re: [xsl] url encoding gets wrong with åöä?
From: Abel Braaksma Online <abel.online@xxxxxxxxx>
Date: Wed, 07 Jun 2006 13:09:29 +0200
Niklas,

Three things to note here:
1. The Unicode code point for "LATIN SMALL LETTER A WITH DIAERESIS" (which appears to be the offending character here) is U+00E4.
2. In ISO/IEC 8859-1 (Latin-1), the same character is the single byte E4.
3. When Unicode is encoded as UTF-8 (in which every 7-bit ASCII character is one byte long and identical to its ISO-8859-1 encoding, every other character becomes a sequence of two to four bytes, and the result is independent of byte order), code point U+00E4 is encoded as the hexadecimal byte sequence C3 A4.
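
The three values in the list above can be checked with a few lines of JavaScript. This is only a sketch: the variable names and the small hex helper are mine, and it assumes a runtime that provides TextEncoder (current browsers and Node.js do).

```javascript
// U+00E4, LATIN SMALL LETTER A WITH DIAERESIS, written as an escape so the
// example does not depend on the encoding of the source file.
const ch = "\u00E4";

// Helper (hypothetical name): format a number as uppercase hex digits.
const hex = (n, width = 2) => n.toString(16).toUpperCase().padStart(width, "0");

// 1. The Unicode code point.
const codePoint = "U+" + hex(ch.codePointAt(0), 4);

// 2. The ISO-8859-1 byte: for code points up to FF, Latin-1 stores the
//    code point value itself in a single byte.
const latin1Bytes = [hex(ch.codePointAt(0))];

// 3. The UTF-8 bytes. TextEncoder always produces UTF-8.
const utf8Bytes = Array.from(new TextEncoder().encode(ch), (b) => hex(b));

console.log(codePoint, "|", latin1Bytes.join(" "), "|", utf8Bytes.join(" "));
```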


To test this on Windows, create a text file that contains only the letter ä. Save it as ANSI and open it in a hex viewer: you will see a one-byte document, hex E4. Save the same document as UTF-8 (*not* UTF-16 or another Unicode encoding!) and the hex view shows the byte sequence C3 A4 instead. (Notepad also puts the three-byte signature EF BB BF in front of it.)

Now for your problem. It is logical to assume that the part of your code that produces the link text correctly finds that the entity for "LATIN SMALL LETTER A WITH DIAERESIS" is needed and encodes the text with "&auml;". Which is very nice.

But the code that produces the URL does not do the same trick. I don't have your ASP code here, so I can only assume that something goes wrong there. My best guess is that the code treats its input as ISO-8859-1 and handles the two-byte UTF-8 sequence as if it were two ISO-8859-1 characters, which, no doubt, goes wrong.
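
To make that kind of mix-up concrete, here is what happens when the two UTF-8 bytes are read as ISO-8859-1 and the result is escaped again. This is a hypothetical reconstruction in JavaScript, not your ASP code, and the variable names are made up for the illustration.

```javascript
// The UTF-8 encoding of U+00E4 is the two bytes C3 A4.
const utf8Bytes = new TextEncoder().encode("\u00E4");

// The misreading: treat every byte as one ISO-8859-1 character.
// String.fromCharCode maps each byte value to the code point with the same
// number, which is what a Latin-1 decoder does.
const misread = String.fromCharCode(...utf8Bytes);

// Escaping the misread text as UTF-8 now escapes two wrong characters
// (U+00C3 and U+00A4) instead of the single intended one.
const doubleEncoded = encodeURIComponent(misread);

console.log(misread.length, doubleEncoded);
```

A URL that contains a sequence like the one printed here, rather than a plain %C3%A4, is a fairly reliable sign that UTF-8 bytes were interpreted as ISO-8859-1 somewhere along the way.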

I would suggest you do the following: use the same encoding for your link (if a link is encoded with "&auml;", the browser will translate it to the right HTTP escapes). Another option is to change your code so that it understands Unicode. One thing comes to mind: if you also use JScript or JavaScript, the escape() and unescape() functions do not work correctly with Unicode (they are infamous for that). Use the newer encodeURI() / decodeURI() instead.
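
The difference between the two families of functions is easy to see side by side. A small sketch (the variable names are mine; escape() is deprecated but still present in browsers and Node.js):

```javascript
const text = "Avh\u00E4mtning"; // the word from this thread, with U+00E4

// Legacy escape(): code points up to FF become a single %HH (Latin-1 style),
// anything above becomes the non-standard %uHHHH form.
const legacy = escape(text);
const legacyEuro = escape("\u20AC"); // EURO SIGN, a code point above FF

// encodeURI(): encodes the character as UTF-8 first, then writes each
// byte as %HH.
const encoded = encodeURI(text);

// decodeURI() reverses it.
const roundTrip = decodeURI(encoded);

console.log(legacy, legacyEuro, encoded, roundTrip === text);
```

For a single query-string value, encodeURIComponent() applies the same UTF-8 escaping and additionally escapes reserved characters such as & and =.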

Hope this points you in the right direction. I haven't read everything in this thread, so I hope I haven't repeated others too much. If I did, I apologize in advance.

Cheers,
Abel





The URL comes out as "Avh%C3%A4mtning" and the link text as "Avh&auml;mtning".
The encoding in the URL is wrong. It should be Avh%E4mtning.



No, the encoding in the URL is correct. The correct procedure for escaping non-ASCII characters in a URL is to first encode the character in UTF-8, then represent each octet of the UTF-8 sequence in hexadecimal as %HH.
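
The two steps of that procedure can be spelled out by hand. The following is a minimal sketch in JavaScript; the function name percentEncodeNonAscii is invented for this example, and it deliberately leaves every ASCII character untouched, so it is not a complete URL escaper. It assumes a runtime with TextEncoder.

```javascript
// Hypothetical helper: escape only the non-ASCII characters of a string.
function percentEncodeNonAscii(text) {
  // Step 1: encode the text as UTF-8.
  const bytes = new TextEncoder().encode(text);

  let out = "";
  for (const b of bytes) {
    if (b < 0x80) {
      // ASCII bytes are copied through unchanged.
      out += String.fromCharCode(b);
    } else {
      // Step 2: write every other octet as %HH.
      out += "%" + b.toString(16).toUpperCase().padStart(2, "0");
    }
  }
  return out;
}

console.log(percentEncodeNonAscii("Avh\u00E4mtning"));
```

For this input the result is the same string that the built-in encodeURI() produces, which is also the URL quoted above.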


The important question is, does the link actually work? If it doesn't work, which browser are you using?

Michael Kay
http://www.saxonica.com/
