[xsl] codepoints-to-string and string-to-codepoints only support Unicode. Why is that?

Subject: [xsl] codepoints-to-string and string-to-codepoints only support Unicode. Why is that?
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Mon, 12 Feb 2007 13:39:21 +0100
Hi List people,

I have the following requirement: parsing files in a non-Unicode encoding that contain numeric escape references to codepoints in a non-Unicode codepage. Let me explain with a well-known example, the encoding of a URI. The following is an illegal URI:

http://somesite.com/request?city=%C5rhus

which is meant to say http://somesite.com/request?city=Århus (the first character after 'city=' is a Latin Capital Letter A With Ring Above, U+00C5). To be a correct URI it should have been encoded as http://somesite.com/request?city=%C3%85rhus, where %C3%85 is the UTF-8 byte sequence for U+00C5.

In this example, the wrongly encoded '%C5' refers to codepoint 197 (hex C5) in the ISO-8859-1 table. Because ISO-8859-1 coincides with the first 256 Unicode codepoints, this translates directly to U+00C5 by using codepoints-to-string (plus some additional work for the hex-to-decimal conversion).
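To make the ISO-8859-1 case concrete outside XSLT, here is a minimal Python sketch (Python is used only for illustration; the variable names are made up for this example). It relies on the standard library's urllib.parse.unquote and quote, and shows both the lenient reading of the wrong URI and the correct UTF-8 encoding:

```python
from urllib.parse import quote, unquote

# ISO-8859-1 byte values equal the first 256 Unicode codepoints,
# so the hex escape C5 converts directly to U+00C5.
direct = chr(int("C5", 16))

# Read the wrongly encoded query as ISO-8859-1 bytes.
latin1_decoded = unquote("city=%C5rhus", encoding="iso-8859-1")

# What a correct URI would contain: the UTF-8 bytes, percent-encoded.
utf8_encoded = quote("\u00c5rhus", encoding="utf-8")
```

This shortcut only works because of the ISO-8859-1 coincidence; it does not carry over to other codepages.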

Alas, I have plenty of this stuff, and most of it is not URIs but arbitrary data, and most of the encodings are not even ISO-8859-1 but some other, less common encoding, like ISO-8859-7 (Greek). If the example above were encoded in ISO-8859-7, the C5 would represent the Greek capital letter Epsilon, U+0395.

My problem is with strings containing hex or decimal numbers pointing to a codepoint in an arbitrary encoding (the encoding is known beforehand). This may look as follows for the Greek encoding:

[C5]psilon
%C5psilon
<C5>psilon

and I would like to translate 'C5' using the 'ISO-8859-7' encoding with a function similar to codepoints-to-string(), except that the codepoint refers to a code in some codepage other than Unicode (here: ISO-8859-7).
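As an illustration of the behaviour I am after, here is a rough Python sketch, not an XSLT solution. The helper name decode_escapes and the regular expression are hypothetical, and it assumes a single-byte encoding and two-digit hex escapes in exactly the three notations shown above:

```python
import re

# Hypothetical pattern: a two-digit hex escape written as [C5], %C5 or <C5>.
ESCAPE = re.compile(
    r"\[([0-9A-Fa-f]{2})\]|%([0-9A-Fa-f]{2})|<([0-9A-Fa-f]{2})>"
)


def decode_escapes(text: str, encoding: str) -> str:
    """Replace each hex escape by the character it denotes in `encoding`.

    Assumes a single-byte encoding: each escape is decoded on its own.
    """
    def replace(match: re.Match) -> str:
        # Exactly one of the three groups matched; pick it.
        hex_digits = next(g for g in match.groups() if g is not None)
        # Interpret the value as one byte in the given codepage.
        return bytes([int(hex_digits, 16)]).decode(encoding)

    return ESCAPE.sub(replace, text)


greek = decode_escapes("[C5]psilon", "iso-8859-7")
```

With "iso-8859-7" the escape C5 becomes U+0395; with "iso-8859-1" the same input would give U+00C5. A byte value that the codepage leaves undefined would raise a decoding error here, which a real implementation would have to handle.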

A lot of functions in XSLT, when dealing with serialization or reading sources, can deal with encodings, but not codepoints-to-string, which only deals with Unicode. Is anyone aware of the reasons why this is done? Why is there no function defined like:

fn:string-to-codepoints($arg as xs:string?, $enc as xs:string?) as xs:integer*

where $enc is the encoding? Is this the result of the encoding bits being part of XSLT, whereas XPath only deals with Unicode? My guess would be that XPath only sees strings of Unicode characters and is unaware of any encoding issues, at any time.
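For what it is worth, the semantics I have in mind for such an encoding-aware pair can be sketched in a few lines of Python. The function names mirror the XPath ones but are purely hypothetical, and the sketch assumes a single-byte encoding so that one code equals one byte:

```python
from typing import List


def string_to_codepoints(arg: str, enc: str) -> List[int]:
    """Return the codes of `arg` in codepage `enc` (single-byte assumed)."""
    return list(arg.encode(enc))


def codepoints_to_string(points: List[int], enc: str) -> str:
    """Inverse direction: build a string from codes in codepage `enc`."""
    return bytes(points).decode(enc)


codes = string_to_codepoints("\u0395", "iso-8859-7")
text = codepoints_to_string([0xC5], "iso-8859-7")
```

For multi-byte encodings the notion of a 'codepoint in the codepage' would need a more careful definition than this.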

Apart from these questions, is anybody aware of a resolution to my problem? Most likely I need an extension function, am I right?

Thanks for any thoughts on this,

Cheers,
-- Abel
