RE: [xsl] Challenges with unparsed-text() and reading UTF-8 file as UTF-16

Subject: RE: [xsl] Challenges with unparsed-text() and reading UTF-8 file as UTF-16
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Fri, 13 Oct 2006 09:23:41 +0100

> Well, no HTTP server here. Just a text file read in from 
> disk. But I guess the media types as registered by the 
> operating system also count?

I think there are always going to be some differences in interpretation of
the specs in this kind of area, which is one reason why I think it's a good
thing that there's now explicit allowance for implementation-defined
heuristics.

The rule does say "external encoding information" rather than "HTTP
headers", and that would certainly cover files read from a decent operating
system that records encoding information reliably in the metadata for a
file. Unfortunately most of us are using indecent operating systems without
any reliable metadata (in fact, without any metadata at all). An
implementation could interpret this rule as allowing "I know this file is on
an AS400 and I know that on an AS400 the default file encoding is
IBM-EBCDIC". But I would hope that this would be done under rule 4
(implementation-defined heuristics) rather than under rule 1 (external
encoding information).

> > Firstly, there's been one late change to the spec in this area. It's
> > now more permissive, it allows the processor to try harder.
> 
> I didn't read that from the specs. But then, I still haven't 
> read every corner of it ;-)

It's not yet in the specs. The decision was reported in

http://www.w3.org/Bugs/Public/show_bug.cgi?id=3728

> > 3b: if there's something that looks like an [ASCII] XML
> > declaration with an encoding attribute, use that
>
> This is where things go wrong, I think. It appears as if 
> Saxon indeed finds the XML declaration in either file, and 
> uses it. 

Indeed. Heuristics is a fancy name for guesswork, and guesses won't always
give the right answer. When the file being examined contains false
information about its own encoding, trusting that information is a recipe
for getting the wrong answer.
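
To make the failure concrete, here is a minimal sketch in Python (not XSLT, and not Saxon's actual code) of a heuristic along the lines of 3b: look for an ASCII XML declaration and trust its encoding attribute. The names sniff_declared_encoding and decode_with_guess and the two sample byte strings are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical heuristic: an ASCII XML declaration at the very start of the
# data, carrying an encoding pseudo-attribute.
_XML_DECL = re.compile(
    rb'^<\?xml\s[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_declared_encoding(data: bytes):
    """Return the encoding named in an ASCII XML declaration, or None."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else None


def decode_with_guess(data: bytes, fallback: str = "utf-8") -> str:
    """Decode using the declared encoding if one is found, else the fallback."""
    encoding = sniff_declared_encoding(data) or fallback
    # errors="replace" keeps the sketch from raising on undecodable bytes.
    return data.decode(encoding, errors="replace")


# Both samples are stored as UTF-8 bytes; only the declaration differs.
honest = '<?xml version="1.0" encoding="UTF-8"?><a>é</a>'.encode("utf-8")
lying = '<?xml version="1.0" encoding="UTF-16"?><a>é</a>'.encode("utf-8")

print(decode_with_guess(honest))  # readable text
print(decode_with_guess(lying))   # mojibake: UTF-8 bytes paired up as UTF-16
```

The second file is found to "declare" UTF-16 precisely because the declaration is readable as ASCII, which is itself evidence that the file is not UTF-16. A more cautious heuristic might notice that contradiction, but it would still be guessing.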

I think that what you really need for your application is unparsed-binary();
the only problem with that is that the type system and F+O have very limited
facilities for handling binary data.
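
As a rough illustration of the idea, again in Python rather than XSLT: the sketch below reads the raw octets without guessing at any encoding, carries them as base64 text (roughly the role xs:base64Binary plays in the type system), and decodes them only when the caller names an encoding. The function names read_binary_as_base64 and decode_explicitly are hypothetical; no unparsed-binary() function exists in the spec.

```python
import base64
from pathlib import Path


def read_binary_as_base64(path: str) -> str:
    """Return the file's octets as base64 text, with no character decoding."""
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")


def decode_explicitly(b64: str, encoding: str) -> str:
    """Decode the preserved octets under an encoding chosen by the caller."""
    return base64.b64decode(b64).decode(encoding)
```

With the octets preserved, the application can ignore whatever the file claims about itself and apply its own knowledge, for example decode_explicitly(b64, "utf-8").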

Michael Kay
http://www.saxonica.com/
