Subject: RE: [xsl] Challenges with unparsed-text() and reading UTF-8 file as UTF-16
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Fri, 13 Oct 2006 09:23:41 +0100

> Well, no HTTP server here. Just a text file read in from
> disk. But I guess the media types as registered by the
> operating system also count?

I think there are always going to be some differences in interpretation of the specs in this kind of area, which is one reason why I think it's a good thing that there's now explicit allowance for implementation-defined heuristics.

The rule does say "external encoding information" rather than "HTTP headers", and that would certainly cover files read from a decent operating system that records encoding information reliably in the metadata for a file. Unfortunately most of us are using indecent operating systems without any reliable metadata (in fact, without any metadata at all).

An implementation could interpret this rule as allowing "I know this file is on an AS400 and I know that on an AS400 the default file encoding is IBM-EBCDIC". But I would hope that this would be done under rule 4 (implementation-defined heuristics) rather than under rule 1 (external encoding information).

> > Firstly, there's been one late change to the spec in this
> > area. It's now more permissive, it allows the processor
> > to try harder.
>
> I didn't read that from the specs. But then, I still haven't
> read every corner of it ;-)

It's not yet in the specs. The decision was reported in
http://www.w3.org/Bugs/Public/show_bug.cgi?id=3728

> > 3b: if there's something that looks like an [ASCII] XML
> > declaration with an encoding attribute, use that
>
> This is where things go wrong, I think. It appears as if
> Saxon indeed finds the XML declaration in either file, and
> uses it.

Indeed. Heuristics is a fancy name for guesswork. Guesses won't always give the right answer. Indeed, when examining files that contain false information about their own encoding, it's a recipe for getting the wrong answer.

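To make the point about rule 3b concrete, here is a minimal Python sketch of that kind of guesswork. It is not Saxon's code and not the spec's algorithm: the helper names `sniff_declared_encoding` and `read_text`, the regular expression, and the UTF-8 fallback are all assumptions made for illustration. The sample file declares ISO-8859-1 rather than UTF-16 only so that the wrong decoding is deterministic and easy to see.

```python
import re

# Hypothetical helper, for illustration only: look for something that resembles
# an ASCII XML declaration with an encoding attribute at the start of the data.
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def sniff_declared_encoding(data: bytes):
    """Return the encoding named in a leading XML declaration, or None."""
    match = _DECL.match(data[:200])
    return match.group(1).decode("ascii").lower() if match else None


def read_text(data: bytes, fallback: str = "utf-8") -> str:
    """Decode with the declared encoding if one is found, else the fallback.

    The fallback is an assumption of this sketch, not a rule from the spec.
    """
    encoding = sniff_declared_encoding(data) or fallback
    return data.decode(encoding)


# Both samples are stored as UTF-8 bytes; only the first declaration is truthful.
honest = '<?xml version="1.0" encoding="utf-8"?><p>café</p>'.encode("utf-8")
lying = '<?xml version="1.0" encoding="iso-8859-1"?><p>café</p>'.encode("utf-8")

print(read_text(honest))  # the declaration matches the bytes, so text is intact
print(read_text(lying))   # the declaration is trusted, so the "é" is mis-decoded
```

The second call shows the failure mode: the guess is only as good as the declaration it trusts, so a file that misreports its own encoding is decoded wrongly without any error being raised.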
I think that what you really need for your application is unparsed-binary(); the only problem with that is that the type system and F+O have very limited facilities for handling binary data.

Michael Kay
http://www.saxonica.com/