RE: [xsl] Challenges with unparsed-text() and reading UTF-8 file as UTF-16

Subject: RE: [xsl] Challenges with unparsed-text() and reading UTF-8 file as UTF-16
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Fri, 13 Oct 2006 00:45:08 +0100
> I use the 
> document() function to read them (handy, I can read more 
> files at once). But when this situation happens, there won't 
> be an error, it just returns the empty set. 

Actually, that's up to you: in Saxon it's configurable whether this error is
recovered or not.

> So I
> thought: for analysis, read it also as text using 
> unparsed-text(). But this wasn't so trivial as I first thought:
> 
> Situation:
> 1. Guy saves an XML file as UTF-8, but prolog says "utf-16".
> 2. XSLT started. Some error condition is met. Trying to read 
> the file with unparsed-text()
> 3. Result: some beautiful Chinese characters
> 4. Result: trying to find out what goes wrong and why appeared 
> challenging, esp because file shows up normal in text editor

You're being a bit too concise here; it's not clear to me what's going on.
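
One plausible reading of the scenario, sketched in Python only so that it can
be run on its own: bytes written as UTF-8 are decoded as UTF-16, so each pair
of ASCII bytes collapses into one character, most of them in the CJK ranges.
The sample document and the choice of big-endian order are assumptions made
for illustration, not details taken from the original report.

```python
# Hypothetical sample: a file saved as UTF-8 whose declaration claims utf-16.
data = '<?xml version="1.0" encoding="utf-16"?><doc/>'.encode("utf-8")

# Pad to an even length so every byte belongs to a 16-bit code unit.
if len(data) % 2:
    data += b" "

as_utf8 = data.decode("utf-8")       # what a text editor would show
as_utf16 = data.decode("utf-16-be")  # what a reader trusting the declaration gets

print(as_utf8)
print(as_utf16)
```

The second line printed contains half as many characters as the first, and
none of them is ASCII, which would explain why the file looks normal in an
editor but not in the output.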
> 
> Perhaps this is wrong, perhaps this is right. I think the 
> mistake lies in unparsed-text(). The XSLT specs are quite clear:
> 1) if you can read it as XML, try to read the encoding from it,
> 2) otherwise, use supplied encoding in function call
> 3) otherwise, use UTF-8

I think you're trying to abbreviate the text, which is reasonable, but in
doing so you've misrepresented it. Rule 1 (which is actually rule 2 in the
spec) does not say "if you can read it as XML", it says "if the media type
of the resource is text/xml or application/xml (see [RFC2376]), or if it
matches the conventions text/*+xml or application/*+xml (see [RFC3023]
and/or its successors)". So if an HTTP server serves up a non-XML document
with an application/xml media type, this rule is going to kick in.
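
To make the point concrete, here is a minimal Python sketch of that test. The
function name and the regular expression are my own illustration, not text
from the spec; note that it looks only at the media type label and never at
the content.

```python
import re

# Matches text/xml, application/xml, text/*+xml and application/*+xml.
_XML_MEDIA_TYPE = re.compile(
    r"^(text|application)/(xml|[^/;\s]+\+xml)$", re.IGNORECASE
)

def is_xml_media_type(content_type: str) -> bool:
    """Return True if the media type alone would trigger the XML rule."""
    # Drop parameters such as "; charset=utf-8" before matching.
    media_type = content_type.split(";", 1)[0].strip()
    return _XML_MEDIA_TYPE.match(media_type) is not None

print(is_xml_media_type("application/xhtml+xml"))
print(is_xml_media_type("text/plain"))
```

Because the decision is made on the label, a non-XML document served as
application/xml still passes this test.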

> 
> My question: why is the file read in the encoding specified in the
> (illegal) XML file? 

If you're asking why the spec is as it is, the answer is (a) for
compatibility with XInclude, and (b) for use cases where you want to use
unparsed-text() to read XML/XHTML without parsing it.
> 
> My resolution is a workaround. Because Saxon appears to 
> follow a different order: (2) (1), (3), specifying the 
> encoding by hand helps a bit (but I don't know the encoding 
> in advance either, so I have to do several tests).

Firstly, there's been one late change to the spec in this area. It's now
more permissive: it allows the processor to try harder. In particular, it
allows (but does not require) the processor to use other heuristics, such as
the presence of a byte order mark, or "magic numbers", etc, within the
content of the file itself. Saxon in fact does take account of the byte
order mark if present.

It would be useful here to discuss the actual text of the rules in their
original numbering:

   1. external encoding information is used if available, otherwise

   2. if the media type of the resource is text/xml or application/xml (see
[RFC2376]), or if it matches the conventions text/*+xml or application/*+xml
(see [RFC3023] and/or its successors), then the encoding is recognized as
specified in [XML 1.0], otherwise

   3. the value of the $encoding argument is used if present, otherwise

   4. [new] the processor may use implementation-defined heuristics to
determine the encoding, otherwise

   5. UTF-8 is assumed.
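
The five rules above can be read as a simple fall-through. The Python sketch
below is only an illustration: the function and parameter names are invented,
and it assumes that when rule 2 applies and XML 1.0 detection finds neither a
byte order mark nor a declaration, the XML 1.0 default of UTF-8 is the result.

```python
from typing import Callable, Optional

def choose_encoding_per_spec(
    external: Optional[str] = None,
    xml_media_type: bool = False,
    xml_detected: Optional[str] = None,
    encoding_arg: Optional[str] = None,
    heuristic: Optional[Callable[[], Optional[str]]] = None,
) -> str:
    """Apply rules 1 to 5 in order and return the first encoding found."""
    # Rule 1: external encoding information, e.g. an HTTP charset parameter.
    if external:
        return external
    # Rule 2: XML media types use XML 1.0 detection. Assumption: with no
    # BOM or declaration, the XML 1.0 default (UTF-8) applies.
    if xml_media_type:
        return xml_detected or "UTF-8"
    # Rule 3: the $encoding argument supplied by the stylesheet.
    if encoding_arg:
        return encoding_arg
    # Rule 4: optional implementation-defined heuristics.
    if heuristic is not None:
        guessed = heuristic()
        if guessed:
            return guessed
    # Rule 5: the final default.
    return "UTF-8"

# An XML media type with a declaration saying utf-16 beats $encoding.
print(choose_encoding_per_spec(xml_media_type=True, xml_detected="utf-16",
                               encoding_arg="utf-8"))
```

Under this reading, supplying $encoding cannot override a wrong declaration
once the resource is treated as having an XML media type.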

It's true that Saxon doesn't currently implement this quite as written. The
sequence currently followed by Saxon is in essence:

   1. the value of the $encoding argument is used if present, otherwise

   2. if the file is being read using the HTTP protocol, get the encoding
from the HTTP headers, otherwise

   3. read the beginning of the file:

      3a: if there's a UTF-16 or UTF-8 byte order mark, assume it's correct,
otherwise

      3b: if there's something that looks like an [ASCII] XML declaration
with an encoding attribute, use that, otherwise

      3c: if the first four even-numbered bytes are zero, assume UTF-16BE,
otherwise

      3d: if the first four odd-numbered bytes are zero, assume UTF-16LE

   4. otherwise assume UTF-8.
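
For comparison, the sequence just described can be sketched in Python as
follows. This is an illustration of the steps as listed, not Saxon's actual
code: the function names and the regular expression are invented, and bytes
are counted from zero, so "even-numbered" means offsets 0, 2, 4 and 6.

```python
import codecs
import re
from typing import Optional

# Something that looks like an ASCII XML declaration with an encoding attribute.
_XML_DECL = re.compile(
    rb"""^<\?xml[^>]*?\bencoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)

def sniff_encoding(head: bytes) -> Optional[str]:
    """Steps 3a to 3d: inspect the first bytes of the file."""
    # 3a: trust a UTF-8 or UTF-16 byte order mark.
    if head.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "UTF-16BE"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "UTF-16LE"
    # 3b: use the encoding named in an ASCII XML declaration.
    match = _XML_DECL.match(head)
    if match:
        return match.group(1).decode("ascii")
    # 3c and 3d: look for zero bytes in the first eight bytes.
    if len(head) >= 8:
        if all(b == 0 for b in head[0:8:2]):
            return "UTF-16BE"
        if all(b == 0 for b in head[1:8:2]):
            return "UTF-16LE"
    return None

def choose_encoding_saxon_like(
    encoding_arg: Optional[str], http_charset: Optional[str], head: bytes
) -> str:
    """Steps 1, 2, 3 and 4 in the order listed above."""
    return encoding_arg or http_charset or sniff_encoding(head) or "UTF-8"

# Hypothetical file: saved as UTF-8, but the declaration says utf-16.
sample = b'<?xml version="1.0" encoding="utf-16"?><doc/>'
print(choose_encoding_saxon_like(None, None, sample))
print(choose_encoding_saxon_like("utf-8", None, sample))
```

With no $encoding argument, step 3b picks up the wrong declaration; passing
the encoding explicitly takes effect at step 1, which is why the workaround
helps.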

> Shouldn't (1) and (2) be switched (actually 2 and 3 in 
> specs), like in the Saxon implementation?

I think this is one of those cases where I implemented an early version of
the spec, hoping it would get better in time; as indicated above there has
been one welcome change which legitimizes some of the Saxon differences, but
the WG decided to stick with the rule that the HTTP header takes precedence
over the encoding attribute, and I will have to change to conform with that.
I don't really understand all the background here, but it's all to do with
browser history: popular browsers try to outguess the HTTP headers, and the
specs disapprove, and W3C is trying to hold its ground in the battle. It's a
bigger issue than XSLT, in other words.

Michael Kay
http://www.saxonica.com/
