Subject: RE: [xsl] Challenges with unparsed-text() and reading UTF-8 file as UTF-16
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Fri, 13 Oct 2006 00:45:08 +0100

> I use the
> document() function to read them (handy, I can read more
> files at once). But when this situation happens, there won't
> be an error, it just returns the empty set.

Actually, that's up to you: in Saxon it's configurable whether this error is recovered or not.

> So I
> thought: for analysis, read it also as text using
> unparsed-text(). But this wasn't so trivial as I first thought:
>
> Situation:
> 1. Guy saves an XML file as UTF-8, but prolog says "utf-16".
> 2. XSLT started. Some error condition is met. Trying to read
>    the file with unparsed-text()
> 3. Result: some beautiful Chinese characters
> 4. Result: trying to find out what goes
>    wrong and why appeared challenging, esp because file shows up
>    normal in text editor

You're being a bit too concise here, it's not clear to me what's going on.

>
> Perhaps this is wrong, perhaps this is right. I think the
> mistake lies in unparsed-text(). The XSLT specs are quite clear:
> 1) if you can read it as XML, try to read the encoding from it,
> 2) otherwise, use supplied encoding in function call
> 3) otherwise, use UTF-8

I think you're trying to abbreviate the text, which is reasonable, but in doing so you've misrepresented it. Rule 1 (which is actually rule 2 in the spec) does not say "if you can read it as XML", it says "if the media type of the resource is text/xml or application/xml (see [RFC2376]), or if it matches the conventions text/*+xml or application/*+xml (see [RFC3023] and/or its successors)". So if an HTTP server serves up a non-XML document with an application/xml media type, this rule is going to kick in.

>
> My question: why is the file read in the encoding specified in the
> (illegal) XML file?

If you're asking why the spec is as it is, the answer is (a) for compatibility with XInclude, and (b) for use cases where you want to use unparsed-text() to read XML/XHTML without parsing it.

>
> My resolution is a workaround.
> Because Saxon appears to
> follow a different order: (2) (1), (3), specifying the
> encoding by hand helps a bit (but I don't know the encoding
> in advance either, so I have to do several tests).

Firstly, there's been one late change to the spec in this area. It's now more permissive, it allows the processor to try harder. In particular, it allows (but does not require) the processor to use other heuristics, such as the presence of a byte order mark, or "magic numbers", etc, within the content of the file itself. Saxon in fact does take account of the byte order mark if present.

It would be useful here if we discuss the actual text of the rules in their original numbering:

1. external encoding information is used if available, otherwise
2. if the media type of the resource is text/xml or application/xml (see [RFC2376]), or if it matches the conventions text/*+xml or application/*+xml (see [RFC3023] and/or its successors), then the encoding is recognized as specified in [XML 1.0], otherwise
3. the value of the $encoding argument is used if present, otherwise
4. [new] the processor may use implementation-defined heuristics to determine the encoding, otherwise
5. UTF-8 is assumed.

It's true that Saxon doesn't currently implement this quite as written. The sequence currently followed by Saxon is in essence:

1. the value of the $encoding argument is used if present, otherwise
2. if the file is being read using the HTTP protocol, get the encoding from the HTTP headers, otherwise
3. read the beginning of the file:
   3a: if there's a UTF-16 or UTF-8 byte order mark, assume it's correct, otherwise
   3b: if there's something that looks like an [ASCII] XML declaration with an encoding attribute, use that
   3c: if the first four even-numbered bytes are zero, assume UTF-16BE
   3d: if the first four odd-numbered bytes are zero, assume UTF-16LE
4. otherwise assume UTF-8.

> Shouldn't (1) and (2) be switched (actually 2 and 3 in
> specs), like in the Saxon implementation?

I think this is one of those cases where I implemented an early version of the spec, hoping it would get better in time; as indicated above there has been one welcome change which legitimizes some of the Saxon differences, but the WG decided to stick with the rule that the HTTP header takes precedence over the encoding attribute, and I will have to change to conform with that.

I don't really understand all the background here, but it's all to do with browser history: popular browsers try to outguess the HTTP headers, and the specs disapprove, and W3C is trying to hold its ground in the battle. It's a bigger issue than XSLT, in other words.

Michael Kay
http://www.saxonica.com/
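
For anyone who wants to experiment with the detection order described in this message, here is a minimal Python sketch of the sequence attributed to Saxon (explicit $encoding argument, then HTTP headers, then a look at the first bytes, then UTF-8). It is not Saxon's code: the function name `guess_encoding`, its parameters, and the regular expression are all hypothetical, and "even-numbered" and "odd-numbered" bytes are assumed to be counted from zero.

```python
import codecs
import re

# Hypothetical pattern for "something that looks like an ASCII XML
# declaration with an encoding attribute" (step 3b). An assumption for
# illustration, not the test Saxon actually applies.
_XML_DECL = re.compile(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')


def guess_encoding(head, encoding_arg=None, http_charset=None):
    """Name an encoding for a resource, given its first bytes in `head`."""
    # Step 1: an explicit $encoding argument wins.
    if encoding_arg:
        return encoding_arg
    # Step 2: otherwise the charset from the HTTP headers, if there is one.
    if http_charset:
        return http_charset
    # Step 3a: a byte order mark is assumed to be correct.
    if head.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if head.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if head.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    # Step 3b: an ASCII XML declaration at the very start of the file.
    match = _XML_DECL.match(head)
    if match:
        return match.group(1).decode("ascii").lower()
    # Steps 3c and 3d: zero bytes at positions 0,2,4,6 or 1,3,5,7
    # (positions counted from zero, which is an assumption).
    if len(head) >= 8:
        if not any(head[0:8:2]):
            return "utf-16-be"
        if not any(head[1:8:2]):
            return "utf-16-le"
    # Step 4: fall back to UTF-8.
    return "utf-8"


# The situation from the thread: UTF-8 bytes whose prolog claims utf-16.
head = '<?xml version="1.0" encoding="utf-16"?><doc/>'.encode("utf-8")
print(guess_encoding(head))                        # trusts the declaration
print(guess_encoding(head, encoding_arg="utf-8"))  # the manual workaround
print(head[:4].decode("utf-16-le"))                # ASCII pairs read as CJK
```

Under this order the mislabelled file is decoded as UTF-16 unless the caller supplies an encoding by hand. Each pair of ASCII bytes then becomes a single code unit: read as little-endian UTF-16, "<?" and "xm" become U+3F3C and U+6D78, both in CJK ranges, which would account for the "Chinese characters" in the original report.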