Re: [xsl] Challenges with unparsed-text() and reading UTF-8 file as UTF-16
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Fri, 13 Oct 2006 02:28:45 +0200
Once more, many thanks for a quick and thorough reply. You must have 48
hours in a day ;-)
Please see my additions below.
Michael Kay wrote:
Actually, that's up to you: in Saxon it's configurable whether this error is
recovered or not.
Thanks, I'll look it up, it may come in very handy. I wonder though what
the result will be when calling it on a node set, like this:
<xsl:copy-of select="document($configuration//resource/@url)" />
when some of the @url are not pointing to valid resources.
You're being a bit too concise here, it's not clear to me what's going on.
I see. Actually, I was trying to keep my story to a readable size. But
to get to the actual problem, I think the following describes it more
clearly (a test file I used to find out what was going on):
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:local="urn:local">
    <!-- the namespace URI for local: is just a placeholder -->

    <xsl:output indent="yes" />

    <xsl:template match="/">
        <xsl:copy-of select="local:getfile('testUTF8.xml', ())" />
        <xsl:copy-of select="local:getfile('testUTF16.xml', ())" />
        <xsl:copy-of
            select="local:getfile('testUTF8-with-16-in-prolog.xml', ())" />
        <xsl:copy-of
            select="local:getfile('testUTF16-with-8-in-prolog.xml', ())" />
        <xsl:copy-of select="local:getfile('testUTF8.xml', 'utf-16')" />
        <xsl:copy-of select="local:getfile('testUTF16.xml', 'utf-8')" />
        <xsl:copy-of
            select="local:getfile('testUTF8-with-16-in-prolog.xml', 'utf-8')" />
        <xsl:copy-of
            select="local:getfile('testUTF16-with-8-in-prolog.xml', 'utf-16')" />
        <xsl:value-of select="unparsed-text('testUTF8.xml')" />
        <xsl:value-of select="unparsed-text('testUTF16.xml')" />
        <xsl:copy-of select="document('testUTF8-with-16-in-prolog.xml')" />
        <xsl:copy-of select="document('testUTF16-with-8-in-prolog.xml')" />
    </xsl:template>

    <xsl:function name="local:getfile">
        <xsl:param name="filename" />
        <xsl:param name="encoding" />
        <xsl:variable name="unp-available" as="xs:boolean"
            select="if (empty($encoding))
                    then unparsed-text-available($filename)
                    else unparsed-text-available($filename, $encoding)" />
        <available><xsl:value-of select="$unp-available" /></available>
    </xsl:function>

</xsl:stylesheet>
In fact, this runs a series of tests. The last four value-of/copy-of
calls may fail and throw an error. The first eight (with getfile) should
never throw an error.
This is what actually happened this morning: one of our programmers was
testing and suddenly the export part of the system produced nothing, or
inconclusive results. After quite a while, we found out that the XML
serialization method we used had changed. We had switched to DOM Level 3,
using the latest version of Xerces, to make use of the new Load/Save
additions (which in turn was done to get rid of the rather ridiculous
way of dealing with namespace serialization when we serialized it the
old way).

The output file looked perfect and as such had a header not unfamiliar
to people with some XML knowledge. The content was perfectly well formed:

<?xml version="1.0" encoding="UTF-16"?>

BUT! (after more research) The application failed because this file was
actually serialized to disk as UTF-8. So, the XML prolog and the
actual encoding did not match.
Because we already used unparsed-text(), I was very surprised to find
out that that particular function tried to read the file as XML. Reading
the specs, though, that behaviour is indeed required. But it yielded some
unexpected results.
If you would like to run the tests yourself, it is easy enough to create
the malformed XML files. Just create them the normal way, open them in a
Unicode-aware editor (NOT an XML-aware editor!) and save as UTF-8 when
UTF-16 is in the prolog, and vice versa (or I can send the malformed set
and/or place it online).
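The same mismatched files can also be produced programmatically; here is a
minimal Python sketch (the file names follow the test stylesheet above, the
document content is made up, and no BOM is written, matching my files):

```python
# Create test files whose XML declaration contradicts the actual byte
# encoding. File names follow the test stylesheet; content is arbitrary.

# UTF-8 bytes, but the declaration claims UTF-16:
decl_16 = '<?xml version="1.0" encoding="UTF-16"?>\n<root>text</root>'
with open("testUTF8-with-16-in-prolog.xml", "wb") as f:
    f.write(decl_16.encode("utf-8"))

# UTF-16 (big-endian, no BOM) bytes, but the declaration claims UTF-8:
decl_8 = '<?xml version="1.0" encoding="UTF-8"?>\n<root>text</root>'
with open("testUTF16-with-8-in-prolog.xml", "wb") as f:
    f.write(decl_8.encode("utf-16-be"))
```

Note that the "utf-16-be" codec deliberately omits the byte order mark;
Python's plain "utf-16" codec would prepend one, which would let a processor
detect the real encoding and hide the problem.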
I think you're trying to abbreviate the text, which is reasonable, but in
doing so you've misrepresented it.
Well, actually, I wanted to simplify the discussion. But, reading your
comments, I understand that that's not feasible in this context.
Rule 1 (which is actually rule 2 in the
spec) does not say "if you can read it as XML", it says "if the media type
of the resource is text/xml or application/xml (see [RFC2376]), or if it
matches the conventions text/*+xml or application/*+xml (see [RFC3023]
and/or its successors)". So if an HTTP server serves up a non-XML document
with an application/xml media type, this rule is going to kick in.
Well, no HTTP server here. Just a text file read in from disk. But I
guess the media types as registered by the operating system also count?
My question: why is the file read in the encoding specified in the
(illegal) XML file?
If you're asking why the spec is as it is, the answer is (a) for
compatibility with XInclude, and (b) for use cases where you want to use
unparsed-text() to read XML/XHTML without parsing it.
No, I am basically trying to find out what to expect in this odd
circumstance (and since users may add their own files, this scenario may
very well happen again).
Firstly, there's been one late change to the spec in this area. It's now
more permissive, it allows the processor to try harder.
I didn't read that from the specs. But then, I still haven't read every
corner of it ;-)
4. [new] the processor may use implementation-defined heuristics to
determine the encoding, otherwise
I looked here: http://www.w3.org/TR/xslt20/#unparsed-text and it's not
yet added. So, this is *very* hot off the press?
It's true that Saxon doesn't currently implement this quite as written.
Lucky, lucky me in this scenario, but I am afraid you will change that
later, so I can't (and should not) rely on it.
The sequence currently followed by Saxon is in essence:
1. the value of the $encoding argument is used if present, otherwise
2. if the file is being read using the HTTP protocol, get the encoding
from the HTTP headers, otherwise

Can't verify. I use a URI (see the earlier discussion) with 'file://' etc.
So no HTTP here.
3. read the beginning of the file:

Checked: there is no BOM in either broken file (they start with 3C 3F
resp. 00 3C 00 3F, which is the standard start of '<?' in either
encoding).
3a: if there's a UTF-16 or UTF-8 byte order mark, assume it's correct,
3b: if there's something that looks like an [ASCII] XML declaration
with an encoding attribute, use that

This is where things go wrong, I think. It appears as if Saxon indeed
finds the XML declaration in either file, and uses it. To my surprise,
it does not check the result of this, which is illegal XML: it tries to
read a UTF-8 encoded file as UTF-16 because the (UTF-8/ASCII) XML
declaration says UTF-16. That cannot be correct if you can read the
prolog as UTF-8; the two exclude each other. So, 3b is only partially
feasible, I think.
3c: if the first four even-numbered bytes are zero, assume UTF-16BE

That's my case in the other scenario, where a UTF-16 file has an XML
declaration with UTF-8 in it. Again (see before) this poses a
contradiction: the XML declaration cannot be read as UTF-16, contain (in
UTF-16) a declaration saying UTF-8, and then suddenly be UTF-8.
3d: if the first four odd-numbered bytes are zero, assume UTF-16LE

It never gets here if the XML declaration is malformed.
4. otherwise assume UTF-8.
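To make the discussion concrete, here is a rough Python paraphrase of that
detection order (my own sketch, NOT Saxon's actual code; the function name
and the regular expression are my own, and step 2, the HTTP headers, is
omitted since there is no HTTP here):

```python
import re

def sniff_encoding(data, declared=None):
    """Rough paraphrase of the detection order above: explicit argument,
    BOM, ASCII-looking XML declaration, zero-byte patterns, UTF-8 default."""
    if declared:                                  # 1. $encoding argument wins
        return declared
    if data.startswith(b"\xef\xbb\xbf"):          # 3a. UTF-8 BOM
        return "utf-8"
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):    # 3a. UTF-16 BOM
        return "utf-16"
    m = re.match(rb'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    if m:                                         # 3b. believe the declaration
        return m.group(1).decode("ascii").lower()
    if data[0:8:2] == b"\x00\x00\x00\x00":        # 3c. even bytes zero: UTF-16BE
        return "utf-16-be"
    if data[1:8:2] == b"\x00\x00\x00\x00":        # 3d. odd bytes zero: UTF-16LE
        return "utf-16-le"
    return "utf-8"                                # 4. default

# The problem case: UTF-8 bytes whose declaration claims UTF-16. Step 3b
# matches the ASCII-readable declaration and reports the wrong encoding,
# without noticing the contradiction.
mismatched = '<?xml version="1.0" encoding="UTF-16"?>'.encode("utf-8")
print(sniff_encoding(mismatched))   # utf-16
```

Decoding those UTF-8 bytes as UTF-16 then produces garbage rather than an
error: for plain ASCII content, every byte pair happens to form a valid
UTF-16 code unit, which matches the inconclusive results we saw.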
I don't really understand all the background here, but it's all to do with
browser history: popular browsers try to outguess the HTTP headers, and the
specs disapprove, and W3C is trying to hold its ground in the battle. It's a
bigger issue than XSLT, in other words.
Sounds like a lot of politics to me. I never knew that such a tiny thing
could come from such a huge factor ;-)
Thanks and cheers,
-- Abel Braaksma