Hi List!
This may be a bit of an odd story. Luckily, because of a flaw in
Saxon's implementation of unparsed-text(), there's a workaround. But
that only lasts until Saxon fixes it (maybe I shouldn't report it and
just let it be? Maybe it is not even a bug but a deliberately liberal
design?). Here it goes:
Sometimes one of a dozen or so XML files is saved as UTF-8, but with
UTF-16 declared in the prolog. I use the document() function to read
them (handy, I can read several files at once). But when this situation
occurs, there is no error; it just returns the empty set. So I
thought: for analysis, read the file as text as well, using
unparsed-text(). But this wasn't as trivial as I first thought:
Situation:
1. Someone saves an XML file as UTF-8, but the prolog says "utf-16".
2. The XSLT is started. Some error condition is met. It tries to read
the file with unparsed-text().
3. Result: some beautiful Chinese characters.
4. Result: finding out what went wrong, and why, proved challenging,
especially because the file shows up normally in a text editor.
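
To show where the Chinese characters in step 3 come from, here is a
minimal sketch outside XSLT, in Python. The sample document is made up,
and I am assuming the reader picks little-endian UTF-16 when there is no
byte order mark; it only illustrates the byte-level effect, not what
Saxon does internally.

```python
# Hypothetical sample: an XML file whose declaration claims UTF-16.
text = '<?xml version="1.0" encoding="utf-16"?><doc>hello</doc>'

# Step 1: the file is actually written to disk as UTF-8 (no BOM).
raw = text.encode("utf-8")

# Step 3: a reader that trusts the declaration decodes those bytes as
# UTF-16. Every pair of ASCII bytes becomes one 16-bit code unit.
# errors="replace" covers a possible odd trailing byte.
garbled = raw.decode("utf-16-le", errors="replace")


def cjk_count(s):
    """Count characters in the main CJK ideograph ranges."""
    return sum(1 for ch in s if 0x3400 <= ord(ch) <= 0x9FFF)


print(garbled[:12])
print(cjk_count(garbled), "of", len(garbled), "characters are CJK ideographs")
```

The first two bytes, "<" (0x3C) and "?" (0x3F), combine into the single
code unit 0x3F3C, which is a CJK ideograph, and the same happens for
most of the other byte pairs.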
Perhaps this is wrong, perhaps this is right. I think the mistake lies
in unparsed-text(). The XSLT specs are quite clear:
1) if you can read it as XML, try to read the encoding from it,
2) otherwise, use supplied encoding in function call
3) otherwise, use UTF-8
My question: why is the file read in the encoding specified in the
(illegal) XML file? If (1) fails, and (2) is not there, then (3) should
be used. But (3) is never reached...
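
To make the order I expected concrete, here is a rough sketch in Python.
It follows my reading (1), (2), (3) above and is not a quotation from
the spec or a picture of any implementation. The name resolve_encoding
and the check "the decoded text must still start with <?xml" are my own
assumptions about what "can be read as XML" could mean.

```python
import re

# Finds the encoding pseudo-attribute of an XML declaration in raw bytes.
# This only works when the bytes are ASCII-compatible (e.g. UTF-8),
# which is an assumption of this sketch.
DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')


def resolve_encoding(raw, supplied=None):
    """Pick an encoding in the order (1) declared, (2) supplied, (3) UTF-8."""
    # (1) Use the declared encoding, but only if the bytes really decode
    # to something that still looks like an XML declaration.
    m = DECL.match(raw)
    if m:
        declared = m.group(1).decode("ascii")
        try:
            if raw.decode(declared).startswith("<?xml"):
                return declared
        except (LookupError, UnicodeDecodeError):
            pass  # unknown codec or undecodable bytes: (1) has failed
    # (2) Otherwise, use the encoding supplied by the caller, if any.
    if supplied is not None:
        return supplied
    # (3) Otherwise, fall back to UTF-8.
    return "utf-8"
```

With a UTF-8 file that declares "utf-16", step (1) fails in this sketch
and the function ends up at (2) or (3), which is the behaviour I had
expected.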
My guess is: the function tries too hard. Once it encounters the
encoding declaration in the prolog of the XML file, it switches
encoding, leaving both the prolog and the rest of the file as total
rubbish.
My resolution is a workaround. Because Saxon appears to follow a
different order, (2), (1), (3), specifying the encoding by hand helps a
bit (but I don't know the encoding in advance either, so I have to try
several).
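
The "several tests" idea, sketched in Python rather than XSLT: try a
list of candidate encodings and keep the first one whose result still
starts with an XML declaration. The function name guess_encoding and the
candidate list are hypothetical, and files without an XML declaration
are not handled.

```python
def guess_encoding(raw, candidates=("utf-8", "utf-16", "iso-8859-1")):
    """Return the first candidate under which raw decodes to text that
    starts with an XML declaration, or None if no candidate fits."""
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        # Ignore a leading byte order mark before checking the prolog.
        if text.lstrip("\ufeff").startswith("<?xml"):
            return enc
    return None


# A UTF-8 file that wrongly declares UTF-16 is still recognised as UTF-8.
sample = '<?xml version="1.0" encoding="utf-16"?><doc/>'.encode("utf-8")
print(guess_encoding(sample))
```

The order of the candidates matters: iso-8859-1 accepts any byte
sequence, so it belongs at the end of the list.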
By the way, for similar reasons, a file saved as UTF-16 will fail to
load if its prolog says 'utf-8'. But this time, null characters are in
the way and a dynamic error is thrown (not with the document() function,
though).
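
The null characters are easy to reproduce. A small Python sketch, again
with a made-up sample and assuming the file was written as little-endian
UTF-16 without a byte order mark:

```python
# Hypothetical sample: the declaration claims UTF-8.
text = '<?xml version="1.0" encoding="utf-8"?><doc/>'

# The file is actually written as UTF-16 (little-endian, no BOM here).
raw = text.encode("utf-16-le")

# Decoding as UTF-8 succeeds, because a zero byte is valid UTF-8,
# but every ASCII character is now followed by U+0000.
as_utf8 = raw.decode("utf-8")
nul_count = as_utf8.count("\x00")

print(repr(as_utf8[:10]))
print(nul_count, "null characters for", len(text), "original characters")
```

U+0000 is not an allowed character in XML 1.0, so it makes sense that
something complains at this point. If the file starts with a UTF-16
byte order mark, decoding as UTF-8 fails even earlier, because the BOM
bytes are not valid UTF-8.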
I'm not sure I am interpreting the specs correctly. My main concern is:
can I provide the end users with enough detail when something fails? I
had this particular scenario at hand and it took us quite some time to
find out why our app was silent. (The Java DOM L3 LS serializer had made
a mistake (well, we did) and wrote in the wrong encoding.)
Clearly this is an edge case, but it has nevertheless happened a few
times already. Does anybody have any thoughts on this? Is the spec clear
enough? Shouldn't (1) and (2) be switched (actually 2 and 3 in the
specs), like in the Saxon implementation?
Cheers,
-- Abel Braaksma
http://www.nuntia.com