Subject: [xsl] Challenges with unparsed-text() and reading UTF-8 file as UTF-16
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Fri, 13 Oct 2006 00:43:01 +0200

Hi List!

This may be a bit of an odd story. Luckily, because of a flaw in Saxon's implementation of unparsed-text(), there is a workaround, but only until Saxon fixes it (maybe I shouldn't report it and just let it be? maybe it is not even a bug but a deliberately liberal design?). Here it goes:

Sometimes it happens that one of a dozen or so XML files is saved as UTF-8, but with "utf-16" declared in the prolog. I use the document() function to read them (handy: I can read several files at once). When this mismatch occurs there is no error; document() simply returns the empty sequence. So I thought: for analysis, read the file as text with unparsed-text() as well. But that turned out to be less trivial than I first thought:

1. Someone saves an XML file as UTF-8, but the prolog says "utf-16".
2. The XSLT runs; some error condition is met; we try to read the file with unparsed-text().
3. Result: some beautiful Chinese characters.
4. Result: finding out what went wrong, and why, proved challenging, especially because the file looks perfectly normal in a text editor.
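The "beautiful Chinese characters" in step 3 are easy to reproduce at the byte level. Here is a small Python sketch (nothing to do with Saxon's internals) of what decoding UTF-8 bytes as UTF-16 does:

```python
# A UTF-8 file whose XML declaration (wrongly) claims utf-16.
# The trailing newline makes the byte count even, so a UTF-16
# decode is possible at all.
data = '<?xml version="1.0" encoding="utf-16"?>\n'.encode("utf-8")

# Decoding those UTF-8 bytes as UTF-16 (little-endian) pairs up
# adjacent ASCII bytes into single 16-bit code units, most of which
# land in the CJK blocks -- hence the "Chinese characters".
garbled = data.decode("utf-16-le")
print(garbled)  # unreadable mostly-CJK text; no ASCII survives
```

Every pair of printable ASCII bytes combines into a code unit well above U+007F, so not a single readable character is left.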

Perhaps this is wrong, perhaps this is right. I think the mistake lies in unparsed-text(). The XSLT spec seems quite clear about how the encoding is determined:
1) if you can read the resource as XML, try to take the encoding from it;
2) otherwise, use the encoding supplied in the function call;
3) otherwise, use UTF-8.

My question: why is the file read in the encoding specified in the (incorrect) XML declaration? If (1) fails, and (2) is absent, then (3) should apply. But (3) is never reached...

My guess is: the function tries too hard. Once it encounters the encoding declaration in the prolog, it switches encoding, which turns both the declaration and the rest of the file into total rubbish.

My resolution is a workaround. Because Saxon appears to follow a different order, (2), (1), (3), specifying the encoding by hand helps a bit (but since I don't know the encoding in advance either, I have to run several tests).
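What those "several tests" amount to can be sketched as a hypothetical try-each-candidate loop (Python for brevity; in XSLT 2.0 the same guard can be written by calling unparsed-text-available($href, $enc) once per candidate encoding):

```python
def sniff_text(data: bytes) -> tuple[str, str]:
    """Hypothetical workaround sketch: since the encoding is not known
    in advance, try a few candidates and keep the first decode that
    both succeeds and yields no stray NULs (the tell-tale sign of a
    UTF-16 file read one byte at a time)."""
    for enc in ("utf-8", "utf-16", "iso-8859-1"):
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        if "\x00" not in text:
            return enc, text
    raise ValueError("no candidate encoding worked")
```

The candidate list and the NUL heuristic are my own assumptions; the point is only that the caller, not the function, ends up doing the detection.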

By the way, for similar reasons, a file saved as UTF-16 will fail to load if its prolog says 'utf-8'. But in that case null characters get in the way and a dynamic error is thrown (not with the document() function, though).
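The reverse mismatch can also be shown at the byte level; a Python sketch (again, not Saxon's code) of why this direction fails loudly instead of silently:

```python
text = '<?xml version="1.0" encoding="utf-8"?><doc/>'

# Saved as UTF-16 (with the usual byte order mark), but the prolog
# claims utf-8.
data = text.encode("utf-16")

# The BOM bytes 0xFF/0xFE can never occur in valid UTF-8, so a strict
# UTF-8 decode fails outright ...
try:
    data.decode("utf-8")
except UnicodeDecodeError:
    print("hard decode error")

# ... and even without a BOM, every ASCII character drags a 0x00 byte
# along, and NUL characters are forbidden in XML documents.
no_bom = text.encode("utf-16-le")
assert "\x00" in no_bom.decode("utf-8")
```

So where the UTF-8-as-UTF-16 case degrades silently into mojibake, the UTF-16-as-UTF-8 case trips over illegal bytes or NULs almost immediately.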

I am not sure whether I interpret the spec correctly. My main concern is: can I give end users enough detail when something fails? I had this particular scenario at hand, and it took us quite some time to find out why our app stayed silent. (The Java DOM Level 3 LS serializer had made a mistake (well, we had) and written the file in the wrong encoding.)

Clearly this is an edge case, but it has nevertheless happened a few times already. Does anybody have any thoughts on this? Is the spec clear enough? Shouldn't (1) and (2) be swapped (actually (2) and (3) in the spec), as in the Saxon implementation?

-- Abel Braaksma
