[xsl] unparsed-text() and illegal characters

Subject: [xsl] unparsed-text() and illegal characters
From: Abel Braaksma Online <abel.online@xxxxxxxxx>
Date: Thu, 27 Jul 2006 20:10:22 +0200

Dear List,

While trying to "import" a non-XML file of an undefined encoding, I received the following error from Saxon8: "The unparsed text file contains a character illegal in XML (line=1 column=4 value=hex 11)". I found only one reference to this error (http://www.stylusstudio.com/xsllist/200510/post90470.html), and that post is actually about illegal characters inside the XSLT document itself.

Michael Kay points out in that post that this error is merged into XTDE1190 (see http://www.w3.org/TR/xslt20/#err-XTDE1190). According to the spec, characters or byte sequences that the processor cannot interpret must result in this non-recoverable dynamic error.

In his indispensable book, the XSLT 2.0 Programmer's Reference, he states the following:
"Some processors will provide configuration options that pass this choice on the user. If the file contains characters that are invalid in XML (this applies to most control characters in the range x00 to x1F under XML 1.0, but only to the null character x00 under XML 1.1) then the invalid characters are substituted by the special Unicode character xFFFD, which is specifically intended for such purposes."


I understand that the book was written before XSLT 2.0 was finalized (it is still a Candidate Recommendation), but I wonder whether a treatment like the one quoted above is still possible somehow. The content of the file is ISO-8859-1, apart from the start and end headers, which contain control characters. I only need the part that is parsable as text; the rest can be discarded.

Am I asking too much of XSLT, or is this somehow possible? It would really add to the possibilities, and it would mean I don't need an extra filter or pre-parse step.
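
For comparison, the kind of external filter step I am hoping to avoid would look roughly like the sketch below. It is in Python rather than XSLT, and it is only an illustration of the substitution described in the quote, not how Saxon or any other processor handles it. The function name and its parameters are made up for this example, and I am assuming the whole file can be decoded as ISO-8859-1 and checked against the XML 1.0 character rules.

```python
import re

# Characters outside the XML 1.0 "Char" production: C0 controls other than
# tab, line feed and carriage return, plus surrogates and U+FFFE/U+FFFF.
_ILLEGAL_XML10 = re.compile(
    r"[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]"
)


def sanitize_for_xml10(raw, encoding="iso-8859-1", replacement="\ufffd"):
    """Decode raw bytes and substitute every character illegal in XML 1.0.

    Hypothetical helper: pass replacement="" to drop the characters
    instead of substituting U+FFFD.
    """
    text = raw.decode(encoding)
    return _ILLEGAL_XML10.sub(replacement, text)


if __name__ == "__main__":
    # Made-up sample: control characters around ordinary Latin-1 text.
    sample = b"\x11\x02Hello\tW\xf6rld\n\x00"
    print(repr(sanitize_for_xml10(sample)))
    print(repr(sanitize_for_xml10(sample, replacement="")))
```

The cleaned string could then be written out as UTF-8 and read with unparsed-text($href, 'utf-8'), but that is exactly the extra step I would like to do without.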

Cheers,
Abel Braaksma
www.nuntia.nl
