Dear List,
Trying to "import" a non-XML file of an undefined encoding, I received
the following error when using Saxon8: "The unparsed text file contains
a character illegal in XML (line=1 column=4 value=hex 11)". I found only
one reference to this error
(http://www.stylusstudio.com/xsllist/200510/post90470.html), and that
post is actually about illegal characters inside the XSLT document
itself.
Michael Kay points out in that post that this error has been merged into
XTDE1190 (see http://www.w3.org/TR/xslt20/#err-XTDE1190). According to
the spec, octets that cannot be decoded, or decoded characters that are
not permitted in XML, result in this non-recoverable dynamic error.
In his indispensable book, the XSLT 2.0 Programmer's Reference, he
states the following:
"Some processors will provide configuration options that pass this
choice on the user. If the file contains characters that are invalid in
XML (this applies to most control characters in the range x00 to x1F
under XML 1.0, but only to the null character x00 under XML 1.1) then
the invalid characters are substituted by the special Unicode character
xFFFD, which is specifically intended for such purposes."
I understand that the book was written before XSLT 2.0 was finalized
(the spec is still a Candidate Recommendation), but I wonder whether a
treatment like the one described above is still possible somehow. The
content of the file is ISO-8859-1, apart from the headers at the start
and end, which contain control characters. I only need the part that is
parsable as text; the rest can be discarded.
Am I asking too much from XSLT, or is this somehow possible? It would
really add to the possibilities, and it would mean I don't need an extra
filter or pre-parse step.
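
For comparison, the kind of pre-parse filter I would like to avoid could
look roughly like the Python sketch below. It decodes the bytes as
ISO-8859-1 and substitutes xFFFD for every character outside the XML 1.0
Char production, as the quoted passage describes. The function name
decode_for_xml and its replacement parameter are made up for
illustration; this is not something Saxon or XSLT provides.

```python
import re

# Matches any character NOT allowed by the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML10 = re.compile(
    r"[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def decode_for_xml(raw: bytes, replacement: str = "\uFFFD") -> str:
    """Decode ISO-8859-1 bytes and replace characters illegal in XML 1.0.

    Hypothetical helper: 'replacement' defaults to U+FFFD; pass an empty
    string to drop the illegal characters instead.
    """
    # ISO-8859-1 maps every byte to a code point, so decoding cannot fail.
    text = raw.decode("iso-8859-1")
    return _ILLEGAL_XML10.sub(replacement, text)


if __name__ == "__main__":
    # Made-up sample: control-character header, Latin-1 text, control trailer.
    sample = b"\x01\x02\x03\x11caf\xe9 au lait\x1f"
    print(repr(decode_for_xml(sample)))
    print(repr(decode_for_xml(sample, "")))
```

Passing an empty string as the replacement drops the offending
characters rather than substituting them, which is closer to what I
need for the header and trailer.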
Cheers,
Abel Braaksma
www.nuntia.nl