RE: [xsl] unparsed-text() and illegal characters

Subject: RE: [xsl] unparsed-text() and illegal characters
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Thu, 27 Jul 2006 20:21:40 +0100
The spec is very strict that characters not allowed in XML cause an error.
This is a change since the book was written.

However, the spec is very loose about how URIs are resolved. So a conformant
product could take the URI

thing.txt?substitute-illegal-chars=FFFD

as a reference to "the document formed by taking thing.txt and substituting
illegal characters with xFFFD."

Perhaps I'll do that.
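
In the meantime, the substitution described above can be approximated outside the stylesheet as a small pre-processing step. The following is only a minimal sketch in Python, not anything Saxon does or is committed to doing. The function names `substitute_illegal_chars` and `read_text_for_xml` are hypothetical, the character class is taken from the Char production of XML 1.0, and ISO-8859-1 is assumed as the encoding because that is what the original poster reports.

```python
import re

# Anything outside the XML 1.0 Char production is treated as illegal:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML10 = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def substitute_illegal_chars(text, replacement="\uFFFD"):
    """Replace every character that XML 1.0 does not allow.

    Hypothetical helper: by default each illegal character becomes
    U+FFFD, the Unicode replacement character.
    """
    return _ILLEGAL_XML10.sub(replacement, text)


def read_text_for_xml(path, encoding="iso-8859-1"):
    """Read a text file and make its content safe to place in XML 1.0.

    The encoding default is an assumption based on the original post.
    """
    with open(path, encoding=encoding) as handle:
        return substitute_illegal_chars(handle.read())


# A control character such as hex 11 is replaced; tab and newline are kept.
cleaned = substitute_illegal_chars("abc\x11def\tghi\n")
```

Writing the result of `read_text_for_xml("thing.txt")` back to a file would give something that unparsed-text() can read without raising the error, at the cost of the extra filter step the original poster hoped to avoid.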

Michael Kay
http://www.saxonica.com/

 

> -----Original Message-----
> From: Abel Braaksma Online [mailto:abel.online@xxxxxxxxx] 
> Sent: 27 July 2006 19:10
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] unparsed-text() and illegal characters
> 
> Dear List,
> 
> Trying to "import" a non-XML file of an undefined encoding, I 
> received the following error when using Saxon8: "The unparsed 
> text file contains a character illegal in XML (line=1 
> column=4 value=hex 11)". I only found one reference about 
> this error 
> (http://www.stylusstudio.com/xsllist/200510/post90470.html), 
> which is actually a post about illegal characters inside the 
> XSLT document.
> 
> Michael Kay points out in that post that this error is merged 
> into XTDE1190 (see 
> http://www.w3.org/TR/xslt20/#err-XTDE1190). It is claimed in 
> the specs that non-understood characters or byte sequences 
> should result in this non-recoverable dynamic error.
> 
> In his indispensable book, the XSLT 2.0 Programmer's 
> Reference, he states the following:
> "Some processors will provide configuration options that pass 
> this choice on to the user. If the file contains characters that 
> are invalid in XML (this applies to most control characters 
> in the range x00 to x1F under XML 1.0, but only to the null 
> character x00 under XML 1.1) then the invalid characters are 
> substituted by the special Unicode character xFFFD, which is 
> specifically intended for such purposes."
> 
> I understand that the book was written before XSLT 2.0 was 
> finalized (it is still a Candidate), but I wonder if a 
> treatment like the above is still possible somehow. The contents 
> of the file are ISO-8859-1, apart from the start and end 
> headers, which contain control characters. I only need the 
> part that is parsable as text; the rest can be dismissed.
> 
> Am I asking too much from XSLT, or is this somehow possible? 
> It would really add to the possibilities, and it means I 
> don't need some extra filter or preparse step.
> 
> Cheers,
> Abel Braaksma
> www.nuntia.nl
