Subject: RE: [xsl] Unparse-text() string contains ascii chars 29, 30 and 31
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Wed, 19 Oct 2005 17:43:56 +0100
You might be able to make this work by using an XML 1.1 parser (specifying
version="1.1" in the XML declaration). The current Saxon release is a bit
patchy in its support for XML 1.1 (I've been doing some improvements so it
should be better in 8.6) but the basics are there. XML 1.1 allows characters
in the range x01 to x1F provided they are written as character references.
The only character not allowed is 0, which was the result of a coalition
between people who wanted to prevent you holding pure binary, and people who
want to write their software in C.
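
A minimal sketch of that approach, assuming an XSLT 2.0 processor whose
parser accepts XML 1.1. The parameter name input-uri, the file name
data.txt and the first-unit element are made up for illustration, and
whether unparsed-text() itself accepts these characters may depend on the
processor:

```xml
<?xml version="1.1"?>
<!-- The version="1.1" declaration above is what permits character
     references such as &#31; in this stylesheet. -->
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: location of the delimited text file -->
  <xsl:param name="input-uri" select="'data.txt'"/>

  <!-- Hypothetical entry point, invoked as the initial named template -->
  <xsl:template name="main">
    <xsl:variable name="str" select="unparsed-text($input-uri)"/>
    <first-unit>
      <!-- &#31; is ASCII 31, the unit separator, written as a
           character reference rather than as a literal character -->
      <xsl:value-of select="substring-before($str, '&#31;')"/>
    </first-unit>
  </xsl:template>

</xsl:stylesheet>
```

The control characters must appear as references; a literal character 31
in the stylesheet is still rejected under XML 1.1.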

substring-before is more likely to work than tokenize, because
substring-before allows any string (any string that you can get through the
XML parser, that is), whereas regexes have their own rules and another layer
of parsing. If necessary use translate() to translate the C0 control
characters into PUA Unicode characters, which are legal in a regex.
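
A sketch of the translate() route, again assuming an XML 1.1 stylesheet and
an XSLT 2.0 processor. The choice of U+E01D, U+E01E and U+E01F as stand-ins,
the parameter name input-uri and the output element names are all
illustrative, and it assumes those three PUA characters never occur in the
data:

```xml
<?xml version="1.1"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: location of the delimited text file -->
  <xsl:param name="input-uri" select="'data.txt'"/>

  <xsl:template name="main">
    <xsl:variable name="str" select="unparsed-text($input-uri)"/>

    <!-- Map ASCII 29, 30, 31 (group, record, unit separators) onto
         three arbitrarily chosen Private Use Area characters -->
    <xsl:variable name="safe"
        select="translate($str, '&#29;&#30;&#31;',
                                '&#xE01D;&#xE01E;&#xE01F;')"/>

    <groups>
      <!-- The PUA characters have no special meaning in a regex,
           so they can be used directly as tokenize() patterns -->
      <xsl:for-each select="tokenize($safe, '&#xE01D;')">
        <group>
          <xsl:for-each select="tokenize(., '&#xE01E;')">
            <record>
              <xsl:for-each select="tokenize(., '&#xE01F;')">
                <unit><xsl:value-of select="."/></unit>
              </xsl:for-each>
            </record>
          </xsl:for-each>
        </group>
      </xsl:for-each>
    </groups>
  </xsl:template>

</xsl:stylesheet>
```

After the translate() call no C0 control characters remain in $safe, so
the result can be serialized as ordinary XML 1.0.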

Michael Kay
http://www.saxonica.com/

> -----Original Message-----
> From: andrew welch [mailto:andrew.j.welch@xxxxxxxxx] 
> Sent: 19 October 2005 16:50
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: [xsl] Unparse-text() string contains ascii chars 29, 
> 30 and 31
> 
> I'm trying to process some data that's one long string delimited using
> ascii characters 29, 30 and 31 (which are apparently group, record and
> unit 'separator characters').
> 
> I can get access to the string using unparsed-text(), but when I
> attempt to process the string using any of the functions, e.g.:
> 
> tokenize($str, '&#29;')
> 
> or
> 
> substring-before($str, '&#31;')
> 
> ...the XML parser complains that these aren't legal XML characters
> (when the stylesheet itself is parsed).
> 
> Is there any way around this?  I can't see how I can process the
> string in XSLT without using the characters themselves.
> 
> The two alternatives I can see are to use an XMLFilter to turn it
> into XML using Java, or to go back to the source to get them to export
> their data in a less archaic way...
