Re: [xsl] Need to remove unusual character in source

Subject: Re: [xsl] Need to remove unusual character in source
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Wed, 27 Sep 2006 20:12:11 +0200
David Carlisle wrote:
Unfortunately, that says it all. Control characters are not allowed in UTF-8 and as a result, are not allowed in XML, when the encoding is UTF-8 (making XML not well-formed)

Not so, utf8 can encode control characters, but they are not allowed in XML 1.0 (whatever the encoding)

David

Colin Adams wrote:
Unfortunately, that says it all. Control characters are not allowed in UTF-8 and as a result,

Oh yes they are!

You are all so alert! Like I said to Florent earlier today: I shouldn't post too late anymore. Yet, reading these posts, I had to look it up to find out the details, just out of curiosity. The Unicode Standard 4.0 (I know, XML requires at least v3.1) says in chapter 15.1, and I quote:


"There are 65 code points set aside in the Unicode Standard for compatibility with the C0 and C1 control codes [....] U+0000 - U+001F, U+007F, U+0080 - U+009F."

Reading on reveals that in UTF-8 the control codes up to U+007F are represented as a single byte holding their own value (<03> for x03 etc.), padded with one NUL byte in UTF-16 and three NUL bytes in UTF-32. Meaning that a x08 byte is indeed legal in UTF-8 (note that for the higher range, U+0080 - U+009F, UTF-8 encodes to a two-byte sequence).
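
For anyone who wants to check this at the byte level, here is a rough Python sketch (not part of the original exchange). It dumps a few control characters in the three encodings and strips whatever falls outside the XML 1.0 "Char" production, which is the point David made. The names encoded_hex, strip_xml10_illegal and XML10_ILLEGAL are made up for this illustration.

```python
import re

# XML 1.0 "Char" production: tab, LF, CR, then U+0020 and up,
# excluding the surrogate block and U+FFFE/U+FFFF.
XML10_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def encoded_hex(cp):
    """Return the hex bytes of one code point in UTF-8, UTF-16BE, UTF-32BE."""
    ch = chr(cp)
    return tuple(ch.encode(enc).hex() for enc in ("utf-8", "utf-16-be", "utf-32-be"))


def strip_xml10_illegal(text):
    """Drop every character that XML 1.0 does not allow in a document."""
    return XML10_ILLEGAL.sub("", text)


if __name__ == "__main__":
    # Two C0 controls, DEL, and one C1 control.
    for cp in (0x03, 0x08, 0x7F, 0x85):
        print(f"U+{cp:04X}", *encoded_hex(cp))

    # The backspace is removed; the tab is allowed in XML 1.0 and stays.
    print(repr(strip_xml10_illegal("a\x08b\tc")))
```

Note that the stripping has to happen on the raw text before an XML parser sees it, since a parser will reject the input as not well-formed. Also, U+007F - U+009F pass the XML 1.0 check (they are discouraged, not forbidden), so this sketch leaves them in place.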

Thanks for pointing me to this.

Cheers,

-- Abel Braaksma
  http://abelleba.metacarpus.com
