Re: [xsl] 0x19 is not a legal XML character
Subject: Re: [xsl] 0x19 is not a legal XML character
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Thu, 28 Jun 2007 12:46:10 +0200
Andrew Welch wrote:
> On 6/28/07, Abel Braaksma <abel.online@xxxxxxxxx> wrote:
>> this may work and will remove all offending U+0019 chars.
>
> The "offending" U+0019 characters could well be good content that's
> being written/read in the wrong encoding.
True, but if I remember correctly, all ISO-646 characters (the old
ASCII ones, below 0x80) are encoded as-is in UTF-8, all ISO-8859-x,
the CPxxx Windows/DOS encodings, TIS-620, Shift-JIS, GB2312, etc.
The only notable exceptions, I believe, are the IBM EBCDIC encodings
(though the one most often used, IBM500, also has End of Medium right
at 0x19). None of these encodings, not even the EBCDIC ones, uses
0x19 for a diacritic.
My point is: I think it is very unlikely that encoding alone (on read
or write) is the culprit here, although it often is the culprit for
higher characters.
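The claim above is easy to check. Here is a minimal Python sketch (Python rather than XSLT, purely for illustration) that decodes the single byte 0x19 under a handful of the encodings mentioned. The codec names are Python's own spellings, and the list is only a sample, not an exhaustive survey.

```python
# Decode the single byte 0x19 under several of the encodings named above.
# The codec names are Python's; "cp500" is the EBCDIC code page IBM500.
ENCODINGS = [
    "utf-8", "iso-8859-1", "cp1252", "cp437",
    "tis-620", "shift_jis", "gb2312", "cp500",
]

def decode_0x19(encodings=ENCODINGS):
    """Return a dict mapping each encoding name to the decoded character."""
    return {enc: b"\x19".decode(enc) for enc in encodings}

results = decode_0x19()
for enc, ch in results.items():
    print(f"{enc:12} -> U+{ord(ch):04X}")
```

If every line reports the same code point, a plain encoding mix-up between these code pages cannot have turned a printable character into U+0019.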
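Of course, it can be valid content, in which case the XML documents
should be declared as XML 1.1 documents (XML 1.1 allows U+0019, but
only as a character reference such as &#x19;, not as a literal
character).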
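To make the XML 1.0 versus XML 1.1 difference concrete, here is a small Python sketch of the character classes as I read them from the two Recommendations. The function names are made up for this example, and this is a check on code points only, not a parser.

```python
def is_xml10_char(cp: int) -> bool:
    """True if the code point matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def is_xml11_char(cp: int) -> bool:
    """True if the code point matches the XML 1.1 Char production."""
    return (0x1 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)

def is_xml11_restricted(cp: int) -> bool:
    """True if XML 1.1 allows the code point only as a character reference."""
    return (0x1 <= cp <= 0x8 or 0xB <= cp <= 0xC or 0xE <= cp <= 0x1F
            or 0x7F <= cp <= 0x84 or 0x86 <= cp <= 0x9F)

for name, check in [("XML 1.0 Char", is_xml10_char),
                    ("XML 1.1 Char", is_xml11_char),
                    ("XML 1.1 RestrictedChar", is_xml11_restricted)]:
    print(f"U+0019 {name}: {check(0x19)}")
```

So U+0019 is never legal in XML 1.0, and in XML 1.1 it has to be written as a character reference. The processor must also actually support XML 1.1, which not every parser does.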
> Simply stripping them out probably isn't the best approach - you need
> to work out why they're there, what put them there and then fix that.
> Patching it up afterwards is never a good idea.
Agreed, I just wanted to show how it can be done in XSLT, in case you
(the OP) felt a need for it.
> Imagine explaining your process to someone else in a year's time -
> "this step is where we remove the U+0019 characters".
Good design starts at the sources.
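
As a starting point for tracing the characters back to their source, something like the following Python sketch could report where they occur in the raw text before any XML parser sees it. The function name, the context width and the sample string are all invented for illustration.

```python
import re

# Any character outside the XML 1.0 Char production.
ILLEGAL_XML10 = re.compile(
    "[^\t\n\r\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def find_illegal_chars(text, context=20):
    """Return (line number, code point, surrounding text) for each hit."""
    hits = []
    for m in ILLEGAL_XML10.finditer(text):
        line = text.count("\n", 0, m.start()) + 1
        start = max(0, m.start() - context)
        snippet = text[start:m.end() + context]
        hits.append((line, f"U+{ord(m.group()):04X}", snippet))
    return hits

# Hypothetical sample input with one stray U+0019 on the second line.
sample = "<doc>\n<p>end of\u0019 medium</p>\n</doc>"
hits = find_illegal_chars(sample)
for line, code, snippet in hits:
    print(line, code, repr(snippet))
```

Knowing the line and the surrounding text usually says more about which upstream step produced the character than a silent strip would.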