Re: [xsl] Need to remove unusual character in source

Subject: Re: [xsl] Need to remove unusual character in source
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Wed, 27 Sep 2006 01:02:52 +0200
Mario Madunic wrote:
the character is and it's a control character

0x18 CAN

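Unfortunately, that says it all. XML 1.0 allows only three control characters below x20: tab (x09), line feed (x0A) and carriage return (x0D). Any other one, including x18, makes the XML not well-formed, whatever the encoding. (UTF-8 itself can encode x18 just fine; it is XML that forbids it.)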


the error message I receive is
SXXP0003: Error reported by XML parser: Illegal XML character: &#x18;.

This is indeed illegal. The other day I accidentally used &#x08;, which is also illegal (I had it mistaken for a tab character, x09, which *is* legal).
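The difference between x08 and x09 follows from the Char production in the XML 1.0 specification. A minimal sketch of that rule in Ruby, where legal_xml10_char? is a hypothetical helper name chosen for illustration and not part of any library:

```ruby
# Returns true if the code point is allowed by the XML 1.0 Char production.
# legal_xml10_char? is a made-up helper name, used here for illustration only.
def legal_xml10_char?(cp)
  cp == 0x9 || cp == 0xA || cp == 0xD ||
    (0x20..0xD7FF).cover?(cp) ||
    (0xE000..0xFFFD).cover?(cp) ||
    (0x10000..0x10FFFF).cover?(cp)
end

# Check the three characters mentioned in this thread.
[0x08, 0x09, 0x18].each do |cp|
  puts format('0x%02X legal: %s', cp, legal_xml10_char?(cp))
end
# prints:
# 0x08 legal: false
# 0x09 legal: true
# 0x18 legal: false
```

This checks code points only; it says nothing about whether the bytes of a file are valid in its declared encoding.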


I've tried using ANT to clean it out but with no luck using native2ascii or
escapeunicode

That won't help either: escaping these characters does not make them legal. But you are on the right track: use a filter to remove this character, or replace it with something useful. I use such a filter to read the Microsoft *.msg format, which has some useful lines, but the rest is control characters and other illegal data. Here's what it might look like if you resort to Ruby (you can call it from Ant if you like), see www.ruby-lang.org.


(spoiler warning: this is off-topic and only marginally related to xslt)


    # create the working dir if it does not exist yet
    Dir.mkdir('trimmed') unless File.directory?('trimmed')

    Dir.entries('.').each do |fn|
      # only handle files with the wanted extension
      next unless fn =~ /\.yourextension\z/
      # read the complete file contents in binary mode
      data = File.open(fn, 'rb') { |file| file.read }
      # write everything except the x18 characters; write (not puts)
      # is used so that no extra newlines are added to the output
      File.open("trimmed/#{fn}.txt", 'wb') do |newfile|
        newfile.write(data.delete("\x18"))
      end
    end



Just replace "yourextension" with the extension of your files and "trimmed" with an output directory name of your choice. Replace ".txt" with whatever extension you like. The script runs through the current directory and copies all matching files to the "trimmed" directory, with one change: the x18 character is removed.
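If the data turns out to contain more than one kind of control character, the same idea extends to all of the low control characters that XML 1.0 forbids. A rough sketch, assuming the input is a string read in binary mode as above; FORBIDDEN_C0 and strip_forbidden_controls are hypothetical names, not anything from the original script:

```ruby
# Control characters below 0x20 that XML 1.0 forbids: everything
# except tab (0x09), line feed (0x0A) and carriage return (0x0D).
# The string is a character-range spec as understood by String#delete.
FORBIDDEN_C0 = "\x00-\x08\x0B\x0C\x0E-\x1F"

# Hypothetical helper: returns a copy of data without those characters.
def strip_forbidden_controls(data)
  data.delete(FORBIDDEN_C0)
end

sample = "ab\x18cd\x08\tef\n"
puts strip_forbidden_controls(sample).inspect
```

It only covers the range below x20; other characters that XML 1.0 rejects (such as xFFFE and xFFFF) are left alone, so treat it as a starting point rather than a complete sanitizer.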


Of course, you can use Perl, a DOS Batch file (takes some practice), Bash, VBScript, PHP, Grep, Awk or any other tool you'd prefer.

HTH,

Cheers,
Abel Braaksma
http://abelleba.metacarpus.com



Can this be done or do I need to ask the client to remove it from their data,
which might not be an option?

Any help or insight would be greatly appreciated.

Marijan Madunic
