Re: [xsl] Illegal xml chars

Subject: Re: [xsl] Illegal xml chars
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Thu, 13 Dec 2007 18:10:29 +0100
The purpose of using XML, or of using any standard at all, is that supplier and receiver both understand the format and you need not worry about vendor-specific formats or deviations. XML is a very free language, but the standard does dictate that when a document is not well-formed (and an encoding problem means it isn't), a processor *must* reject it with a fatal error. If you try to bypass that, it is like driving a car with no brakes: some day you will hit a wall and things will crash, and all along you thought you were driving a real car... it at least looked like one ;)

If you cannot fix the source (i.e., some proprietary legacy home-brewed XML-like format which you have to deal with regardless of what a standard dictates), it is best to agree with your supplier on exactly what the differences are (or can be), and to pin that down as strictly as you can. Then decide how to deal with it. In your situation I'd choose a single filter or a filter chain. Many existing workflow systems have that, and if yours doesn't, it's trivial to write one (but don't use XSLT for it, because XSLT expects XML, which you haven't got yet).
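To illustrate the single-filter idea, here is a minimal sketch in Python (my choice for illustration only; the thread itself is about .NET). It assumes the input is already decoded text and that replacing each disallowed character with a space is acceptable. The name strip_illegal_xml_chars is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Characters outside the XML 1.0 Char production: C0 controls other than
# tab, LF and CR, the surrogate range, and U+FFFE / U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    r"[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def strip_illegal_xml_chars(text, replacement=" "):
    """Hypothetical filter: replace every disallowed character in text."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


# The raw input contains a BackSpace (U+0008), as in the original question.
raw = "<doc>bad\x08char</doc>"
clean = strip_illegal_xml_chars(raw)

# Only the filtered text is handed to the XML parser.
root = ET.fromstring(clean)
```

The filter runs before any XML tooling sees the data, so everything downstream (parser, XSLT) only ever deals with well-formed input.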

Once you have filtered it and turned the wannabe XML into proper XML, you can start transforming it with XSLT. Without any hassle, really.

There's only one other option I can think of, which basically comes down to the same thing in the end but may be more extensible: write a decoder for a custom encoding, call it "almost-utf8", register it, and set the encoding of your document to this home-brewed encoding (<?xml version="1.0" encoding="almost-utf8"?>). The encoding is identical to UTF-8 except for the characters that you don't allow, which you map to a space or whatever.
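As a rough sketch of what such an "almost-utf8" encoding could look like, here is a version using Python's codecs registry (again Python for illustration only, not the .NET API discussed in this thread). The helper names are hypothetical. Whether a given XML parser will pick up a custom encoding named in the XML declaration depends on the parser, so this sketch decodes the bytes explicitly instead.

```python
import codecs
import re

# C0 controls other than tab, LF and CR, plus U+FFFE / U+FFFF.
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\ufffe\uffff]")


def _almost_utf8_decode(data, errors="strict"):
    # Decode as ordinary UTF-8, then map each disallowed character to a space.
    text, consumed = codecs.utf_8_decode(data, errors, True)
    return _ILLEGAL.sub(" ", text), consumed


def _almost_utf8_encode(text, errors="strict"):
    # Encoding is plain UTF-8; only decoding differs.
    return codecs.utf_8_encode(text, errors)


def _search(name):
    # Depending on the Python version, the lookup name arrives with the
    # hyphen kept or converted to an underscore, so accept both.
    if name in ("almost-utf8", "almost_utf8"):
        return codecs.CodecInfo(
            name="almost-utf8",
            encode=_almost_utf8_encode,
            decode=_almost_utf8_decode,
        )
    return None


codecs.register(_search)

# Bytes containing a BackSpace (0x08) decode to text with a space instead.
decoded = b"<doc>bad\x08char</doc>".decode("almost-utf8")
```

In effect this is the same filter as before, only hidden behind an encoding name, which is why it comes down to the same thing in the end.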

But all these methods are far from perfect compared to fixing it at the source. What is the use of using a BS (BackSpace) character in your document anyway?

Cheers,
-- Abel Braaksma



Waqar Ali wrote:
Sorry.. I do not want to drag this topic out, but setting CheckCharacters to false does not work.. Here is what is written in the documentation:

"If the XmlReader is processing text data, it always checks that the XML names and text content are valid, regardless of the property setting. Setting CheckCharacters to false turns off character checking for character entity references."

No matter what I do, the parser does not like this character, and I have no option but to somehow take it out of the XML.

Thanks guys for your help.
