Subject: Re: [xsl] invalid character (Unicode: 0xa0) in xsl document - LONG
From: Eric Jacobson <ericjacobson@xxxxxxxxxxxx>
Date: Sat, 28 Apr 2001 23:47:24 -0400

The essay below may or may not pertain to your actual problem. However, it is quite likely that your XML declares itself to be encoded as UTF-8 without that actually being the case.

jackson wrote:
>
> Alan
>
> > I'm processing an xsl file with the apache xalan 2 processor, and am
> > getting the following error message when i run my application:
> >
> > javax.xml.transform.TransformerConfigurationException: An invalid XML
> > character (Unicode: 0xa0) was found in the element content of the
> > document.
>
> Well, your document says it's UTF-8. I'm not an expert on Unicode
> and related issues, but i think 0xa0, while it is Unicode, is not a possible
> UTF-8 character.
>
> The character 0xa0 is a non-breaking space. I don't know how
> it might have got in your document (possibly from some HTML?),
> but you could find it and get rid of it. Since it's white space, it's
> not going to be obvious.
>
> You could write a script to look for this character and change
> it - say, to a normal space. You could also do it in your java
> program i suppose, before parsing.
>
> I suppose you could also turn 0xa0 into the UTF-8 equivalent
> (i can't help you there). Java classes might be able to do it for
> you - from what i remember (quite a while ago), there is a class
> for writing to a UTF file?
>
> David Jackson
>

A brief note before the long-winded part:

I suspect you are referring to the DataInputStream and DataOutputStream classes, which have readUTF() and writeUTF() methods. These methods read and write a modified form of UTF-8, prefixed with a two-byte length, that will not be meaningful to a standards-compliant processor. Specifying an encoding name to the constructor of an InputStreamReader or OutputStreamWriter will work, as will passing an encoding name to the String method getBytes(). Your other option is to figure out what encoding your system uses by default and declare that in the encoding attribute in your XML prolog.
However, UTF-8 and UTF-16 are the only two encodings that the standard requires every XML processor to support.

Now for the long part:

UTF-8 is a method for representing Unicode character codes on a stream of 8-bit units. The most commonly used characters have codes that fit in 16 bits, but a large volume of data is still primarily composed of the traditional ASCII characters, which require only 7 bits to represent, so using 16 bits per character would be quite inefficient. UTF-8 represents each character in the ASCII range as a single octet whose high bit is 0. For character codes that are larger, more than one octet is used. The leading bits of the first octet indicate (1) that more than one octet should be read and (2) how many. The following octets begin with a pattern that indicates that they are not the start of a character. The remaining bits in each octet are then used to hold the actual value being stored.

The overall effect is that if your data is all ASCII, the UTF-8 encoding comes out just like a traditional ASCII file - one character for every 8 bits. You can create and read such files with traditional software that never actually heard of UTF-8. If the data uses characters whose codes are >= 128, UTF-8 encodes each of them as multiple octets, and a system that is not interpreting the data as UTF-8 will show garbage or come up with an error.

XML requires all XML processors to support UTF-8, and the prolog

<?xml version="1.0" encoding="UTF-8" ?>

has been added to a great number of XML files as a hard-coded string, based in part on copying examples. The data in those files is then generated by a system that may not be aware of what UTF-8 really means and uses some other actual encoding scheme (Cp1252 aka winAnsi aka Windows-Latin-1, for example). In such an encoding a non-breaking space is the single byte 0xA0, which in UTF-8 can only appear as a continuation octet, never on its own. The end result is that the XML processor expects UTF-8 encoding, finds a bit pattern that is not valid in UTF-8, and screams.

In Java, a char is an unsigned 16-bit value containing a Unicode character code.
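
To make the byte-level picture concrete, here is a minimal Java sketch. It assumes a JDK recent enough to have java.nio.charset.StandardCharsets, and the class and method names (Utf8Bytes, hex, isValidUtf8) are made up for illustration. It shows the non-breaking space encoded both ways, and a strict UTF-8 decoder rejecting the lone 0xA0 byte that a Latin-1 style encoding produces.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Bytes {
    // Render a byte array as space-separated hex, e.g. "c2 a0".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // True if the bytes are well-formed UTF-8. The decoder is told to
    // report malformed input rather than substitute a replacement char.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String nbsp = "\u00a0"; // the non-breaking space from the error message
        // Two octets in UTF-8: a lead octet plus one continuation octet.
        System.out.println(hex(nbsp.getBytes(StandardCharsets.UTF_8)));         // c2 a0
        // One octet in ISO-8859-1 (and likewise in Cp1252).
        System.out.println(hex(nbsp.getBytes(StandardCharsets.ISO_8859_1)));    // a0
        // A bare 0xA0 is a continuation octet with no lead octet.
        System.out.println(isValidUtf8(new byte[] {(byte) 0xa0}));              // false
        System.out.println(isValidUtf8(new byte[] {(byte) 0xc2, (byte) 0xa0})); // true
    }
}
```

The third line of output is the situation the parser is complaining about: a file declared as UTF-8 that actually contains the single-byte form.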
When reading characters from or writing them to 8-bit, byte-oriented streams or buffers, many Java classes give the option of specifying the name of an encoding to use, and apply a system default otherwise. The String method getBytes("UTF8") would return a buffer of bytes representing the String's characters using the UTF-8 encoding. Alternatively, you could wrap an OutputStreamWriter around your actual OutputStream with the encoding set in the constructor.

Hope this helps.

Eric Jacobson

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
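
A rough sketch of the OutputStreamWriter approach, in case it is useful. The class name WriteDeclaredEncoding, the method writeXml, and the element name "note" are all hypothetical, and the text is assumed to contain no markup characters that would need escaping. The point is only that the same encoding name goes into both the XML prolog and the writer's constructor, so the declaration and the actual bytes cannot drift apart.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

public class WriteDeclaredEncoding {
    // Serialize a tiny document using the named encoding for both the
    // prolog and the byte conversion. "note" is a made-up element name;
    // text is assumed to need no escaping.
    static byte[] writeXml(String text, String encoding) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, encoding)) {
            w.write("<?xml version=\"1.0\" encoding=\"" + encoding + "\"?>");
            w.write("<note>" + text + "</note>");
        } catch (IOException e) {
            // Also covers UnsupportedEncodingException for an unknown name.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] utf8 = writeXml("a\u00a0b", "UTF-8");
        byte[] latin1 = writeXml("a\u00a0b", "ISO-8859-1");
        // The non-breaking space takes two bytes in one and one byte in the other.
        System.out.println(utf8.length + " bytes as UTF-8");
        System.out.println(latin1.length + " bytes as ISO-8859-1");
    }
}
```

Keep in mind that only UTF-8 and UTF-16 are guaranteed to be understood by every XML processor, so declaring anything else relies on the particular parser supporting it.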