Subject: RE: [xsl] invalid character (Unicode: 0xa0) in xsl document - LONG
From: "Joshua Allen" <joshuaa@xxxxxxxxxxxxx>
Date: Sat, 28 Apr 2001 22:41:50 -0700

This is correct -- 0xA0 cannot appear as the first byte of a UTF-8
sequence [1]. This character could easily appear as the second byte of a
two-byte sequence, and I could also see the error appearing IF you receive
a UTF-8 file that does not have a BOM, and is in a different byte-order
than your system expects (for example, little-endian, and your system uses
big-endian for two-byte sequences). In this case, the parser would
(perhaps) assume the preferred byte order, and since 90% of the file is
single-byte characters anyway, it would not die until it reaches a
sequence that has two bytes or more (perhaps A0E0 in little-endian, your
processor would be expecting big-endian, so would expect to see that
character as E0A0, and would see instead a character starting with A0 and
would throw the error you are seeing).

So this error could very well occur when exchanging valid UTF-8 with no
BOM between systems with differing byte-orders. Lesson is, always use a
BOM :-)

Also note that just using an encoding stream that does UTF-8 as suggested
below will not solve all of your problems. There are characters which are
not valid XML [2], but which are perfectly valid UTF-8. I am not aware of
any streamwriters that automatically strip these out for you.

[1] http://www.unicode.org/unicode/uni2errata/UTF-8_Corrigendum.html
    (see table 3.1b)
[2] http://www.w3.org/TR/2000/REC-xml-20001006#NT-Char

Regards,
Joshua

> -----Original Message-----
> From: Eric Jacobson [mailto:ericjacobson@xxxxxxxxxxxx]
> Sent: Saturday, April 28, 2001 8:47 PM
> To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
> Subject: Re: [xsl] invalid character (Unicode: 0xa0) in xsl document - LONG
>
> The essay below may or may not pertain to your actual problem. However,
> it may very likely be that your XML is declaring itself to be
> encoded as UTF-8 without that actually being the case.
>
> jackson wrote:
> >
> > Alan
> >
> > > I'm processing an xsl file with the apache xalan 2 processor, and am
> > > getting the following error message when i run my application:
> > >
> > > javax.xml.transform.TransformerConfigurationException: An invalid XML
> > > character (Unicode: 0xa0) was found in the element content of the
> > > document.
> >
> > Well, your document says it's UTF-8. I'm not an expert on Unicode
> > and related issues, but i think 0xa0, while it is Unicode, is not a
> > possible UTF-8 character.
> >
> > The character 0xa0 is a non-breaking space. I don't know how
> > it might have got in your document (possibly from some HTML?),
> > but you could find it and get rid of it. Since it's white space, it's
> > not going to be obvious.
> >
> > You could write a script to look for this character and change
> > it - say, to a normal space. You could also do it in your java
> > program i suppose, before parsing.
> >
> > I suppose you could also turn 0xa0 into the UTF-8 equivalent
> > (i can't help you there). Java classes might be able to do it for
> > you - from what i remember (quite a while ago), there is a class
> > for writing to a UTF file?
> >
> > David Jackson
> >
>
> A brief note before the long-winded part: I suspect you are referring
> to the DataInputStream and DataOutputStream classes, which have
> methods to readUTF() and writeUTF(). These methods read and write a
> modified form of UTF-8 that will not be meaningful to a
> standards-compliant processor. Specifying an encoding name to the
> constructor of an InputStreamReader or OutputStreamWriter will work,
> as will passing an encoding name to the String method getBytes().
>
> Your other option is to figure out what encoding your system uses
> by default and declare that in the encoding attribute in your XML
> prolog. However, the only two encodings required for all XML processors
> by the standard are UTF-8 and UTF-16.
>
> Now for the long part:
>
> UTF-8 is a method for representing Unicode characters (16 bit values)
> on a stream of 8-bit units. Given that a large volume of data is still
> primarily composed of the traditional ASCII characters, which require
> only 7 bits to represent, using 16 bits per character would be quite
> inefficient. UTF-8 uses 8 bits with the sign bit 0 to represent
> characters that fall into the ASCII range in a single octet. For
> character codes that are larger, more than one byte is used. The leading
> bits of the first octet are used to indicate (1) that more than one
> octet should be read and (2) how many. The following octets begin
> with a pattern that indicates that they are not the start of a
> character. The remaining bits in each octet are then used to hold the
> actual value being stored.
>
> The overall effect is that if your data is all ASCII, the UTF-8
> encoding comes out just like a traditional ASCII file - one
> character for every 8-bits. You can create and read such files
> with traditional software that never actually heard of UTF-8.
> If it uses characters whose codes are
> >= 128, it will translate those into multiple octets and a system that
> is not making the appropriate interpretations will come up with an
> error.
>
> XML requires all XML processors to
> support UTF-8, and the prolog <?xml version="1.0" encoding="UTF-8" ?>
> has been added to a great number of XML files as a hard-coded string,
> based in part on copying examples.
> The data in those files is then generated by a system that may not
> be aware of what UTF-8 really means and use some other actual
> encoding scheme (Cp1252 aka winAnsi aka Windows-Latin-1, for example).
> The end result is that the XML processor expects UTF-8 encoding,
> finds a bit pattern that is not valid in UTF-8, and screams.
>
> In Java, a character is an unsigned 16 bit value containing a
> Unicode character code.
> When reading or writing characters from
> 8-bit byte oriented streams or buffers, many Java classes give the
> option of specifying the name of an encoding to use and apply a
> system default otherwise. The String method getBytes("UTF8")
> would return a buffer of bytes representing the String's characters
> using the UTF-8 encoding. Alternatively, you could wrap an
> OutputStreamWriter around your actual OutputStream with the
> encoding set in the constructor.
>
> Hope this helps.
>
> Eric Jacobson
>
> XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
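
To make the encoding points in this thread concrete, here is a minimal
Java sketch. It is not taken from any of the posts: the class and method
names are hypothetical, and it assumes Java 8 or later. It shows that
U+00A0 is the single byte A0 in ISO-8859-1 but the two bytes C2 A0 in
UTF-8, that a lone A0 byte is rejected by a strict UTF-8 decoder, and one
possible way to drop characters that fall outside the XML 1.0 Char
production [2] before serializing.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
final class NbspEncodingSketch {

    // Strictly decode the bytes as UTF-8; false means a malformed sequence was found.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // The Char production from the XML 1.0 recommendation:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Drop every code point that XML 1.0 does not allow in a document.
    static String stripNonXmlChars(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints()
                .filter(NbspEncodingSketch::isXmlChar)
                .forEach(out::appendCodePoint);
        return out.toString();
    }

    // Format bytes as space-separated upper-case hex, e.g. "C2 A0".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nbsp = "\u00A0"; // non-breaking space
        System.out.println(hex(nbsp.getBytes(StandardCharsets.ISO_8859_1))); // A0
        System.out.println(hex(nbsp.getBytes(StandardCharsets.UTF_8)));      // C2 A0
        System.out.println(isValidUtf8(new byte[] {(byte) 0xA0}));           // false
        System.out.println(isValidUtf8(new byte[] {(byte) 0xC2, (byte) 0xA0})); // true

        // 0x0B (vertical tab) is valid UTF-8 but not an XML 1.0 Char.
        String dirty = "a" + (char) 0x0B + "b" + nbsp;
        System.out.println(stripNonXmlChars(dirty).length()); // 3
    }
}
```

Note that U+00A0 itself passes isXmlChar(), so the filter leaves it alone;
in this sketch the non-breaking space only becomes a problem when its
Latin-1 byte A0 is handed to a decoder that expects UTF-8.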