Subject: Re: [xsl] strange encoding problem
From: Mike Brown <mike@xxxxxxxx>
Date: Fri, 1 Nov 2002 10:12:39 -0700 (MST)
Jeni Tennison wrote:
> Andreas wrote:
> > PROBLEM:
> >
> > when i use tomcat, jsp and the jstl (java standard tag library) to
> > apply the transformation
> >
> > <%@ taglib prefix="x" uri="http://java.sun.com/jstl/xml" %>
> > <c:import url="test.xml" var="xml"/>
> > <c:import url="test.xsl" var="xsl"/>
> > <x:transform xml="${xml}" xslt="${xsl}"/>
> >
> > the result is &#195;&#188;
> > which is NOT correct in my opinion.

I'm surprised Tommie isn't scolding you both for straying off topic.

> When you say it's &#195;&#188;, do you mean that when you open up the
> result you actually see those entity references, or do you see the
> actual characters Ã¼?
>
> I suspect it's the latter, in which case make sure that the text
> editor (or whatever) that you're using to look at the result of the
> transformation is reading in that result as UTF-8 rather than as
> ISO-8859-1.
>
> If the former, then something really weird's going on -- it looks as
> though the result is being serialised as UTF-8, then read as
> ISO-8859-1 and then serialised again using HTML entity references.
> Perhaps knowing that's what's going on will help you track down the
> bug...

Yes, it's very typical in servlet/JSP applications to do something like
this:

1. The client requests a page via HTTP.

2. The server sends an HTML form, wherein the Unicode characters of the
   document have been serialized in the HTTP response as iso-8859-1 (or
   the local platform's default encoding). The response may or may not
   indicate that this is the encoding, via the charset parameter in the
   Content-Type header. The client may or may not use the indicated
   encoding to decode the document and present the form (the user can
   usually override the decoding on their end, because there is a long
   history of Japanese and Chinese multibyte character sets being
   misrepresented as iso-8859-1).

3.
   Due to convention, not formal standard, the client will try to use
   the same encoding when it submits the form data, no matter how it is
   sent (GET or POST, x-www-form-urlencoded or multipart/form-data ...
   doesn't matter). Unencodable characters in the form data might first
   be translated to numeric character references... again, there is no
   standard, so browser behavior varies. The browser most likely will
   *not* indicate what encoding was used in the form data submission,
   "for backward compatibility".

   The form data in the HTTP request may look like this, for example, if
   you entered a copyright notice consisting of "<copyright symbol
   U+00A9> 2002 Acme Inc." into a form field named foo on a utf-8
   encoded HTML page:

      foo=%C2%A9%202002%20Acme%20Inc.

   Note that the copyright symbol character in utf-8 is 0xC2 0xA9, while
   in iso-8859-1 it is just 0xA9.

   I recommend monitoring the HTTP traffic so you can see the raw
   request before making any assumptions. Use a proxy server with
   extended logging options, like Proxomitron, or use a packet sniffer
   like tcpdump and your favorite binary file viewer. E.g., on my BSD
   box I can use "tcpdump -s 0 -w - port 80 | hexdump -C".

4. The server (a servlet/JSP engine like Tomcat or WebLogic) will make
   an assumption about what encoding was used in the form data. Most
   likely, it will choose iso-8859-1 or whatever the platform default
   encoding is. Thus it will give you access to what it calls a
   "parameter" (a bad name... URIs and MIME headers have parameters too,
   but they aren't the same thing) named foo, containing the Unicode
   string you get if you decode the URL-encoded bytes as iso-8859-1:
   roughly, <capital A with circumflex: U+00C2> <copyright symbol:
   U+00A9> + "2002 Acme Inc.".

Now you can see how things start to go awry. It snowballs from there.

The solution I recommend is this:

1. Always know the encoding of the HTML form that you send to the
   browser. For maximum predictability and Unicode support, I recommend
   using utf-8.
   Ensure that the HTML declares itself as utf-8 in a meta tag and/or in
   the HTTP response headers.

2. Make it a requirement for using your application that the browser be
   set to auto-detect the encoding, not override it, so you can assume
   the form data will come back in the same encoding as the form. OR you
   can look at the Accept-Charset and/or Accept-Language headers in the
   HTTP requests to make an intelligent *guess* as to what encoding the
   browser is using. I don't recommend this because, well, it's still a
   guess, and you probably wouldn't know when to choose utf-8.

3. If you sent out the form in utf-8, your response is probably coming
   back utf-8, so take your decoded "parameter", re-encode it as
   iso-8859-1 bytes, and decode those bytes back as if they were utf-8.
   Something like this, in Java, plus the appropriate try-catch for the
   possible UnsupportedEncodingException:

      String badString = request.getParameter("foo");
      byte[] bytes = badString.getBytes("ISO-8859-1");
      String goodString = new String(bytes, "UTF-8");

Now, that just covers the general stuff... I think if you can understand
this much of it, you can get to a point where you can figure out how the
XSLT transformation output gets munged. Like I said, it really helps if
you can peek at the data as it is going back and forth, if you know how
to spot faulty data... put lots of System.out.println()s in and get
yourself something to look at the HTTP messages with.

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
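[Editor's note: the misdecode-and-fix round trip described in steps 2-4 and in solution step 3 above can be sketched as a small standalone program. This is a hedged illustration, not code from the original post: the class name `EncodingRoundTrip`, the helper names `misdecode`/`fix`, and the sample string are all invented for the demonstration, and the wrapping of UnsupportedEncodingException is one reasonable way to supply the try-catch the post mentions.]

```java
import java.io.UnsupportedEncodingException;

// Illustrative sketch (names and literals are hypothetical, not from the
// original post): reproduce the container's bad ISO-8859-1 decode of
// UTF-8 form bytes, then reverse it as solution step 3 describes.
public class EncodingRoundTrip {

    // Simulate what a container does when it decodes UTF-8 form bytes
    // as ISO-8859-1: the copyright sign's bytes 0xC2 0xA9 come back as
    // the two characters U+00C2 U+00A9 ("A-circumflex" + "copyright").
    static String misdecode(String original) {
        try {
            return new String(original.getBytes("UTF-8"), "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // Both charsets are required in every Java runtime.
            throw new RuntimeException(e);
        }
    }

    // The fix from the post: re-encode the misdecoded string as
    // ISO-8859-1 to recover the original bytes, then decode as UTF-8.
    static String fix(String misdecoded) {
        try {
            byte[] rawBytes = misdecoded.getBytes("ISO-8859-1");
            return new String(rawBytes, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        String original = "\u00A9 2002 Acme Inc.";   // copyright notice
        String garbled = misdecode(original);        // starts with U+00C2 U+00A9
        System.out.println(garbled);
        System.out.println(fix(garbled));            // round-trips back
    }
}
```

Note that this only works because ISO-8859-1 maps every byte 0x00-0xFF to a character, so the bad decode loses no information and can be reversed exactly.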