
Subject: Re: [xsl] strange encoding problem
From: Mike Brown <mike@xxxxxxxx>
Date: Fri, 1 Nov 2002 10:12:39 -0700 (MST)
Jeni Tennison wrote:
> Andreas wrote:
> > PROBLEM:
> >
> > when i use tomcat, jsp and the jstl (java standard tag library) to apply the
> > transformation
> >
> > <%@ taglib prefix="x" uri="http://java.sun.com/jstl/xml" %>
> > <c:import url="test.xml" var="xml"/>
> > <c:import url="test.xsl" var="xsl"/>
> > <x:transform xml="${xml}" xslt="${xsl}"/>
> >
> > the result is &Atilde;&frac14;
> > which is NOT correct in my opinion.

I'm surprised Tommie isn't scolding you both for straying off topic.

> When you say it's &Atilde;&frac14;, do you mean that when you open up
> the result you actually see those entity references, or do you see the
> actual characters Ã¼?
> 
> I suspect it's the latter, in which case make sure that the text
> editor (or whatever) that you're using to look at the result of the
> transformation is reading in that result as UTF-8 rather than as
> ISO-8859-1.
> 
> If the former, then something really weird's going on -- it looks as
> though the result is being serialised as UTF-8, then read as
> ISO-8859-1 and then serialised again using HTML entity references.
> Perhaps knowing that's what's going on will help you track down the
> bug...

Yes, it's very typical in servlet/JSP applications to do something like this:

1. The client requests page via HTTP.

2. The server sends an HTML form, wherein the Unicode characters of the
   document have been serialized in the HTTP response as iso-8859-1 or the
   local platform's default encoding. The response may or may not indicate
   that this is the encoding, via the charset parameter in the Content-Type
   header.
   The client may or may not use the indicated encoding to know how to 
   decode the document and present the form (the user can usually override
   the decoding on their end, because there is a long history of Japanese
   and Chinese multibyte character sets being misrepresented as iso-8859-1).

3. By convention, not formal standard, the client will try to use the same
   encoding when it submits the form data, regardless of how it is sent
   (GET or POST, x-www-form-urlencoded or multipart/form-data).
   Unencodable characters in the form data might be first translated to
   numeric character references... again, there is no standard, so browser
   behavior varies. The browser most likely will *not* indicate what encoding
   was used in the form data submission, "for backward compatibility".

   The form data in the HTTP request may look like this,
   for example, if you entered a copyright notice consisting of 
   "<copyright symbol U+00A9> 2002 Acme Inc." into a form field named
   foo on a utf-8 encoded HTML page:  foo=%C2%A9%202002%20Acme%20Inc.
   Note that the copyright symbol character in utf-8 is 0xC2 0xA9, while
   in iso-8859-1 it is just 0xA9. I recommend monitoring the HTTP
   traffic so you can see the raw request before making any assumptions.
   Use a proxy server with extended logging options like the Proxomitron, or
   use a packet sniffer like tcpdump and your favorite binary file viewer.
   e.g., on my BSD box I can use "tcpdump -s 0 -w - port 80 | hexdump -C"
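
   To see the decoding ambiguity concretely, here is a small sketch (plain
   JDK, nothing servlet-specific; the class name is just illustrative) that
   decodes the example submission above both ways with java.net.URLDecoder:

import java.net.URLDecoder;

public class FormDataDemo {
    public static void main(String[] args) throws Exception {
        // The copyright-notice example from above, as a browser would send it
        // from a utf-8 encoded form:
        String raw = "%C2%A9%202002%20Acme%20Inc.";
        // Decoded as utf-8, the bytes 0xC2 0xA9 are one copyright symbol:
        System.out.println(URLDecoder.decode(raw, "UTF-8"));
        // Decoded as iso-8859-1, the same two bytes become two characters:
        System.out.println(URLDecoder.decode(raw, "ISO-8859-1"));
    }
}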

4. The server (servlet/JSP engine like Tomcat or Weblogic) will make an
   assumption about what encoding was used in the form data. Most likely,
   it will choose to use iso-8859-1 or whatever the platform default
   encoding is. Thus it will give you access to what it calls a
   "parameter" (bad name.. URIs and MIME headers have parameters too,
   but they aren't the same thing) named foo, containing the Unicode
   string you get if you decode the URL-encoded bytes as iso-8859-1:
   roughly, <capital A with circumflex: U+00C2> <copyright symbol: U+00A9>
   + "2002 Acme Inc.".

Now you can see how things start to go awry. It snowballs from there.
The solution I recommend is this:

1. Always know the encoding of the HTML form that you send to the browser. For
maximum predictability and Unicode support I recommend using utf-8. Ensure
that the HTML declares itself as utf-8 in a meta tag and/or in the HTTP
response headers.
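
In a servlet that might look like the sketch below (the class name and form
field are just illustrative; in a JSP the equivalent is a contentType
attribute on the page directive):

import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class FormPageServlet extends HttpServlet {
    public void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // Declare utf-8 in the HTTP Content-Type header (before getWriter())...
        response.setContentType("text/html; charset=UTF-8");
        PrintWriter out = response.getWriter();
        out.println("<html><head>");
        // ...and repeat it in a meta tag for good measure.
        out.println("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\">");
        out.println("</head><body>");
        out.println("<form method=\"post\" action=\"submit\">");
        out.println("<input type=\"text\" name=\"foo\"> <input type=\"submit\">");
        out.println("</form></body></html>");
    }
}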

2. Make it a requirement for using your application that the browser be set to
auto-detect encoding, not override it, so you can assume the form data will
come back using the same encoding as the form. OR you can look at the
Accept-Charset and/or Accept-Language headers in the HTTP requests to make an
intelligent *guess* as to what encoding the browser is using. I don't
recommend this because, well, it's still a guess, and you probably wouldn't
know when to choose utf-8.
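
If you decide to make that guess anyway, the general idea is roughly the
fragment below (a sketch only, not something I'd rely on): look for utf-8 in
Accept-Charset and fall back to iso-8859-1 otherwise.

// A sketch of the "intelligent guess" -- the header may be absent or
// misleading, which is part of why I don't recommend this.
String acceptCharset = request.getHeader("Accept-Charset"); // may be null
String guessedEncoding = "ISO-8859-1";
if (acceptCharset != null && acceptCharset.toUpperCase().indexOf("UTF-8") != -1) {
    guessedEncoding = "UTF-8";
}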

3. If you sent out the form in utf-8, the form data is probably coming back as
utf-8, so take your decoded "parameter", re-encode it as iso-8859-1 bytes, and
decode those bytes back as if they were utf-8. Something like this, in Java,
plus the appropriate try-catch for the possible UnsupportedEncodingException:

// The container decoded the utf-8 bytes as iso-8859-1; reverse that step:
String badString = request.getParameter("foo");
byte[] bytes = badString.getBytes("ISO-8859-1"); // recover the bytes the browser sent
String goodString = new String(bytes, "UTF-8");  // decode them as utf-8
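
With the try-catch spelled out, it might look like this (both encoding names
are required to be supported by every Java platform, so the catch branch is
effectively unreachable):

String goodString;
try {
    byte[] rawBytes = request.getParameter("foo").getBytes("ISO-8859-1");
    goodString = new String(rawBytes, "UTF-8");
} catch (java.io.UnsupportedEncodingException e) {
    // ISO-8859-1 and UTF-8 are guaranteed to be available, so we should
    // never get here; fail loudly if we somehow do.
    throw new RuntimeException(e.toString());
}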

Now, that just covers the general stuff. I think if you can understand this
much of it, you can get to a point where you can figure out how the XSLT
transformation output gets munged. Like I said, it really helps if you can
peek into the data as it goes back and forth, provided you know how to spot
faulty data... put lots of System.out.println()s in and get yourself something
to look at the HTTP messages with.
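
For the println part, even a crude helper like this hypothetical one (not
from anything above, just a sketch) makes a mis-decoded parameter easy to
spot:

// Dump each char of a suspect string as a code point, so double-decoded
// garbage like U+00C3 U+00BC jumps out immediately.
static void dumpChars(String s) {
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        System.out.println(i + ": U+" + Integer.toHexString(c) + " '" + c + "'");
    }
}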

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/


