Subject: Re: [xsl] Perl workaround for XSLT escaped UTF-8 pass-thru to XHTML
From: Gan Uesli Starling <alias@xxxxxxxxxxx>
Date: Fri, 21 Mar 2003 23:08:21 -0500
Your stylesheet is hard-coded to create the following element in the HTML output, after the <title>, even though the output encoding might not actually be UTF-8:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
Your stylesheet contains the following hint to the processor as to how to serialize the result tree:
<xsl:output method="html"/>
Because you did not specify a preferred encoding for output (e.g. <xsl:output method="html" encoding="UTF-8"/>), MSXML is going to default to UTF-16.
Okay, I tried that. Did not work for me. Am not using MS-anything for XSLT here but Apache Xalan on a NetBSD Unix box. Result of doing as you suggest with my tools is here...
Further, since in your example you are invoking the transformation via an xml-stylesheet PI, rather than from script in an HTML document or separate msxml.dll-using code altogether (e.g. msxsl.exe), the browser is invoking MSXML in a manner that forces the output to be UTF-16 *regardless* of what encoding was asked for in xsl:output.
In either case, following the guidelines for HTML output, MSXML adds the following *before* the <title>:
<META http-equiv="Content-Type" content="text/html; charset=UTF-16">
As an experiment I changed the declaration to UTF-16 and had a peek with both MSIE and Mozilla, but results were the same. So I put it back to UTF-8.
You can see this if you have installed the "tools for viewing and validating XSL output" (or whatever they're calling it today) from the MSDN site.
Running a BSD here. Have a Win2K box for my son to play games on, and I do VNC to it to cross-check my pages for XHTML. Otherwise I'm always on Unix (pardon '*nix').
Since this META comes first, the browser ignores the one that says UTF-8. This is good, because your HTML isn't UTF-8 encoded, it is UTF-16 encoded. The document's bytes have to be decoded into characters properly or you'll get a big nonsensical mess. So the UTF-16 bytes are mapped back into characters properly, and the markup is parsed and rendered appropriately.
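To make the decoding point concrete, here is a minimal JavaScript sketch of what happens when UTF-16 bytes are read under the wrong label. The helper name encodeUtf16le and the sample word are illustrative assumptions, not anything from the thread; TextDecoder is the standard Encoding API found in current browsers and Node.

```javascript
// Encode a string as UTF-16LE bytes by hand (TextEncoder only emits UTF-8).
function encodeUtf16le(text) {
  const bytes = new Uint8Array(text.length * 2);
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);  // one UTF-16 code unit
    bytes[2 * i] = unit & 0xff;       // low byte first (little-endian)
    bytes[2 * i + 1] = unit >> 8;     // high byte second
  }
  return bytes;
}

const original = "\u0109u";  // the Esperanto word "ĉu"
const bytes = encodeUtf16le(original);

// Decoding with the matching label recovers the characters.
const right = new TextDecoder("utf-16le").decode(bytes);
// Decoding the same bytes as UTF-8 yields control characters and NULs.
const wrong = new TextDecoder("utf-8").decode(bytes);

console.log(right === original);     // true
console.log(JSON.stringify(wrong));  // "\t\u0001u\u0000"
```

The bytes never change; only the label used to interpret them does, which is why a wrong or conflicting declaration matters so much.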
Hm, you keep saying 'META' tag versus 'meta' tag. My understanding was that XHTML required all lower-case tags. So I tried it with META but that too was no go.
If I put encoding="utf-8" in the xsl:output instruction, and invoke the transform without using the xml-stylesheet PI, such as via msxsl.exe, I can get actual UTF-8 output that contains a generated META that says UTF-8.
So you should get rid of that hard-coded meta tag... the processor will generate the correct one.
Java Xalan appears to issue no such tag. Having found none is why I hard coded it.
Now as for your other problem, I fail to see what the problem is with having encoded characters in the document. If the document is UTF-8 or UTF-16 encoded, there's no need to ever use a numeric character reference, so long as the document or delivery protocol properly declares the encoding. The browser will handle it just fine, as long as the user has not overridden the encoding (which, sadly, they are allowed to do).
It is on an Apache 2.0 server on NetBSD 1.6 in my home. Maybe Apache is defaulting to UTF-8 for XML and not for HTML? Strangely, with the meta declaration in place for UTF-8 I get the proper characters for escaped UTF-8 (versus bare UTF-8) on the pages for BOTH browsers, no matter what the Encoding pull-down menu says in either.
I have observed this for a long time since many of my XHTML pages (hand coded not XSLT) from a year past are in Esperanto. That is why I lean toward numeric character references.
If you really want numeric character references anyway, then why use UTF-8 or UTF-16 output? Output ASCII instead, and the serializer will take care of generating NCRs for you. No need to run the document through a perl script.
<xsl:output method="html" encoding="us-ascii"/>
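As a rough illustration of what a serializer does with non-ASCII characters when asked for us-ascii output, the following JavaScript sketch replaces each one with a decimal numeric character reference. The function name toAsciiWithNcrs is hypothetical, and the sketch deliberately ignores the separate escaping of markup characters such as & and <, which a real serializer also performs.

```javascript
// Replace every character outside the ASCII range with a decimal
// numeric character reference; ASCII characters pass through unchanged.
function toAsciiWithNcrs(text) {
  let out = "";
  for (const ch of text) {          // iterates by code point, so surrogate pairs stay whole
    const cp = ch.codePointAt(0);
    out += cp > 0x7f ? "&#" + cp + ";" : ch;
  }
  return out;
}

console.log(toAsciiWithNcrs("\u0109u, \u011dis!"));  // &#265;u, &#285;is!
```

The result is pure ASCII, so it survives whatever encoding label the server or the browser ends up applying.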
I may try that next, except that it would be nice to employ a Unicode-aware text editor such as mined or Yudit for the original.
It's ironic that the document you are rendering attempts to explain some of these issues.
For example, in the document's content, you imply that HTML doesn't "use Unicode" and that XML does. This is wrong. Both HTML 4.0 and XML 1.0 consist of a string of Unicode characters, taken from almost exactly the same repertoire (all of Unicode, minus a handful of control and non-characters). In both HTML and XML, these characters manifest as bytes according to the same set of encodings: those approved for Internet use by the IANA (utf-8, utf-16, iso-8859-1, etc.). Both HTML and XML use either decimal or hexadecimal character references (sequences of universally encodable characters like & # 1 6 0 ;) to represent single Unicode characters that typically wouldn't be supported in the document's actual encoding. The only real difference between XML and HTML, character-wise, is in how the documents declare their encoding internally (meta tag vs prolog), in whether it is a fatal error for the encoding declaration to be wrong, and in the fact that HTML defines many more built-in entities (nbsp, etc.) to represent characters.
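The equivalence of decimal and hexadecimal character references can be sketched in a few lines of JavaScript. The function name decodeNcrs is hypothetical; the sketch handles only numeric references, not named entities such as nbsp, and leaves anything it cannot map to a code point untouched.

```javascript
// Decode decimal (&#160;) and hexadecimal (&#xA0;) character references.
// HTML also accepts an uppercase X; XML allows only the lowercase form.
function decodeNcrs(text) {
  return text.replace(/&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));/g, (match, hex, dec) => {
    const cp = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave the reference as-is if it is beyond the Unicode range.
    return cp <= 0x10ffff ? String.fromCodePoint(cp) : match;
  });
}

// Both spellings name the same character, U+00A0 (no-break space).
console.log(decodeNcrs("&#160;") === decodeNcrs("&#xA0;"));  // true
```

Either form is plain ASCII on the wire, which is what makes such references usable in any document encoding.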
Some old browsers (notably Netscape 4.x) are horribly nonconformant with regard to encoding and how they interpret both entity references and numeric character references, but they're not something you should be putting too much effort into working around.
I try to keep them in mind. But I only really test against MSIE and Mozilla in their latest-and-greatest versions.
And at the bottom of the document, you tell people to switch to UTF-8, but changing the encoding (decoding, actually) is not an option in IE, when viewing browser-rendered XSLT output (i.e. when rendering an XML doc that contains an xml-stylesheet PI pointing to an XSLT stylesheet).
I will re-word it. I meant it only for viewing the *.html file that results from transforming an XML file with Xalan.
You might also be interested in this little snippet:
Looks interesting. Appears to be somewhat MS-specific. Suppose I could tweak it for a more broad-spectrum report?
<script type="text/javascript"><!--
// document.charset and document.defaultCharset are IE properties,
// hence the user-agent check before touching them.
if ( navigator.userAgent.toLowerCase().indexOf("msie") != -1
     && (parseInt(navigator.appVersion, 10) >= 4) ) {
  document.write(
    "<p><tt>Since you are using IE 4.0 or higher and do not have "
    + "scripting disabled, I can tell that this generated HTML document "
    + "is being interpreted by the browser as <u>" + document.charset
    + "</u> and that the browser's default encoding happens to be <u>"
    + document.defaultCharset + "</u>.</tt></p>"
  );
}
//--></script>
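One possible way to make such a report less browser-specific is to test for the properties themselves instead of the user-agent string. The sketch below is only that, a sketch: the function names reportedCharset and charsetReportHtml are hypothetical, and it assumes the browser exposes either the DOM property document.characterSet or the older IE name document.charset.

```javascript
// Return the encoding the browser used to decode the page, or null if
// it reports none. `doc` is a Document-like object, passed in so the
// function can be exercised outside a browser.
function reportedCharset(doc) {
  // characterSet is the standard DOM property; charset is the older IE name.
  return doc.characterSet || doc.charset || null;
}

// Build the report paragraph as an HTML string.
function charsetReportHtml(doc) {
  const used = reportedCharset(doc);
  if (used === null) {
    return "<p><tt>This browser does not report the encoding in use.</tt></p>";
  }
  let html = "<p><tt>This generated HTML document is being interpreted "
    + "by the browser as <u>" + used + "</u>";
  // defaultCharset exists only in IE, so mention it only when present.
  if (doc.defaultCharset) {
    html += " and the browser's default encoding happens to be <u>"
      + doc.defaultCharset + "</u>";
  }
  return html + ".</tt></p>";
}

// In a page, one would call: document.write(charsetReportHtml(document));
console.log(charsetReportHtml({ characterSet: "UTF-8" }));
```

Feature detection of this kind avoids guessing from the user-agent string, at the cost of reporting less detail in browsers that expose fewer properties.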
Thanks for your input. I will experiment further and adapt as much as I can make to work for both Mozilla and MSIE.
<(+)__    Gan Uesli Starling
 ((__/)=- Kalamazoo, MI, USA
  `||`    ++ http://starling.us