Subject: Re: [xsl] Perl workaround for XSLT escaped UTF-8 pass-thru to XHTML
From: Gan Uesli Starling <alias@xxxxxxxxxxx>
Date: Fri, 21 Mar 2003 23:08:21 -0500

Mike Brown wrote:
> Your stylesheet is hard-coded to create the following element in the HTML
> output, after the <title>, even though the output encoding might not actually
> be UTF-8:
>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
>
> Your stylesheet contains the following hint to the processor as to how to
> serialize the result tree:
>
> <xsl:output method="html"/>
>
> Because you did not specify a preferred encoding for output (e.g. <xsl:output
> method="html" encoding="UTF-8"/>), MSXML is going to default to UTF-16.

Okay, I tried that. Did not work for me. Am not using MS-anything for XSLT here but Apache Xalan on a NetBSD Unix box. Result of doing as you suggest with my tools is here...

http://starling.ws/foo.html
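
To be concrete, the output element in my stylesheet now reads just as you suggest:

<xsl:output method="html" encoding="UTF-8"/>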

> Further, since in your example you are invoking the transformation via an
> xml-stylesheet PI, rather than from script in an HTML document or separate
> msxml.dll-using code altogether (e.g. msxsl.exe), the browser is invoking
> MSXML in a manner that forces the output to be UTF-16 *regardless* of what
> encoding was asked for in xsl:output.
>
> In either case, following the guidelines for HTML output, MSXML adds the
> following *before* the <title>:
>
> <META http-equiv="Content-Type" content="text/html; charset=UTF-16">

I'm not sure what you mean by 'PI', but the rest makes sense. So I put this...

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

...BEFORE the <title></title>. I went with UTF-8 rather than UTF-16 to match your prior suggestion above, so that the two declarations agree. Alas, still no go.
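
If by 'PI' you mean the processing instruction atop my XML, mine reads along these lines (the href here is illustrative):

<?xml-stylesheet type="text/xsl" href="mystyle.xsl"?>

And the head of the output now begins roughly so:

<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>...</title>
...
</head>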


As an experiment I changed the declaration to UTF-16 and had a peek with both MSIE and Mozilla, but results were the same. So I put it back to UTF-8.

> You can see this if you have installed the "tools for viewing and validating
> XSL output" (or whatever they're calling it today) from the MSDN site.

Running a BSD here. Have a Win2K box for my son to play games on, and I do VNC to it to cross-check my pages for XHTML. Otherwise I'm always on Unix (pardon '*nix').

> Since this META comes first, the browser ignores the one that says UTF-8. This
> is good, because your HTML isn't UTF-8 encoded, it is UTF-16 encoded. The
> document's bytes have to be decoded into characters properly or you'll get a
> big nonsensical mess. So the UTF-16 bytes are mapped back into characters
> properly, and the markup is parsed and rendered appropriately.

Hm, you keep saying 'META' tag versus 'meta' tag. My understanding was that XHTML required all lower-case tags. So I tried it with META but that too was no go.

> If I put encoding="utf-8" in the xsl:output instruction, and invoke the
> transform without using the xml-stylesheet PI, such as via msxsl.exe, I can
> get actual UTF-8 output that contains a generated META that says UTF-8.
>
> So you should get rid of that hard-coded meta tag... the processor will
> generate the correct one.

Java Xalan appears to issue no such tag. Having found none is why I hard-coded it.
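
For what it's worth, I run the transform from the command line, something along these lines (file names are placeholders):

java org.apache.xalan.xslt.Process -IN page.xml -XSL page.xsl -OUT page.html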

> Now as for your other problem, I fail to see what the problem is with having
> encoded characters in the document. If the document is UTF-8 or UTF-16
> encoded, there's no need to ever use a numeric character reference, so long as
> the document or delivery protocol properly declares the encoding. The browser
> will handle it just fine, as long as the user has not overridden the encoding
> (which, sadly, they are allowed to do).

Is on an Apache 2.0 server on NetBSD 1.6 in my home. May be that Apache is defaulting to UTF-8 for XML and not for HTML? Strangely, with the meta declaration in place for UTF-8 I get the proper characters for escaped UTF-8 (versus bare UTF-8) on the pages in BOTH browsers, no matter what the Encoding pull-down menu says in either.
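
One guess worth checking, if I'm reading the Apache docs right: the stock httpd.conf for Apache 2.0 ships with a line like

AddDefaultCharset ISO-8859-1

and a charset declared in the HTTP headers trumps any meta tag. Changing that to utf-8, or commenting it out, may be the real fix.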

I have observed this for a long time, since many of my XHTML pages (hand-coded, not XSLT) from a year past are in Esperanto. That is why I lean toward numeric character references.

> If you really want numeric character references anyway, then why use UTF-8 or
> UTF-16 output? Output ASCII instead, and the serializer will take care of
> generating NCRs for you. No need to run the document through a perl script.
>
> <xsl:output method="html" encoding="us-ascii"/>

I may try that next, except that it would be nice to employ a Unicode-aware text editor such as mined or Yudit for the original.
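
If I read the serializer right, a literal Esperanto ĉ (U+0109) typed in such an editor would then come out on the wire as the reference &#265;, with nothing lost. For example:

saluton ĉiuj  -->  saluton &#265;iuj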

> It's ironic that the document you are rendering attempts to explain some of
> these issues.
>
> For example, in the document's content, you imply that HTML doesn't "use
> Unicode" and that XML does. This is wrong. Both HTML 4.0 and XML 1.0 consist
> of a string of Unicode characters, taken from almost exactly the same
> repertoire (all of Unicode, minus a handful of control and non-characters). In
> both HTML and XML, these characters manifest as bytes according to the same
> set of encodings: those approved for Internet use by the IANA (utf-8, utf-16,
> iso-8859-1, etc.). Both HTML and XML use either decimal or hexadecimal character
> references (sequences of universally encodable characters like & # 1 6 0 ;) to
> represent single Unicode characters that typically wouldn't be supported in
> the document's actual encoding. The only real difference between XML and HTML,
> character-wise, is in how the documents declare their encoding internally
> (meta tag vs prolog), in whether it is a fatal error for the encoding
> declaration to be wrong, and in the fact that HTML defines many more built-in
> entities (nbsp, etc.) to represent characters.
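
If I follow, then for example &#160; (decimal) and &#xA0; (hexadecimal) both name U+00A0, the no-break space, in HTML and XML alike; only the &nbsp; spelling needs HTML (or the XHTML DTD) to define it.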

Am regularly checking the HTML output against both MSIE 6 SP 1 under Win2K and Mozilla 1.2.1 under NetBSD. Have observed, and still observe, the encoding glitch...as if the browser fails to read the meta tag. This even after changing the xsl:output tag in the XSLT and moving the hard-coded meta tag above the title. But it is only so with bare UTF-8. It does not seem to happen when I employ numeric character references.


> Some old browsers (notably Netscape 4.x) are horribly nonconformant with
> regard to encoding and how they interpret both entity references and numeric
> character references, but they're not something you should be putting too much
> effort into working around.

I try to keep them in mind. But I only really test against MSIE and Mozilla in their latest-and-greatest versions.

> And at the bottom of the document, you tell people to switch to UTF-8, but
> changing the encoding (decoding, actually) is not an option in IE, when
> viewing browser-rendered XSLT output (i.e. when rendering an XML doc that
> contains an xml-stylesheet PI pointing to an XSLT stylesheet).

I will re-word it. Meant it only for viewing the *.html file that results from the XSLT of the XML, care of Xalan.

> You might also be interested in this little snippet:
>
> <script type="text/javascript"><!--
> if ( navigator.userAgent.toLowerCase().indexOf("msie") != -1 &&
>      (parseInt(navigator.appVersion) >= 4) ) {
>   document.write( "<p><tt>Since you are using IE 4.0 or higher and do not " +
>     "have scripting disabled, I can tell that this generated HTML document " +
>     "is being interpreted by the browser as <u>" + document.charset +
>     "</u> and that the browser's default encoding happens to be <u>" +
>     document.defaultCharset + "</u>.</tt></p>" );
> }
> //--></script>

Looks interesting. Appears to be somewhat MS-specific. Suppose I could tweak it for a more broad-spectrum report?

Thanks for your input. I will experiment further and adapt as much as I can make to work for both Mozilla and MSIE.

Respectfully,

Gan

--

Mistera Sturno - Rarest Extinct Bird

 <(+)__       Gan Uesli Starling
  ((__/)=-    Kalamazoo, MI, USA
   `||`
    ++        http://starling.us


XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list


