Subject: Re: [xsl] Perl workaround for XSLT escaped UTF-8 pass-thru to XHTML
From: Mike Brown <mike@xxxxxxxx>
Date: Fri, 21 Mar 2003 20:02:19 -0700 (MST)

Gan Uesli Starling wrote:
> Gan Uesli Starling wrote:
> > I have an XML here...
> > 
> > http://starling.ws/XML/howto.xml
> > 
> > ...which uses this XSLT...
> > 
> > http://starling.ws/XML/howto.xsl
> > 
> > ...and looks fine when viewed in either
> > Mozilla 1.2.1 or MSIE 6 SP 1 as *.xml.
> > 
> > But when I use that same XSLT to output
> > to *.html as a file, then go to view it
> > as *.html with those same browsers then
> > the UTF-8 (since it is not escaped with
> > ampersand-pound) does not display...
> 
> Nobody answered my plea about passing escaped
> UTF-8 thru from XML to HTML. So I cobbled my
> own ex-post-facto Perl solution. Not elegant,
> but at least it works. See results at...

You're making this much harder than it needs to be.

Your stylesheet is hard-coded to create the following element in the HTML
output, after the <title>, even though the output encoding might not actually
be UTF-8:

  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

Your stylesheet contains the following hint to the processor as to how to
serialize the result tree:

  <xsl:output method="html"/>

Because you did not specify a preferred encoding for the output (e.g.
<xsl:output method="html" encoding="UTF-8"/>), MSXML is going to default to
UTF-16.

Further, since in your example you are invoking the transformation via an
xml-stylesheet PI, rather than from script in an HTML document or from
separate code that uses msxml.dll (e.g. msxsl.exe), the browser invokes MSXML
in a manner that forces the output to be UTF-16 *regardless* of what encoding
was asked for in xsl:output.
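
By xml-stylesheet PI I mean the processing instruction at the top of your XML
document, which presumably looks something like this (the href here is just a
guess):

  <?xml-stylesheet type="text/xsl" href="howto.xsl"?>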

In either case, following the guidelines for HTML output, MSXML adds the
following *before* the <title>:

  <META http-equiv="Content-Type" content="text/html; charset=UTF-16">

You can see this if you have installed the "tools for viewing and validating
XSL output" (or whatever they're calling it today) from the MSDN site.

Since this META comes first, the browser ignores the one that says UTF-8. This
is good, because your HTML isn't UTF-8 encoded, it is UTF-16 encoded. The
document's bytes have to be decoded into characters properly or you'll get a
big nonsensical mess. So the UTF-16 bytes are mapped back into characters
properly, and the markup is parsed and rendered appropriately.

If I put encoding="utf-8" in the xsl:output instruction, and invoke the
transform without using the xml-stylesheet PI, such as via msxsl.exe, I can
get actual UTF-8 output that contains a generated META that says UTF-8.
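
With msxsl.exe that's just a one-liner on the command line (file names guessed
from your URLs):

  msxsl howto.xml howto.xsl -o howto.html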

So you should get rid of that hard-coded meta tag... the processor will
generate the correct one.
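
That is, something along these lines (a sketch, not your actual stylesheet;
the XPath in the title is made up):

  <xsl:output method="html" encoding="UTF-8"/>
  ...
  <head>
    <title><xsl:value-of select="title"/></title>
    <!-- no hard-coded meta here; the serializer adds a charset META
         itself, and it will match the encoding it actually used -->
  </head>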

Now as for your other complaint: I fail to see what the problem is with having
the characters appear directly, unescaped, in the document. If the document is
UTF-8 or UTF-16
encoded, there's no need to ever use a numeric character reference, so long as
the document or delivery protocol properly declares the encoding. The browser
will handle it just fine, as long as the user has not overridden the encoding
(which, sadly, they are allowed to do).
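
In the HTTP case, "the delivery protocol declares the encoding" just means the
server sends a header along the lines of:

  Content-Type: text/html; charset=utf-8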

If you really want numeric character references anyway, then why use UTF-8 or
UTF-16 output? Output ASCII instead, and the serializer will take care of
generating NCRs for you. No need to run the document through a Perl script.

  <xsl:output method="html" encoding="us-ascii"/>
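
With that in effect, anything outside of ASCII gets escaped for you on output;
a literal e-acute (U+00E9) in the source, for instance, comes out something
like this (whether the serializer picks decimal or hex references is its own
business):

  <p>caf&#233;</p>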

It's ironic that the document you are rendering attempts to explain some of
these issues.

For example, in the document's content, you imply that HTML doesn't "use
Unicode" and that XML does. This is wrong. An HTML 4.0 document and an XML 1.0
document each consist of a string of Unicode characters, taken from almost
exactly the same repertoire (all of Unicode, minus a handful of control
characters and noncharacters). In both HTML and XML, these characters manifest
as bytes according to the same set of encodings: those approved for Internet
use by the IANA (utf-8, utf-16, iso-8859-1, etc.). Both HTML and XML use
either decimal or hexadecimal character references (sequences of universally
encodable characters like & # 1 6 0 ;) to represent single Unicode characters
that typically can't be represented in the document's actual encoding. The
only real differences between XML and HTML, character-wise, are in how the
documents declare their encoding internally (META tag vs. XML declaration in
the prolog), in whether it is a fatal error for the encoding declaration to be
wrong, and in the fact that HTML defines many more built-in entities (nbsp,
etc.) to represent characters.
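
To make that concrete, here is one and the same character, U+00A0 (no-break
space), in several equivalent spellings:

  &#160;       decimal character reference (works in both HTML and XML)
  &#xA0;       hexadecimal character reference (works in both HTML and XML)
  &nbsp;       named entity (built into HTML; not predefined in XML)
  0xC2 0xA0    the raw bytes, if the document happens to be UTF-8 encoded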

Some old browsers (notably Netscape 4.x) are horribly nonconformant with
regard to encoding and how they interpret both entity references and numeric
character references, but they're not something you should be putting too much
effort into working around.

And at the bottom of the document, you tell people to switch to UTF-8, but
changing the encoding (decoding, actually) is not an option in IE, when
viewing browser-rendered XSLT output (i.e. when rendering an XML doc that
contains an xml-stylesheet PI pointing to an XSLT stylesheet).

You might also be interested in this little snippet:

<script type="text/javascript"><!--
  // document.charset and document.defaultCharset are IE-only properties
  if ( navigator.userAgent.toLowerCase().indexOf("msie") != -1
       && parseInt(navigator.appVersion) >= 4 ) {
    document.write( "<p><tt>Since you are using IE 4.0 or higher and do not "
      + "have scripting disabled, I can tell that this generated HTML "
      + "document is being interpreted by the browser as <u>"
      + document.charset + "</u> and that the browser's default encoding "
      + "happens to be <u>" + document.defaultCharset + "</u>.</tt></p>" );
  }
//--></script>

Mike

-- 
  Mike J. Brown   |  http://skew.org/~mike/resume/
  Denver, CO, USA |  http://skew.org/xml/
