Re: [xsl] Problem with Chinese (Solution)

Subject: Re: [xsl] Problem with Chinese (Solution)
From: "Michael Beddow" <mbnospam@xxxxxxxxxxx>
Date: Wed, 8 Aug 2001 08:36:32 +0100
Glad to see your problem was solved, Shaun, and your posting a full
summary is much appreciated, but there are a few points in your
explanation of how the solution worked that need comment:

> This works great for standard encodings, but it will never work for
> encodings like Chinese (GB2312).
>
If by "standard encodings" you mean utf-8 or us-ascii, you're right, but
only because the byte encodings of the abstract characters common to
both are identical, so in the absence of an encoding declaration the
parser assumes the default utf-8 and all is well. You would also get
away with it if your data were in ISO-8859-1 and happened not to contain
any characters outside the subrange that overlaps with us-ascii, but
that would be sheer luck. For your "encodings like..." you need to
substitute "any encoding other than the default", which would include,
say, ISO-8859-1 containing accented characters. Such encodings must be
appropriately declared in your input and output xml, otherwise the parse
will fail (and of course you also have to load the data as xml, using
the specific call provided for that purpose).
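
To make the point concrete, here is a minimal sketch in Python (not
MSXML; the standard-library ElementTree parser stands in for "the
parser", and the sample string and the helper name parses() are my own
assumptions). It shows that ASCII-only text has the same bytes in all
three encodings, and that undeclared ISO-8859-1 with one accented
character is rejected while the declared version loads.

```python
import xml.etree.ElementTree as ET

# ASCII-only text: identical bytes in us-ascii, utf-8 and ISO-8859-1.
plain = "cafe"
assert plain.encode("ascii") == plain.encode("utf-8") == plain.encode("iso-8859-1")

# One accented character and the encodings diverge.
text = "caf\u00e9"
latin1_bytes = text.encode("iso-8859-1")   # b'caf\xe9'
utf8_bytes = text.encode("utf-8")          # b'caf\xc3\xa9'


def parses(doc: bytes) -> bool:
    """Hypothetical helper: True if the parser accepts the document."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# No declaration: the parser assumes utf-8, and 0xE9 followed by '<'
# is not a valid utf-8 sequence, so the parse fails.
undeclared = b"<p>" + latin1_bytes + b"</p>"

# Same bytes, but with the encoding declared: the parse succeeds.
declared = (b'<?xml version="1.0" encoding="ISO-8859-1"?><p>'
            + latin1_bytes + b"</p>")

print(parses(undeclared), parses(declared))
```

The same reasoning applies to GB2312 input: without a declaration the
bytes are read as utf-8, and anything outside the us-ascii range is
then either rejected or misread.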

>
> However !!!  I did notice one interesting undesirable "feature" in the
> MSXML.  If you put in some <HEAD></HEAD> tags into the
> above XSL, then your output HTML contains the following by
> magic.
>
>   <head>
>   <META http-equiv="Content-Type" content="text/html; charset=UTF-16">
>   </head>

ISTR that this has been touched on here before, but since I'm not an
intensive MSXML user I can't be sure. Rogue reversions to UTF-16 did
occur with some earlier MS xml handling, but I thought that was now
fixed. Do you still get this if you specify the correct output encoding
attribute in an xsl:output element in your XSL?  If so, what happens if
you also explicitly generate an HTML HEAD that includes a META tag with
the correct charset declaration? Does this still produce a charset value
of UTF-16? If so, that would indeed be a bug, though I'm not convinced
that the behaviour as you've described it is one.

> Maybe this is what MSXML thinks the closest thing to GB2312 is.

Surely no one at Microsoft could be daft enough to think that.

> The bad thing is that the IE5.5 browser doesn't know how to
> Auto-Select the GB2312 encoding when this is present.  This might be
> considered a bug.

Well, one can argue about the wisdom of including an auto-select feature
at all, and question the heuristics by which IE5's auto-select operates,
but what's happening here is that IE5.5 sees that the page author has
gone to the bother of declaring a charset in a META tag and so believes
what that tag says. That seems to me a defensible choice on the part of
the coders, and again not really a bug.
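
A rough sketch of that "believe the META tag" behaviour, again in
Python. This is emphatically not IE's actual logic, which I have not
seen; the regular expression and the function names declared_charset()
and decode_as_declared() are assumptions made up for illustration. It
only shows why GB2312 bytes labelled as UTF-16 come out as rubbish when
the label is trusted.

```python
import re

# Assumed pattern: the first "charset=NAME" found in the page wins.
META_CHARSET = re.compile(rb"charset=([A-Za-z0-9_-]+)", re.IGNORECASE)


def declared_charset(page: bytes, default: str = "utf-8") -> str:
    """Return the charset named in the page, or the default if none."""
    match = META_CHARSET.search(page)
    return match.group(1).decode("ascii") if match else default


def decode_as_declared(page: bytes) -> str:
    """Decode the page using whatever charset it declares for itself."""
    return page.decode(declared_charset(page), errors="replace")


# Two Chinese characters, encoded as GB2312.
body = "\u4e2d\u6587".encode("gb2312")
head = b'<META http-equiv="Content-Type" content="text/html; charset=%s">'

wrong = head % b"UTF-16" + body   # label disagrees with the bytes
right = head % b"GB2312" + body   # label matches the bytes

print("\u4e2d\u6587" in decode_as_declared(wrong))
print("\u4e2d\u6587" in decode_as_declared(right))
```

With the matching label the two characters survive the round trip;
with the UTF-16 label the decoder does exactly what it was told and
the text is lost.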

Michael
---------------------------------------------------------
Michael Beddow   http://www.mbeddow.net/
XML and the Humanities page:  http://xml.lexilog.org.uk/
---------------------------------------------------------


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

