Re: [xsl] encoding issues

Subject: Re: [xsl] encoding issues
From: Mike Brown <mike@xxxxxxxx>
Date: Wed, 3 Jul 2002 12:29:33 -0600 (MDT)
Andrew Welch wrote:
> Ahaaa, so if I use the xml output method with xhtml doctypes, coupled
> with a hand coded <meta> tag I can get xhtml output that's decoded as
> Unicode.  After testing - this works fine for me, is it ok?

Well, which did you want? HTML or XHTML?

As David Carlisle pointed out, IE only handles XHTML to the extent that it can
treat it either as sloppy HTML, or as XML that needs to be further styled with
CSS or XSLT or Microsoft's WD-xsl that refuses to die. The media type 
associated with the document (either the Content-Type sent in the HTTP message 
or the locally registered type associated via the filename extension) 
determines which avenue is taken.

XML and HTML both use Unicode, in the high-level, abstract sense of
unambiguously numbered 'characters'. (That is not the same thing as the one
particular bytewise encoding of those characters that is properly called
UTF-16 but that IE and Notepad call 'Unicode'.) So if the document arrives as
bytes, which is almost always the case, it must first be *de*coded back into
characters.
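
To make the characters-versus-bytes distinction concrete, here is a minimal
sketch in Python (my choice of language for illustration only; the sample
string and variable names are made up). It encodes one character sequence two
ways and decodes each back:

```python
# One abstract character sequence, two different bytewise encodings.
text = "caf\u00e9"  # U+00E9 is LATIN SMALL LETTER E WITH ACUTE

utf8_bytes = text.encode("utf-8")       # 5 bytes: e-acute takes two
utf16_bytes = text.encode("utf-16-le")  # 8 bytes: two per character, no BOM

# The byte sequences differ, but decoding each with the matching
# character map recovers exactly the same characters.
same_text = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le") == text
```

The point is only that 'Unicode' names the characters, while UTF-8 and UTF-16
name two of the ways to turn them into bytes.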

In XML the way you say "this document's bytes can be mapped unambiguously to
Unicode characters according to the utf-8 character map" is (in the absence of
external information) via the encoding declaration:

<?xml version="1.0" encoding="utf-8"?>

In HTML the way you say "this document's bytes can be mapped unambiguously to 
Unicode characters according to the utf-8 character map" is (in the absence of 
external information) via the meta element in the document head:

<meta http-equiv="Content-Type" content="text/html;charset=utf-8">

When IE is auto-selecting the encoding and treating the document as HTML, it
looks there, unless the HTML was delivered via an HTTP message whose
Content-Type header carried its own charset parameter.
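
The lookup order just described can be sketched roughly as follows, in Python.
This is not how IE or any real browser is implemented: pick_charset is a
hypothetical helper, the regular expressions are deliberately naive, and the
iso-8859-1 fallback is an assumption for the sake of the example.

```python
import re

# Naive pattern for a charset inside a <meta ...> tag (illustrative only).
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE)
HEADER_CHARSET = re.compile(
    r'charset\s*=\s*"?([A-Za-z0-9_.:-]+)', re.IGNORECASE)

def pick_charset(http_content_type, html_bytes, fallback="iso-8859-1"):
    """Hypothetical helper: HTTP charset first, then the meta element."""
    if http_content_type:
        m = HEADER_CHARSET.search(http_content_type)
        if m:
            return m.group(1).lower()
    # The meta element is plain ASCII in any ASCII-compatible encoding,
    # so it can be found before the real encoding is known.
    m = META_CHARSET.search(html_bytes[:1024])
    if m:
        return m.group(1).decode("ascii").lower()
    return fallback  # assumed default, not something the document declares

page = (b'<html><head><meta http-equiv="Content-Type" '
        b'content="text/html;charset=utf-8"></head><body></body></html>')
```

With that page, pick_charset(None, page) picks up the meta element, while
pick_charset("text/html; charset=Shift_JIS", page) lets the HTTP header win.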

If you are generating HTML from an XSLT transform, then (usually) the XSLT
processor is going to insert the meta element in the head for you. If you are
generating XML, then the XSLT processor is going to insert the encoding
declaration for you, unless you have suppressed the XML declaration with
omit-xml-declaration="yes" on xsl:output.

> >(which is a bit risky
> >in general as you cannot be sure that the system will use the encoding
> >you ask for)
> 
> How can you be sure?

Due to the large number of HTML documents on the web that declare or are
served with the wrong charset parameter in the Content-Type (typically
documents encoded in one of the CJK encodings but that are mis-served as
iso-8859-1), browser manufacturers have been allowing the user to force all
documents to be decoded according to a certain character map, overriding any
declared encodings. In other words, you can turn that auto-select off, even 
though in theory it really shouldn't be an option.
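
For a feel of what a mis-served document looks like, here is a small Python
sketch. The sample text and the choice of Shift_JIS are mine, picked only as
one example of a CJK encoding:

```python
text = "\u65e5\u672c\u8a9e"  # three CJK characters

raw = text.encode("shift_jis")      # the bytes actually sent over the wire

# Decoding as iso-8859-1 never raises an error, because every byte value
# maps to some character. The result is simply the wrong characters.
wrong = raw.decode("iso-8859-1")

# Decoding with the character map the author really used gives the text back.
right = raw.decode("shift_jis")
```

Nothing in the bytes themselves says which decoding is right, which is why
browsers ended up giving users a manual override.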

> Is it the case that you can't be sure they will
> have it installed?

Since HTML doesn't require that user-agents have the power to handle any
particular encoding, there is always a risk that the encoding you choose for
your document won't be something the browser knows what to do with.

XML at least requires parsers to handle UTF-8 and UTF-16.
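
As a quick check of that requirement against one real parser, the following
Python sketch feeds the same small document to the standard library's
ElementTree parser in both encodings. The sample document is made up, and this
shows one parser's behavior, not a proof about all of them:

```python
import xml.etree.ElementTree as ET

template = '<?xml version="1.0" encoding="{enc}"?><p>caf\u00e9</p>'

# Same characters, serialized as bytes in the two required encodings.
utf8_doc = template.format(enc="utf-8").encode("utf-8")
utf16_doc = template.format(enc="utf-16").encode("utf-16")  # Python adds a BOM

# The parser decodes each byte stream back to the same characters.
from_utf8 = ET.fromstring(utf8_doc).text
from_utf16 = ET.fromstring(utf16_doc).text
```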

> In which case, with utf-8 I'm pretty safe?

Yes.

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

