Re: [xsl] special characters in xml text paramter

Subject: Re: [xsl] special characters in xml text paramter
From: Mike Brown <mike@xxxxxxxx>
Date: Wed, 20 Nov 2002 14:59:45 -0700 (MST)

Alice Fan wrote:
> but how do i convince my browser?  It doesn't have anything to do with 
> browser versions right?

DECLARING THE ENCODING OF AN HTML DOCUMENT

Remember, a document is just a bunch of bytes when the browser (the HTML user
agent, to be more general) reads it off the network or disk. The browser has
to figure out how those bytes map to characters: the encoding.
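To make the bytes-versus-characters point concrete, here is a minimal Python sketch. It is only an illustration, not anything a browser literally runs, and the sample text is an assumption chosen for the example. It shows one byte sequence decoded under two different encodings:

```python
# Sample text chosen for illustration: a copyright sign, a space, a euro sign.
text = "\u00a9 \u20ac"

# Serialize the characters to bytes using utf-8.
utf8_bytes = text.encode("utf-8")

# Decoding with the encoding that was actually used recovers the characters.
decoded_correctly = utf8_bytes.decode("utf-8")

# Decoding the very same bytes as iso-8859-1 does not raise an error,
# because every byte value is a valid character there, but the result
# is a different (wrong) string.
decoded_wrongly = utf8_bytes.decode("iso-8859-1")

print(utf8_bytes)
print(decoded_correctly == text)
print(decoded_wrongly == text)
```

The bytes are identical in both cases; only the assumed encoding differs, which is why the declaration matters.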

You are supposed to tell the browser what the HTML document's encoding is by
putting that info in the "charset" parameter of the Content-Type header of the
HTTP response message that delivers the HTML. You can control this through
whatever mechanism your HTTP server offers for doing so.
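As a rough sketch of what reading that parameter looks like, the following Python uses the standard library's email.message.Message class to pull the charset out of a Content-Type value. The helper name charset_from_content_type and the sample header values are assumptions made up for this example; real user agents have their own parsing logic.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns the charset lowercased, or None
    # when the header carries no charset parameter.
    return msg.get_content_charset()


declared = charset_from_content_type("text/html; charset=utf-8")
undeclared = charset_from_content_type("text/html")

print(declared)
print(undeclared)
```

When the parameter is absent, as in the second call, the receiver has nothing to go on from the transport and must fall back on other hints.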

HTML also provides a facility for embedding the same info in the HTML document
itself, and this is generally what most people do, rather than messing with
the HTTP server. In the document head, right after the title, they put:

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If you are using the HTML output method in your XSLT processor, then it
normally (although this is not a requirement) will add the meta tag to the
document head for you. If it's not doing this, then add it yourself, via your
stylesheet.

Of course, your browser has to be smart enough to honor this info, and you
must not do anything to your browser to override its ability to do so.
Browsers often do let you override their behavior, so that you can correctly
view a document that has a misdeclared or undeclared encoding. For example, many
iso-2022-jp documents are served up as if they were iso-8859-1, so Japanese
users keep their browsers set to ignore the declared encoding and always use
iso-2022-jp instead.

One thing you may have forgotten to do is tell the XSLT processor your
desired output encoding. For example, in your stylesheet,

  <xsl:output method="html" encoding="iso-8859-1"/>

would give you iso-8859-1 encoded output, where there is just 1 byte per
character. With this particular encoding, characters beyond the first 256 code
points of Unicode are not representable directly as bytes, so they will be
emitted by the XSLT processor as character entity references ("&euro;") or
numeric character references ("&#8364;"). Depending on the XSLT processor, the
upper 128 of that 256 may also be emitted as character entity references
("&copy;"), in order to retain compatibility with Netscape 4.x, which is
horribly nonconformant in its handling of single-byte document encodings.
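The fallback to character references can be imitated outside of XSLT. The Python sketch below is an analogy, not what any particular XSLT processor does internally, and the sample string is an assumption. It encodes text to iso-8859-1 and lets the codec substitute numeric character references for anything the encoding cannot hold:

```python
# Sample text (an assumption for illustration): the copyright sign U+00A9
# fits in iso-8859-1, the euro sign U+20AC does not.
text = "\u00a9 2002, price 5 \u20ac"

# "xmlcharrefreplace" replaces unencodable characters with decimal
# numeric character references instead of raising an error.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(latin1_bytes)  # b'\xa9 2002, price 5 &#8364;'
```

Note that Python only ever produces numeric references here; an XSLT processor using the HTML output method is free to choose named entity references instead.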

Generally, it is safe to use utf-8 as the output encoding. It gives you the
full range of Unicode directly as 1 to 4 bytes per character, obviating the
need for character references or entity references. As long as you declare the
charset in the meta tag or in the transport, and the browser is not completely
brain-dead, the document's bytes will be decoded correctly. The same cannot be
said for your generic text editor.
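For the curious, the 1 to 4 bytes claim is easy to check. This small Python sketch, with sample characters picked arbitrarily for illustration, encodes one character from each length class and confirms that decoding restores the originals:

```python
# One sample character per utf-8 length class (choices are arbitrary):
# U+0041 LATIN CAPITAL LETTER A, U+00E9 e with acute,
# U+20AC EURO SIGN, U+1D11E MUSICAL SYMBOL G CLEF.
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

# Number of bytes each character occupies when encoded as utf-8.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

# Encoding and then decoding as utf-8 gives back the same text,
# with no character references needed.
round_trip = "".join(samples).encode("utf-8").decode("utf-8")

print(byte_lengths)  # [1, 2, 3, 4]
print(round_trip == "".join(samples))
```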

   - Mike
____________________________________________________________________________
  mike j. brown                   |  xml/xslt: http://skew.org/xml/
  denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

