Re: copying XML containing Unicode to HTML

Subject: Re: copying XML containing Unicode to HTML
From: Tony Graham <tgraham@xxxxxxxxxxxxxxxx>
Date: Thu, 18 May 2000 09:24:59 -0400 (EST)
At 17 May 2000 16:55 -0700, Dan Cornell wrote:
 > <xsl:output method="HTML" />
 > ...
 > <xsl:text disable-output-escaping = "yes">"&gt;</xsl:text>
 > <xsl:value-of select="TX" disable-output-escaping="no"/>
 > <xsl:text disable-output-escaping = "yes">&lt;/span&gt;</xsl:text>

Why you're faking elements with text escapes me.

 > The problem is with the contents of the TX node: copying the contents of
 > <TX> results in the special UNICODE characters &#x2014; getting translated
 > by the XSL processor into a question mark (?).  The output looks like so:
 > ?? Unescaped Text.

You don't say which XSLT processor you are using, but at a guess, your
output is in UTF-8 and whatever you're using to view the output can't
cope with UTF-8 or can't represent the &#x2014; character (EM DASH).

The fact that you have two question marks, and not 4, 6, or 16, shows
that the bytes representing each character are being correctly parsed
as a single character (whereas output of ' ? ?' would indicate
UTF-16 text being parsed as 8-bit characters).
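
To make the byte counts concrete, here is a small Python sketch (not
part of the original exchange; the variable names are only
illustrative). It encodes two EM DASH characters in a couple of
encodings and shows that an encoder that cannot represent the
character substitutes one '?' per character, not one per byte.

```python
# Two EM DASH characters (U+2014), as in the TX content.
dashes = "\u2014\u2014"

utf8_bytes = dashes.encode("utf-8")       # 3 bytes per character
utf16_bytes = dashes.encode("utf-16-be")  # 2 bytes per character

# A target encoding without EM DASH replaces each *character* with '?'.
replaced = dashes.encode("ascii", errors="replace")

# Misreading the UTF-8 bytes as ISO-8859-1 yields one character per byte.
misread = utf8_bytes.decode("iso-8859-1")

print(len(utf8_bytes), len(utf16_bytes), replaced, len(misread))
```

Two question marks for two characters is therefore consistent with
the text having been decoded correctly and only then failing at the
point where it is displayed or re-encoded.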

Since you're producing HTML, see if the XSLT processor inserted a
<meta> tag indicating the charset.

 > What I would like instead is to have the contents of the <TX> node copied to
 > the output HTML unchanged, i.e. I would like to see the following:

Your input was changed when your XML file was parsed: the characters
that you represented with numeric character references will be
represented in the guts of the XML parser or XSLT processor using the
actual characters.  The best you can do is reconstruct the numeric
character references in output, since there's no way that those
characters were 'unchanged' after being parsed.
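
As a rough illustration of that point, the sketch below uses Python's
standard xml.etree.ElementTree parser as a stand-in for whatever
parser sits under your XSLT processor (an assumption; the behaviour
shown is what any conforming XML parser does with character
references). After parsing, the text holds the actual characters and
no trace of the reference syntax remains.

```python
import xml.etree.ElementTree as ET

# The TX element as it might appear in the source document.
source = "<TX>&#x2014;&#x2014; Unescaped Text.</TX>"
tx = ET.fromstring(source)

# The parser has already resolved the references to real characters.
print(repr(tx.text))
print("&#x2014;" in tx.text)  # the reference syntax is gone
```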

 > &#x2014;&#x2014; Unescaped Text.
 > Do you know how to do this?

Specify an output encoding (the encoding attribute of xsl:output),
e.g. ISO-8859-1, that cannot represent &#x2014;, so that the XSLT
processor is forced to output a numeric character reference for the
character.
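
The same effect can be sketched outside XSLT. The Python example
below is only an analogy, using the standard library's ElementTree
serializer rather than an XSLT processor, and the element name "span"
is taken from the quoted stylesheet. Serializing to ISO-8859-1 forces
a character reference; serializing to UTF-8 writes the raw bytes.
Note that this serializer emits the decimal form (&#8212;), which is
equivalent to &#x2014;; which form an XSLT processor chooses is up to
the processor.

```python
import xml.etree.ElementTree as ET

# Build the element that the stylesheet is trying to produce.
span = ET.Element("span")
span.text = "\u2014\u2014 Unescaped Text."

# ISO-8859-1 has no EM DASH, so the serializer falls back to a
# numeric character reference for each occurrence.
latin1_out = ET.tostring(span, encoding="iso-8859-1", method="html")

# UTF-8 can represent EM DASH directly, so no reference is needed.
utf8_out = ET.tostring(span, encoding="utf-8", method="html")

print(latin1_out)
print(utf8_out)
```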


Tony Graham
Tony Graham                            mailto:tgraham@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.      
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
