Re: Special entity characters in Shift-JIS XSL.

Subject: Re: Special entity characters in Shift-JIS XSL.
From: Tony Graham <tgraham@xxxxxxxxxxxxxxxx>
Date: Wed, 15 Dec 1999 13:05:11 -0400 (EST)
At 15 Dec 1999 08:55 -0500, Douglas Weed wrote:
 > An application has been developed which uses the Microsoft MSXML parser
 > enclosed in a DLL to apply XSL files against an XML stream.  The encoding is
 > in Shift-JIS as the application is double byte. The net result of the
 > application is HTML.  The target browser has been developed to understand
 > certain 'special characters' or entities, which in themselves are double
 > byte.  Much in the same way &#39; maps to an apostrophe.  For example
 > &#249;&#134; would yield a special 2 byte character which is a Q surrounded
 > by a circle.  If this character sequence is placed directly into a .htm
 > page, it works.  However, as I suspected, when placed within an xsl file and
 > transformed with the xml, it yields nothing since the parser tries to format
 > it.  I attempted to use an in-line DTD to define the entity and use the
 > definition within the XML file, however, MSXML has some real difficulties
 > handling an in-line DTD when the XML is a character string and not a file.
 > The work-arounds specified by MS are not feasible.  The question : does
 > another technique exist to have the XSL file ignore &#249;&#134; and pass it
 > straight through to the HTML stream?  Sorry for the length of the message
 > and thanks for any responses. 

In XML, numeric character references always refer to Unicode code
values, regardless of the document's encoding.  A conforming
application should recognise &#249;&#134; as LATIN SMALL LETTER U WITH
GRAVE (U+00F9) followed by U+0086, one of the C1 control characters.
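
As a rough illustration of that point, here is a small sketch in Python
(my choice for illustration only; it uses Python's standard-library
parser, not MSXML, and the one-element document is made up).  It parses
the two references and reports what each one resolves to.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical one-element document holding the two references from the question.
doc = "<r>&#249;&#134;</r>"
text = ET.fromstring(doc).text

# Each numeric character reference becomes exactly one Unicode character.
code_points = [f"U+{ord(ch):04X}" for ch in text]
categories = [unicodedata.category(ch) for ch in text]  # "Cc" marks a control character
first_name = unicodedata.name(text[0])

print(code_points)
print(categories)
print(first_name)
```

The parser never sees the pair as a single double-byte character; it
sees two unrelated Unicode characters.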

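What comes out of your MSXML DLL almost certainly uses two bytes to
represent each of those two characters -- UTF-16 uses two bytes per
character for anything in the Basic Multilingual Plane, and UTF-8 also
uses two bytes per character for character numbers in that range.
Either way you get four bytes, not the two-byte Shift-JIS sequence the
browser expects.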
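
The byte counts can be checked with a short Python sketch.  It assumes
the two references have already been resolved to U+00F9 and U+0086, and
it says nothing about which encoding MSXML actually emits.

```python
# The two characters that &#249;&#134; resolves to in XML.
chars = "\u00f9\u0086"

utf16 = chars.encode("utf-16-le")  # little-endian, no byte order mark
utf8 = chars.encode("utf-8")
wanted = bytes([249, 134])         # the raw double-byte sequence the browser expects

print(utf16.hex(), len(utf16))
print(utf8.hex(), len(utf8))
print(wanted.hex(), len(wanted))
```

Neither encoding produces the two bytes F9 86, which is why nothing
useful reaches the browser.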

Relying on two numeric character references to represent a double-byte
sequence is fragile, as you have found.

The numeric character reference for the Unicode character CIRCLED
LATIN CAPITAL LETTER Q is &#x24C6;.
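
A quick way to confirm that reference, again sketched in Python with a
made-up one-element document rather than the poster's stylesheet:

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical document containing only the single hexadecimal reference.
circled_q = ET.fromstring("<r>&#x24C6;</r>").text
name = unicodedata.name(circled_q)

# One reference, one character.
print(len(circled_q), name)
```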

I don't know whether MSXML allows you to specify the output encoding.
However, if I'm correct in thinking that a circled Q is gaiji (a
user-defined character) in Shift-JIS, the character might be dropped in
a conversion to Shift-JIS anyway.
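
To show what such a conversion can do, here is a sketch using Python's
built-in "shift_jis" codec.  That codec is an assumption standing in
for whatever converter is really in the pipeline; it knows nothing about
vendor or user-defined gaiji tables, and MSXML may behave differently.
The helper try_encode is hypothetical.

```python
circled_q = "\u24c6"  # CIRCLED LATIN CAPITAL LETTER Q

def try_encode(text, codec):
    """Return the encoded bytes, or None if the codec cannot represent text."""
    try:
        return text.encode(codec)
    except UnicodeEncodeError:
        return None

# Strict conversion fails outright: the codec has no mapping for the character.
strict = try_encode(circled_q, "shift_jis")

# A lossy conversion silently substitutes a question mark.
replaced = circled_q.encode("shift_jis", errors="replace")

# Falling back to a numeric character reference keeps the information.
as_ref = circled_q.encode("shift_jis", errors="xmlcharrefreplace")

print(strict, replaced, as_ref)
```

The last form only helps if the receiving browser resolves numeric
character references as Unicode, which I can't confirm for the custom
browser described here.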

Regards,


Tony Graham
======================================================================
Tony Graham                            mailto:tgraham@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================




 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

