Re: multiple special characters in XML

Subject: Re: multiple special characters in XML
From: Tony Graham <tgraham@xxxxxxxxxxxxxxxx>
Date: Fri, 3 Sep 1999 00:35:08 -0400 (EST)

At 31 Aug 1999 18:26 -0700, regan@xxxxxxxxxxx wrote:
 > 	We have an application that takes sections of user generated HTML
 > files, embeds these sections into a large XML file, then later, when
 > requested, generates an HTML file from the XML and a XSL file (using XT).
 > Our users have started introducing funny characters into the HTML (OK, what
 > happens is they use Microsoft Word to introduce the funny characters and
 > Word does the conversion to HTML, and we end up with "&eacute;" or some such
 > in our HTML - then our XML)

If Word thinks it's producing HTML, then it's probably using only the
entities defined in the various HTML recommendations.  HTML 3.2
borrowed the ISO Latin-1 entity set.  HTML 4.0 got more adventurous,
and borrowed bits and pieces from ISO entity sets plus declared some
that ISO hasn't standardised.  In both cases, the entities are defined
in the respective recommendations and/or in the files that accompany
them, all of which are available from the W3C web site.

In both cases, you'll also have to do some massaging to turn the
entity declarations into XML.  For example, this declaration from
HTMLsymbol.ent in HTML 4.0:

<!ENTITY fnof     CDATA "&#402;" -- latin small f with hook = function
                                    = florin, U+0192 ISOtech -->

should become:

<!ENTITY fnof     "&#402;"><!-- latin small f with hook = function
                                    = florin, U+0192 ISOtech -->

since XML has no CDATA entities, and XML doesn't allow comments
inside other declarations.

I haven't looked, but presumably the XHTML Proposed Recommendation
has the XML versions of the HTML 4.0 entity declarations.

You can reference the entity set from your DTD, or from the internal
subset of your documents if you don't have a fully fledged DTD.
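
As a rough sketch of the internal subset route: the file name
"HTMLlat1x.ent" and the root element "doc" below are made up for
illustration, and the file is assumed to be a local copy of the Latin-1
entity set already converted to XML syntax.  A parameter entity is
declared to point at the file and is then referenced so that the
declarations in it are read in:

```xml
<?xml version="1.0"?>
<!DOCTYPE doc [
  <!-- Hypothetical file: entity set already converted to XML syntax -->
  <!ENTITY % HTMLlat1 SYSTEM "HTMLlat1x.ent">
  <!-- Pull in the declarations, including the one for eacute -->
  %HTMLlat1;
]>
<doc>caf&eacute;</doc>
```

This only works if the parser actually reads external parameter
entities, which a non-validating parser isn't required to do, so it is
worth checking how yours behaves.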

Regards,


Tony Graham
======================================================================
Tony Graham                            mailto:tgraham@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9632
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

