Re: [xsl] doctype declaration and msxmldom

Subject: Re: [xsl] doctype declaration and msxmldom
From: Mike Brown <mike@xxxxxxxx>
Date: Thu, 19 Jun 2003 18:51:14 -0600 (MDT)
The post to which I'm replying had nothing directly to do with XSLT, but I
feel compelled to respond, because the information in it is rife with errors,
and because I'm obsessed with character encoding.

Nancy Pate wrote:
> I work with SGML.  When you declare "DOCTYPE" the composing/processing
> engine is going to expect a DTD.

Do not try to divine the XML parsing model or the XSLT processing model
just based on the default, apparent behavior of your favorite toolsets and
their usually less-than-thorough documentation.

I don't know what the requirements are for SGML parsers, but XML parsers have
much leeway as to when they are required to read a DTD, and what parts of the
DTD they must read (for example, external parts are optional).

More importantly, the XML parser's user has control over whether the parser
tries to validate or not. And the parser (say, Expat) can be set to do things
like read external entities but not external DTDs, allowing situations where
you can still parse a document that contains an entity reference without a
corresponding entity declaration, so long as the standalone declaration
agrees.
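
As a rough illustration of that last point, here is a small sketch using
Python's pyexpat binding to Expat with its default settings (no external DTD
is fetched). The file name "doc.dtd" and the entity name "e" are made up for
the example, and nothing is ever read from disk.

```python
from xml.parsers import expat


def parse_with_default_handler(xml_text):
    """Parse xml_text and return everything Expat hands to the default handler."""
    seen = []
    parser = expat.ParserCreate()
    parser.DefaultHandler = seen.append
    parser.Parse(xml_text, True)
    return "".join(seen)


# No DOCTYPE at all: the parser knows "e" cannot be declared anywhere,
# so the reference is a well-formedness error.
try:
    parse_with_default_handler("<doc>&e;</doc>")
    no_doctype_error = None
except expat.ExpatError as exc:
    no_doctype_error = str(exc)

# An external subset is named (hypothetical "doc.dtd") but not read:
# the declaration might live there, so the reference is passed through
# unexpanded instead of being rejected.
passed_through = parse_with_default_handler(
    '<!DOCTYPE doc SYSTEM "doc.dtd"><doc>&e;</doc>'
)
```

Other parsers, and Expat with other settings, may behave differently; the
point is only that this is configurable behavior, not a fixed rule.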

Furthermore, XML document authors have flexibility in what they can do. For
example, <!DOCTYPE blah> is legal even though it does not contain any DTD info
at all.
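
A quick way to convince yourself, sketched here with Python's standard
ElementTree module (which sits on top of Expat); the element content is
invented for the example:

```python
import xml.etree.ElementTree as ET

# A DOCTYPE that names the root element but supplies no DTD information.
root = ET.fromstring("<!DOCTYPE blah><blah>text</blah>")
root_tag = root.tag
root_text = root.text
```

The document parses without complaint, and no DTD is looked for.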

>  Can you declare the necessary encoding
> in the XML declaration (<?xml version="1.0" encoding="ISO-8858-1"?>)

ISO-8859-1. And an encoding declaration is an informative hint to the XML
parser to tell it how the *bytes* of the document (think of what you see if
you look at the document in a hex editor rather than a text editor) should be
converted to Unicode characters as it is read in.

There is only one correct encoding that you can declare: the one actually used
for producing the bytes that comprise that particular document. It has to be
accurate, or "close enough" in the case of, say, a US-ASCII encoded document
being declared as UTF-8. You cannot just make it up.
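
Here is a minimal sketch of what that means in practice, using Python's
standard ElementTree module. The element name "w" is arbitrary; byte E8 is
how ISO-8859-1 encodes the e with grave accent.

```python
import xml.etree.ElementTree as ET

# Bytes really produced with ISO-8859-1, and declared as such.
latin1_doc = '<?xml version="1.0" encoding="ISO-8859-1"?><w>\u00e8</w>'.encode(
    "iso-8859-1"
)
honest_text = ET.fromstring(latin1_doc).text

# The very same bytes with a made-up declaration: E8 followed by "<" is
# not a valid UTF-8 sequence, so the parser rejects the document.
mislabeled_doc = latin1_doc.replace(b"ISO-8859-1", b"UTF-8")
try:
    ET.fromstring(mislabeled_doc)
    mislabeled_parsed = True
except ET.ParseError:
    mislabeled_parsed = False

# The "close enough" case: pure US-ASCII bytes declared as UTF-8.
ascii_doc = b'<?xml version="1.0" encoding="UTF-8"?><w>e</w>'
ascii_text = ET.fromstring(ascii_doc).text
```

A wrong declaration does not always fail this loudly; depending on the bytes
involved you may just get the wrong characters.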

> and then use the Unicode number?

"Using the Unicode number", in more correct terminology, is
"using a (numeric) character reference" like "&#232;" or "&#xE8;".

By definition, a character reference always uses Unicode code points.
So "&#232;" and "&#xE8;" both refer to Unicode character number 232
(decimal), which happens to be the small Latin letter e with grave accent.
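
If you want to check the arithmetic yourself, a couple of lines of Python
will do (the variable names are just for illustration):

```python
import unicodedata

decimal_form = 232       # as in &#232;
hexadecimal_form = 0xE8  # as in &#xE8;

same_code_point = decimal_form == hexadecimal_form
character = chr(decimal_form)
character_name = unicodedata.name(character)
```

Here character_name comes out as LATIN SMALL LETTER E WITH GRAVE.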

When using a character reference, the fact that the document was encoded with
whatever encoding was used is irrelevant. &#232; always means Unicode
character at code point 232, never "byte 232 in encoding XYZ", unless you are
using that nonconformant abomination known as Netscape Navigator (or
Communicator) version 4.
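
A small sketch of that independence, again with Python's ElementTree. The
helper name text_of and the element name "w" are invented; the three
encodings are ones Expat understands natively.

```python
import xml.etree.ElementTree as ET


def text_of(encoding):
    """Parse a document that declares `encoding` and contains both reference forms."""
    doc = f'<?xml version="1.0" encoding="{encoding}"?><w>&#232;&#xE8;</w>'
    # Only ASCII characters appear in the document itself, so these bytes
    # are accurate under all three declarations used below.
    return ET.fromstring(doc.encode("ascii")).text


results = {enc: text_of(enc) for enc in ("US-ASCII", "ISO-8859-1", "UTF-8")}
```

Every entry in results is the same two-character string, e with grave accent
twice, whatever the declared encoding.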

>  I have a table that says that &egrave; has
> a UTC code of #x00E8

To hopefully clear up your confusion with more correct terminology...

The predefined HTML entity named "egrave" has as its replacement text the
actual character number E8 (hexadecimal) of the Universal Character Set (UCS):
small Latin letter e with grave accent. 

You can more or less think of entities as text macros, although every document
or binary 'file' is on some level an entity, so it's not a perfect analogy.
Please try to distinguish between a named "entity reference" and a numeric
"character reference" though. Then you can get creative and say "character
entity reference" when you mean things like "&egrave;" so long as the egrave
entity's replacement text is a single character.
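
To make the distinction concrete, here is a sketch in Python. It looks up the
HTML entity in the standard html.entities table, then declares an entity of
the same name in an internal DTD subset so that a plain XML parser can expand
it; the element name "w" is arbitrary.

```python
import html
from html.entities import name2codepoint
import xml.etree.ElementTree as ET

# HTML's predefined "egrave" entity maps to code point E8 (hex).
egrave_code_point = name2codepoint["egrave"]
html_expansion = html.unescape("&egrave;")

# XML has no predefined "egrave", so declare it, then use both an entity
# reference (&egrave;) and a character reference (&#232;).
doc = '<!DOCTYPE w [<!ENTITY egrave "&#232;">]><w>&egrave; &#232;</w>'
xml_text = ET.fromstring(doc).text
```

Both references end up as the same character; they just get there by
different routes.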

The UCS is the normative basis of SGML, HTML, and XML, and is defined by
ISO/IEC 10646, the international standard that assigns numbers to the idea of
nearly every character used in nearly every written language script on the
planet. This standard is often informally referred to as Unicode because it is
developed in tandem with and shares its character assignments with The Unicode
Standard, a more thorough but perhaps less political publication that does not
fall under the ISO's jurisdiction.

UTC (what you said) means Greenwich/Zulu time zone, pretty much...

-Mike

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

