Subject: Re: [xsl] doctype declaration and msxmldom
From: Mike Brown <mike@xxxxxxxx>
Date: Thu, 19 Jun 2003 18:51:14 -0600 (MDT)
The post to which I'm replying had nothing directly to do with XSLT, but I feel compelled to respond, because the information in it is rife with errors, and because I'm obsessed with character encoding.

Nancy Pate wrote:
> I work with SGML. When you declare "DOCTYPE" the composing/processing
> engine is going to expect a DTD.

Do not try to divine the XML parsing model or the XSLT processing model just based on the default, apparent behavior of your favorite toolsets and their usually less-than-thorough documentation.

I don't know what the requirements are for SGML parsers, but XML parsers have much leeway as to when they are required to read a DTD, and what parts of the DTD they must read (for example, external parts are optional). More importantly, the XML parser's user has control over whether the parser tries to validate or not. And the parser (say, Expat) can be set to do things like read external entities but not external DTDs, allowing situations where you can still parse a document that contains an entity reference without a corresponding entity declaration, so long as the standalone declaration agrees.

Furthermore, XML document authors have flexibility in what they can do. For example, <!DOCTYPE blah> is legal even though it does not contain any DTD info at all.

> Can you declare the necessary encoding
> in the XML declaration (<?xml version="1.0" encoding="ISO-8858-1"?>)

ISO-8859-1. And an encoding declaration is an informative hint to the XML parser to tell it how the *bytes* of the document (think of what you see if you look at the document in a hex editor rather than a text editor) should be converted to Unicode characters as it is read in. There is only one correct encoding that you can declare: the one actually used for producing the bytes that comprise that particular document. It has to be accurate, or "close enough" in the case of, say, a US-ASCII encoded document being declared as UTF-8. You cannot just make it up.
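To make the point about bytes concrete, here is a small sketch using Python's standard xml.etree.ElementTree (which sits on top of Expat). The helper name parsed_text and the sample documents are made up for illustration, and other parsers may report a wrong declaration differently than this one does.

```python
import xml.etree.ElementTree as ET


def parsed_text(raw: bytes) -> str:
    """Parse a whole XML document from bytes and return the root element's text."""
    return ET.fromstring(raw).text


E_GRAVE = "\u00e8"  # small Latin letter e with grave accent

# Declaration matches the bytes: byte E8 is e-grave in ISO-8859-1.
latin1_text = parsed_text(
    b'<?xml version="1.0" encoding="ISO-8859-1"?><w>\xe8</w>'
)

# Same character, but encoded and declared as UTF-8 (bytes C3 A8).
utf8_doc = ('<?xml version="1.0" encoding="UTF-8"?><w>' + E_GRAVE + "</w>").encode("utf-8")
utf8_text = parsed_text(utf8_doc)

# "Close enough": pure US-ASCII bytes are also valid UTF-8.
ascii_as_utf8_text = parsed_text(
    b'<?xml version="1.0" encoding="UTF-8"?><w>e</w>'
)

# Made-up declaration, case 1: ISO-8859-1 bytes declared as UTF-8.
# A lone E8 byte is not valid UTF-8, so the parser rejects the document.
try:
    parsed_text(b'<?xml version="1.0" encoding="UTF-8"?><w>\xe8</w>')
    latin1_as_utf8_rejected = False
except ET.ParseError:
    latin1_as_utf8_rejected = True

# Made-up declaration, case 2: UTF-8 bytes declared as ISO-8859-1.
# This parses, but yields two wrong characters instead of one right one.
utf8_as_latin1_text = parsed_text(
    b'<?xml version="1.0" encoding="ISO-8859-1"?><w>\xc3\xa8</w>'
)
```

The second made-up case is the nastier one: nothing fails, the data is just silently wrong.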
> and then use the Unicode number?

"Using the Unicode number" in more correct terminology is "using a (numeric) character reference" like "&#232;" or "&#xE8;". By definition, a character reference always uses Unicode code points. So "&#232;" or "&#xE8;" are both referring to Unicode character number 232 (decimal), which happens to be the small Latin letter e with grave accent.

When using a character reference, the fact that the document was encoded with whatever encoding was used is irrelevant. &#232; always means Unicode character at code point 232, never "byte 232 in encoding XYZ", unless you are using that nonconformant abomination known as Netscape Navigator (or Communicator) version 4.

> I have a table that says that &egrave; has
> a UTC code of #x00E8

To hopefully clear up your confusion with more correct terminology...

The predefined HTML entity named "egrave" has as its replacement text the actual character number E8 (hexadecimal) of the Universal Character Set (UCS): small Latin letter e with grave accent.

You can more or less think of entities as text macros, although every document or binary 'file' is on some level an entity, so it's not a perfect analogy. Please try to distinguish between a named "entity reference" and a numeric "character reference" though. Then you can get creative and say "character entity reference" when you mean things like "&egrave;" so long as the egrave entity's replacement text is a single character.

The UCS is the normative basis of SGML, HTML, and XML, and is defined by ISO/IEC 10646, the international standard that assigns numbers to the idea of nearly every character used in nearly every written language script on the planet. This standard is often informally referred to as Unicode because it is developed in tandem with and shares its character assignments with The Unicode Standard, a more thorough but perhaps less political publication that does not fall under the ISO's jurisdiction.

UTC (what you said) means Greenwich/Zulu time zone, pretty much...
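For anyone who wants to check the character reference vs. entity reference distinction for themselves, here is a rough sketch with Python's standard xml.etree.ElementTree. The names parsed_text and doc_with_refs and the <w> test documents are invented for this example. It also assumes the parser expands entities declared in the internal DTD subset, which the Expat-based one in Python does by default.

```python
import xml.etree.ElementTree as ET


def parsed_text(raw: bytes) -> str:
    """Parse a whole XML document from bytes and return the root element's text."""
    return ET.fromstring(raw).text


def doc_with_refs(encoding: str) -> bytes:
    """Build an all-ASCII document that uses a decimal and a hex character reference."""
    xml_text = f'<?xml version="1.0" encoding="{encoding}"?><w>&#232;&#xE8;</w>'
    return xml_text.encode("ascii")


# Both references mean code point 232 (hex E8), whatever encoding is declared.
results = {
    enc: parsed_text(doc_with_refs(enc))
    for enc in ("US-ASCII", "ISO-8859-1", "UTF-8")
}

# A named entity reference only works if the entity is declared somewhere.
# Here it is declared in the internal DTD subset, with a one-character replacement text.
declared_text = parsed_text(
    b'<!DOCTYPE w [<!ENTITY egrave "&#232;">]><w>&egrave;</w>'
)

# Without any declaration, XML (unlike HTML) does not know "egrave".
try:
    parsed_text(b"<w>&egrave;</w>")
    undeclared_rejected = False
except ET.ParseError:
    undeclared_rejected = True
```

Each value in results is the same two-character string, e-grave twice, which is the whole point: the numbers in a character reference are code points, not bytes.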
-Mike

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list