Re: [xsl] Adding entity declarations to DOCTYPE in xml output

From: "Eliot Kimber ekimber@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Tue, 26 Feb 2019 21:52:28 -0000

That general requirement (simple string macros) can be satisfied using
XInclude, which is implemented by most, if not all, modern XML parsers.

XInclude has many limitations (it is not a true use-by-reference facility) but
it does have at least the same level of utility as text entities without
requiring the use of DTDs and without some of the problematic aspects of DTDs
(for example, you can choose to defer or ignore XInclude elements, which I
often want to do depending on the processing context).
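
To make the deferral point concrete, here is a minimal Python sketch using the
standard library's xml.etree.ElementInclude module. The fragment name
product-name.txt, its content, and the in-memory loader are all hypothetical,
chosen only so the example is self-contained; a real pipeline would resolve
hrefs against files or a repository.

```python
import xml.etree.ElementTree as ET
from xml.etree import ElementInclude

# Hypothetical string "macros", kept in memory so the sketch is self-contained.
MACROS = {"product-name.txt": "Acme Widget"}

def loader(href, parse, encoding=None):
    # Only text inclusion is handled here. Returning None for anything else
    # makes ElementInclude.include() raise FatalIncludeError.
    if parse == "text":
        return MACROS.get(href)
    return None

DOC = (
    '<para xmlns:xi="http://www.w3.org/2001/XInclude">Install '
    '<xi:include href="product-name.txt" parse="text"/> first.</para>'
)

root = ET.fromstring(DOC)

# Parsing alone expands nothing: the xi:include element is still in the tree.
pending = [child.tag for child in root]

# Expansion is a separate, explicit step that a pipeline can skip or postpone.
ElementInclude.include(root, loader=loader)
expanded = "".join(root.itertext())
```

Because expansion is an explicit call rather than a side effect of parsing,
the application decides when, and whether, the referenced content is pulled in.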

I could go further and say that the original SGML design of DTDs was entirely
misguided as well: it should never have been done that way, and it certainly
shouldn't have been carried into XML (again, I certainly argued *for* them at
the time). But that's easy for me to say now. At the time SGML was being
defined and implemented, the DTD syntax seemed perfectly sensible, and it took
a long time for us to recognize the inherent problems with DTDs as they exist
in SGML and XML.

In particular, because they are a purely syntactic mechanism, DTDs are a
security risk and provide no reliable declaration of the actual semantic
document type of the document that exhibits the DOCTYPE declaration.

Consider this example:

<!DOCTYPE foo [
  <!ENTITY gotcha SYSTEM "/usr/etc/.passwords">
]>
<foo>&gotcha;</foo>

Now load that into a CMS that shall remain nameless running as "root" and look
at the content that gets stored. Oops.
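
One common defensive measure (my illustration here, not something any
particular CMS is known to do) is to refuse documents that declare entities at
all before they reach storage. A minimal Python sketch using the standard
library's expat binding follows; the names reject_entities and EntityDeclared
are hypothetical.

```python
import xml.parsers.expat

class EntityDeclared(ValueError):
    """Raised when a document declares any entity in its DTD."""

def reject_entities(xml_text):
    """Parse xml_text and raise EntityDeclared at the first entity declaration."""
    def on_entity_decl(name, is_parameter, value, base,
                       system_id, public_id, notation):
        # Assumption: this pipeline has no legitimate use for entity declarations.
        raise EntityDeclared(f"entity {name!r} declared, system id {system_id!r}")

    parser = xml.parsers.expat.ParserCreate()
    parser.EntityDeclHandler = on_entity_decl
    parser.Parse(xml_text, True)

HOSTILE = """<!DOCTYPE foo [
  <!ENTITY gotcha SYSTEM "/usr/etc/.passwords">
]>
<foo>&gotcha;</foo>"""

try:
    reject_entities(HOSTILE)
    hostile_rejected = False
except EntityDeclared:
    hostile_rejected = True

# A document without a DTD passes through without raising.
reject_entities("<foo>plain content</foo>")
```

The handler fires when the declaration is read, so the document is rejected
before the entity reference in the content is ever reached.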

Or consider:

<!DOCTYPE notabook PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
                      "http://docbook.org/xml/4.5/docbookx.dtd" [
<!ELEMENT notabook
  (foo, bar?)
>
<!ELEMENT foo EMPTY >
<!ELEMENT bar EMPTY >
]>
<notabook>
<foo/>
</notabook>

Here the DOCTYPE appears to declare this to be a DocBook 4 book and many, if
not most, DTD-aware systems will use the public ID to bind this document to
its DocBook-specific configuration.

But this is clearly not a DocBook document (at least to a human observer).
Even so, an XML system that simply requires the document to be A) valid and
B) associated with a known external DTD will likely accept it without
complaint.

Thus, the DOCTYPE declaration tells you *nothing* actionable about the
document itself. The document is completely valid (assuming I didn't introduce
typos in the internal declaration subset), but the declaration is meaningless.
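
The mismatch is easy to observe programmatically. The following Python sketch
(the describe function is a hypothetical helper, and the document is a
well-formed copy of the example) reads the public ID and the root element name
with the standard library's expat binding, which by default does not fetch the
external DTD.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE notabook PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
                      "http://docbook.org/xml/4.5/docbookx.dtd" [
<!ELEMENT notabook (foo, bar?)>
<!ELEMENT foo EMPTY>
<!ELEMENT bar EMPTY>
]>
<notabook>
<foo/>
</notabook>"""

def describe(xml_text):
    """Return (public_id, root_element_name) for a document."""
    seen = {"public_id": None, "root": None}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        seen["public_id"] = public_id

    def on_start(name, attrs):
        # Record only the first element, which is the document root.
        if seen["root"] is None:
            seen["root"] = name

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return seen["public_id"], seen["root"]

public_id, root = describe(DOC)

# A system that keys its configuration on the public ID sees "DocBook",
# even though the root element is not a DocBook element.
looks_like_docbook = "DocBook" in public_id
```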

By having the grammar declared only by reference (using RELAX NG, XSD, or some
other grammar language) and by using namespaces to qualify at least one thing
in the document (as the DITA standard does with the @ditaarch:DITAArchVersion
attribute), the document is unalterably associated with the definition of the
thing it's supposed to be. That is, the namespace name and the URIs of any
associated grammars function as names of the "true type" of the document, as
opposed to just pointers to syntactic rules that guide parsing and validation.

Compare with:

<?xml-model href="http://docbook.org/xml/5.1/rng/docbook.rng"
schematypens="http://relaxng.org/ns/structure/1.0"?>
<?xml-model href="http://docbook.org/xml/5.1/rng/docbook.rng"
type="application/xml" schematypens="http://purl.oclc.org/dsdl/schematron"?>
<notabook>
<foo/>
</notabook>

This is clearly, and unambiguously, not a DocBook document. The model
references unalterably bind the document to governing schemas that will detect
the document's invalidity. The lack of the expected (and required) DocBook
namespace on the root element also exposes this as not being a DocBook
document.
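
The namespace check in particular is cheap to automate. A minimal Python
sketch, assuming the DocBook 5 namespace name http://docbook.org/ns/docbook
and a hypothetical helper named is_docbook5:

```python
import xml.etree.ElementTree as ET

DOCBOOK_NS = "http://docbook.org/ns/docbook"

def is_docbook5(xml_text):
    """True if the root element is in the DocBook 5 namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree spells qualified names as "{namespace}local-name".
    return root.tag.startswith("{" + DOCBOOK_NS + "}")

# The impostor from the example: schema references, but no namespace.
FAKE = (
    '<?xml-model href="http://docbook.org/xml/5.1/rng/docbook.rng" '
    'schematypens="http://relaxng.org/ns/structure/1.0"?>'
    "<notabook><foo/></notabook>"
)

# A hypothetical minimal document whose root is in the DocBook namespace.
REAL = (
    '<book xmlns="http://docbook.org/ns/docbook" version="5.1">'
    "<title>Example</title></book>"
)

fake_is_docbook = is_docbook5(FAKE)
real_is_docbook = is_docbook5(REAL)
```

This only tests the namespace of the root element; full conformance still
requires validating against the referenced schemas.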

Likewise, there is no simple syntactic macro expansion happening here, so the
security exposure is lower.

So not a fan of DTDs.
