Re: [xsl] XSLT/XPath 2.0 (was "Identifying two tags...")

Subject: Re: [xsl] XSLT/XPath 2.0 (was "Identifying two tags...")
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Mon, 13 May 2002 19:49:09 +0100

Hi Stuart,

Just a couple of clarifications to your description of the cosmology
of XML...

> The base XML specification describes a data model, a serialization
> specification, and the DTD. The XML data model is a tree of elements
> and their attributes. The serialization specification says how data
> in the data model is written to a file. The document type definition
> (DTD) provides a coarse mechanism for writing metadata, that is data
> describing the XML document data.

It might be important to distinguish between the kind of metadata that
the DTD provides and what's normally called metadata -- information
about the content of the document. The DTD describes the markup used
in the document rather than the content of the document itself.

> The data in an XML document can be accessed programmatically through
> APIs like SAX and DOM. Those APIs are exposing the information set,
> or infoset, represented by the data in the document with all
> artifacts of the serialization removed.

Yes, although neither SAX nor DOM technically exposes the XML Infoset,
since they have their own data models which aren't precisely the same.
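
A minimal sketch of that difference, using Python's standard xml.sax
and xml.dom.minidom modules. The sample document and the helper names
are made up for illustration. SAX reports the document as a stream of
callbacks while DOM builds a tree of nodes, so the same information
arrives through two different models:

```python
import xml.sax
from xml.dom import minidom

# Illustrative document, not taken from the thread.
DOC = '<order id="42"><item>tea</item></order>'


class NameCollector(xml.sax.ContentHandler):
    """Records element names as SAX start-element events arrive."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def sax_element_names(text):
    # SAX model: a sequence of events pushed to a handler.
    handler = NameCollector()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.names


def dom_element_names(text):
    # DOM model: an in-memory tree that is queried after parsing.
    doc = minidom.parseString(text)
    return [el.tagName for el in doc.getElementsByTagName("*")]


print(sax_element_names(DOC))
print(dom_element_names(DOC))
```

Both helpers list the same element names here, but neither API hands
you Infoset information items as such.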

> XML Schema is a replacement for the DTD metadata. With XML Schema we
> can associate programmatic type with content. The last three decades
> of progress in software engineering can be characterized as the
> promotion of type to a first-class programming concept --
> programming by type lies at the heart of object-oriented and
> component based programming. XML Schema also provides precise
> control over the data model, such as specifying the exact contents
> of an element or the range of occurrences allowed.
>
> All that is required of an XML parser is that it verify that the XML
> document it is parsing is well-formed. It can disregard the DTD
> entirely. That is called a non-validating parser. A validating
> parser must read the document's metadata (DTD or Schema) and verify
> that the XML document it is parsing is well-formed and valid
> according to the metadata.

Even a non-validating parser can't ignore the DTD entirely. Of
non-validating parsers, the XML 1.0 Rec (second edition) says:

  Definition: While [non-validating processors] are not required to
  check the document for validity, they are required to process all
  the declarations they read in the internal DTD subset and in any
  parameter entity that they read, up to the first reference to a
  parameter entity that they do not read; that is to say, they must
  use the information in those declarations to normalize attribute
  values, include the replacement text of internal entities, and
  supply default attribute values.

Thus the information set that you get when a DTD is present can be
different from the information set that you get for the same document
when the DTD is absent, even if you're not using a validating
processor. In particular, you might get attributes in the result that
you wouldn't get if the DTD were absent; if those attributes are
namespace declarations, that can change the entire meaning of the
document to a namespace-aware processor such as an XSLT processor.
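
Here is a small sketch of that effect, assuming Python's standard
xml.etree.ElementTree module, which sits on the non-validating expat
parser. The memo document, the entity, and the helper name are
invented for the example. The only difference between the two inputs
is the internal DTD subset:

```python
import xml.etree.ElementTree as ET

# Same element, once with an internal DTD subset and once without.
WITH_DTD = """<!DOCTYPE memo [
  <!ATTLIST memo status CDATA "draft">
  <!ENTITY co "Example Corp">
]>
<memo>From &co;</memo>"""

WITHOUT_DTD = "<memo>From Example Corp</memo>"


def attributes_seen(text):
    """Return the attributes the parser reports on the root element."""
    return dict(ET.fromstring(text).attrib)


# The parser does not validate, yet it still applies the declarations
# it reads: the default attribute is supplied and the internal entity
# is replaced by its text.
print(attributes_seen(WITH_DTD))
print(attributes_seen(WITHOUT_DTD))
print(ET.fromstring(WITH_DTD).text)
```

The first call reports a status attribute that never appears in the
instance itself; the second reports no attributes at all.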

> If I use the XML parser to validate an XML document against XML
> Schema metadata, then the data exposed by SAX or DOM is the PSVI,
> the Post Schema Validation Infoset. That says that the data in the
> XML document is valid according to the metadata. In the PSVI, type
> has already been associated with the data; all data in the document
> is valid against the metadata.

A couple of things here. Thus far, there is no API that supports the
PSVI, so any information that you get from a parser will come through
a proprietary interface. It's likely that a schema validator will
exist on top of a SAX or DOM parser.

The other thing is that just because you get a PSVI it doesn't mean
that the result is valid. One of the PSVI properties that a validator
adds to each element and attribute is a flag to say whether it's valid
or not, and another flag to say whether it's been validated or not.
It's possible to have a PSVI that's completely invalid, or one that's
partially valid, or one in which only particular element values are
associated with a data type. You can't make any guarantees.
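
To make that concrete, here is a rough sketch in Python. Since there
is no standard PSVI API, the PsviItem class and the trusted_type
helper are purely hypothetical; only the two flags and their values
(valid/invalid/notKnown and full/partial/none) follow the [validity]
and [validation attempted] properties in XML Schema Part 1:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PsviItem:
    """Hypothetical stand-in for one element or attribute in a PSVI."""

    name: str
    value: str
    validity: str = "notKnown"          # "valid", "invalid" or "notKnown"
    validation_attempted: str = "none"  # "full", "partial" or "none"
    type_name: Optional[str] = None


def trusted_type(item):
    """Return the type name only if the item was fully validated and valid."""
    if item.validation_attempted == "full" and item.validity == "valid":
        return item.type_name
    return None


checked = PsviItem("price", "9.99", "valid", "full", "xs:decimal")
broken = PsviItem("price", "cheap", "invalid", "full", "xs:decimal")
skipped = PsviItem("note", "hello")

for item in (checked, broken, skipped):
    print(item.name, item.value, trusted_type(item))
```

An application that builds typed objects from a PSVI would need a
check of this kind on every item rather than assuming validity.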

> When programming with XML, we can define programmatic types based on
> XML Schema metadata. Then when we get data from the PSVI, we know
> that it is valid against the metadata and we can use that data to
> construct an object of that type, and do so without error. My data
> and all parts in it conform to the requirements of the type.
>
> Let's imagine that we want to write an XSLT processor. An XSLT
> processor does not work with documents: it uses an XML parser to
> parse the documents and present them through an API like DOM or SAX.
> The XSLT processor only sees infosets. If we use a validating XML
> parser and the documents have metadata in XML Schema, all the data
> that the XSLT processor sees is PSVI. So it makes sense that the
> specifications should be written in terms of the data provided to
> the application.

True, but lots of people aren't actually concerned about the types of
the information held in their documents -- as far as they're
concerned, it's just data. To these people, it doesn't make sense to
burden an XSLT processor with a schema validator (schema validation
being a laborious process) when they don't get any added value from
the PSVI over the normal XML Infoset.

> Returning to your observation, "the dependency on the complexities
> of XML Schema gives me precious little benefit, compared with the
> headaches...", I am trying to make the case that the benefit of the
> specification making use of PSVI is that XSLT implementers can
> program to that requirement and thus produce XSLT processors that
> are interchangeable. To specify otherwise would let XSLT processor
> behavior diverge, which would spell chaos. Count that as a big
> benefit to you.

Divergence in behaviour isn't always a bad thing. XML makes the
distinction between validating and non-validating processors. In
practice, most processors are validating, but the fact that there was
the possibility of constructing something quickly and easily to parse
XML was a major benefit during early development. I think it's
important to have different levels of conformance, so that there's an
opportunity for a wide array of processors to be developed to satisfy
different parts of the market.

> XSLT 1.0 and XPath 1.0 became W3C Recommendations in November 1999.
> XML Schema became a Recommendation in May 2001. That explains why
> XSLT 1.0 makes no reference to Schemas or the PSVI. But with XML
> Schema now in place as a cornerstone of XML technology, it is
> important to make the XSLT 2.0 and XPath 2.0 specifications
> consistent with the XML Schema.

I agree that this is important for the market. I think people will be
increasingly dissatisfied by the fact that having a schema for their
document doesn't give them any real benefit, and they're likely to
blame the things processing the document rather than the schema
language! ;)

On the other hand, I think that XPath/XSLT needs to keep an eye on the
future, and how XML Schema (and schema language use in general) might
develop. I'd prefer to see XPath/XSLT more loosely coupled with XML
Schema, such that different schema languages could be used to provide
the PSVI on which the XPath data model is based. With the current
design for the data model, and type derivation through names rather
than structure, I'm cautiously optimistic that this would be feasible.

> Does that make XSLT harder to describe? Not really, unless you plan
> to use the XSLT specifications as a textbook. If you think that XSLT
> should not be hard to describe, why not write about it yourself? The
> next great book on XSLT is waiting to be written.

I think that the issue about complexity is more a concern that however
good a writer you are, it might be impossible to explain the
underlying heuristics that govern the way the language works. For
example, how do you tell which of the built-in types from XML Schema
can have values created with a constructor? You can't use any
knowledge you might have of the type hierarchy or how types in XML
Schema are subdivided; instead, you have to learn the list. How do you
tell whether you can use the context node as a default argument to a
function? Well, there's an easy rule here -- if it was defined in
XPath 1.0 then you can, if it was defined in XPath 2.0 then you can't.
But as someone who isn't familiar with XPath 1.0, how do you know
which functions fall into which categories? My personal view is that
ironing out these inconsistencies and perhaps more importantly
presenting the information in a way that describes the heuristics
("all string functions can have an optional collation argument
that...") will aid the casual reader *and* the implementer no end.

I'll also add that implementers are people too, and don't usually
possess some magickal ability to understand specifications no matter
how technical. Part of the reason that there's quite a variety of
behaviour in XML Schema implementations is that the spec's so hard to
read that the implementers can't figure out what they're supposed to
be doing. What's more, it's meant that there are lots of places in XML
Schema where there are internal contradictions that nobody (not even
the writers) spotted because there's very little overall sense of how
it fits together.



Jeni Tennison
