Re: [xsl] XInclude as an XSLT transformation?

Subject: Re: [xsl] XInclude as an XSLT transformation?
From: "W. Eliot Kimber" <ekimber@xxxxxxxxxxxxxxxxxxx>
Date: Thu, 30 Dec 2004 10:11:36 -0600
Oleg Tkachenko wrote:

Colin Paul Adams wrote:

Now that XInclude is a recommendation, I took another look at it.

Since it is essentially a transformation on the XML Information Set, it
occurs to me it ought to be possible to write a generic XInclude
processor in XSLT.


Well, one can implement some XInclude basics in XSLT (and many did), but developing a conforming implementation would be not a piece of cake and can't be done with pure XSLT1 for sure.

If you're only doing resolution in memory and you don't care about non-XML inclusion, then you can do a nominally conforming implementation with 1.0, although it's easier to do if you can use functions (e.g., using Saxon's "XSLT 1.1" support).
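To make "resolution in memory, XML only" concrete, here is a minimal sketch in Python rather than XSLT. It is not a conforming processor: it handles only whole-document XML inclusion by href, with no xpointer, parse="text", or xi:fallback support, and it does nothing about ID collisions. The function name resolve_includes and the docs mapping (href to XML source) are my own illustrative assumptions.

```python
import xml.etree.ElementTree as ET

XI_NS = "http://www.w3.org/2001/XInclude"
XI_INCLUDE = "{%s}include" % XI_NS


def resolve_includes(elem, docs, active=()):
    """Replace xi:include descendants of elem, in place, with the root
    element of the document each one references.

    docs is a hypothetical in-memory store mapping href -> XML source.
    active holds the hrefs currently being resolved, to detect loops.
    """
    for index, child in enumerate(list(elem)):
        if child.tag == XI_INCLUDE:
            href = child.get("href")
            if href in active:
                raise ValueError("inclusion loop at " + href)
            target = ET.fromstring(docs[href])
            # Included documents may themselves contain xi:include elements.
            resolve_includes(target, docs, active + (href,))
            target.tail = child.tail  # keep text that followed the include
            elem[index] = target
        else:
            resolve_includes(child, docs, active)
    return elem


docs = {"module-b.xml": '<section id="s1"><title>Module B</title></section>'}
top = ET.fromstring(
    '<chapter xmlns:xi="%s"><title>Top</title>'
    '<xi:include href="module-b.xml"/></chapter>' % XI_NS
)
resolve_includes(top, docs)
```

After the call, the chapter element contains the section from module-b.xml in place of the xi:include element.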


One issue is how to treat IDs and references. My contention is that the XInclude spec is at least fuzzy and at worst just wrong on how IDs and references should be treated in the transcluded result.

The issue is that in the transcluded result the IDs must be unique (this is a basic requirement of XML). It's not clear, at least to my reading, whether or not the XInclude spec allows or requires ID values to be rewritten such that all IDs in the result are unique even if two input elements (from two different source documents) have the same ID value. Likewise, all references to those IDs have to be rewritten to maintain referential integrity in the result.

Since the XInclude spec is defined in terms of an infoset-to-infoset transform it's clear that referential integrity must be maintained, which can always be done (the reference property at the infoset level is a direct pointer, rather than the utterance of an ID value, so it doesn't matter what the syntactic component says).

The issue for me is twofold:

1. For authoring you must author addresses in terms of the locations of targets *as authored*, not as transcluded. The reason is simple: the location as authored is the only location you can reliably know at authoring time in the general case. For example, consider the case where you have two XML documents, Module A and Module B, which can be XIncluded into any number of top-level compound documents. If I need to author a link from an element in Module A to an element in Module B, the only thing I know *for sure* at authoring time is the location of Module B and the ID of the target element (assuming ID-based addresses to keep things simple for now). You can't know what compound documents these modules might be included into in the future, therefore you can't author the links in terms of the target's location or presence in some transcluded result.

What it comes down to for me is: if you are using XInclude only as a functional replacement for external parsed entities, then there's no issue, but that misses the whole point of being able to do true use-by-reference, which is to enable manageable re-use of information objects in different compound documents. As soon as you have the same element included in two different contexts, you must step up to issues of address creation, link authoring, and link management.

2. Trying to make element IDs unique across all documents is difficult and unnecessary. Each XML document establishes a unique ID name space and every element is uniquely identified by the pair of the document location and the ID within that document. Therefore, trying to make IDs unique across documents is unnecessary, because all elements *already have globally unique IDs*. We can depend on the storage object system (e.g., the local filesystem, a repository like Documentum, whatever) to ensure unique locations for each document. Ensuring globally unique element IDs would require additional infrastructure and business rules that are expensive to implement, difficult to enforce, and impossible to ensure in the general case (for example, what happens when two previously disjoint document sets are put together in a single repository space?).

Therefore, at least in the context of a technical document authoring environment, I see no alternative but to be prepared to rewrite all element IDs and references to them as part of the process of resolving XIncludes. My reading of the XInclude recommendation suggests that this is allowed within a conforming system, but in any case it is an unavoidable requirement for processability, so I don't really care if it's conforming or not--for me XInclude would be unusable without being able to do this.

One side effect of this reality is that the support code for a given document type must implement the ID and pointer rewriting logic, since only the document type can know for sure what attributes are IDs and references. This is partly because you can't depend on schema or DTD awareness to provide attribute data types and partly because there are no standards for declaratively indicating which attributes are identifiers and which are pointers (and what the pointer notation is). [This type of mechanism would require the level of infrastructure that we put into the HyTime standard, a degree of complexity that has in practice proved to be not worth the value--it's almost always easier to implement local, one-off solutions.]
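As a rough illustration of document-type-specific rewriting, the following Python sketch prefixes every ID and every same-document reference in a module before it is transcluded. Everything named here is an assumption for the example: the attribute names "id" and "idref", the per-module prefix scheme, and the function name rewrite_ids. It handles only bare ID-valued references within one module; references from one module into another would need the (document location, ID) pair and are not covered.

```python
import xml.etree.ElementTree as ET

# Assumption: this document type uses "id" for identifiers and "idref" for
# pointers. A real document type would supply its own attribute lists.
ID_ATTS = ("id",)
REF_ATTS = ("idref",)


def rewrite_ids(root, prefix):
    """Prefix every ID and same-document reference under root, in place.

    Because IDs and references from one module get the same prefix,
    references inside the module still resolve after rewriting.
    """
    for elem in root.iter():
        for name in ID_ATTS + REF_ATTS:
            value = elem.get(name)
            if value is not None:
                elem.set(name, prefix + value)
    return root


# Two modules that both happen to use the ID "intro".
module_a = ET.fromstring('<section id="intro"><xref idref="intro"/></section>')
module_b = ET.fromstring('<section id="intro"/>')
rewrite_ids(module_a, "m1.")
rewrite_ids(module_b, "m2.")
```

The prefixes here are arbitrary; in practice they would have to be derived from something that is unique per included document, such as its location.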

The other issue I have with XInclude is that it provides no direct mechanism for specialization. That is, unlike XLink, there is no standard way to have specialized element types that function as xi:include elements. For example, if you want to be able to control in a schema where references to specific element types are allowed your only choice is to have specialized element types that correspond to the element types you want to reference. Here is a typical system I use:

<!ELEMENT chapter
  (title,
   ((intro,
     (section |
      section_inclusion)+)|
    section_body))
>

<!ELEMENT section_inclusion
  EMPTY
>
<!ATTLIST section_inclusion
  %link_target_atts;
  %pub_def_nsatt;
  reftype
    CDATA
    #FIXED "pub:section"
  class
    CDATA
    #FIXED "- xi:include "
  xmlns:xi
    CDATA
    #FIXED "http://www.w3.org/2001/XInclude"
>

Here a chapter conceptually contains one or more section elements. In order to constrain XIncludes to just sections within Chapter I have the section-specific inclusion element "section_inclusion", which can occur only where "section" is allowed.

On the section_inclusion element, I use two attributes to further describe the purpose and constraint of the inclusion element:

reftype=

This attribute declares what element types the inclusion element is allowed to reference. The value is an XSLT match pattern. In this example, only "pub:section" is allowed, but if there were several specialized forms of section, you might have something like "pub:section | formal_procedure | more_info_section" or whatever. At the XSLT level it's easy to implement this check using an "eval" extension. In an XML editor like Arbortext's Epic Editor it's easy to implement the constraint when providing user interfaces for selecting inclusion targets, for example.

class=

This attribute declares that section_inclusion is a specialization of the XInclude include element. The syntax is adapted from the DITA spec and is simply a fully-qualified name (DITA uses a different syntax but the intent and effect is the same).

Given these attributes a generic XInclude processor can simply look for either xi:include elements or elements specialized from xi:include and apply the transclusion processing to them.
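The two checks described above might be sketched as follows, again in Python rather than XSLT. This is only an approximation under stated assumptions: it tests for the literal token "xi:include" in the class attribute instead of resolving the prefix against xmlns:xi, it assumes the #FIXED attribute values are actually present on the parsed element (ElementTree does not apply DTD defaults), and it understands reftype only as a "|"-separated list of element names, not a full match pattern. The function names and the "urn:example:pub" namespace URI are hypothetical.

```python
import xml.etree.ElementTree as ET

XI_INCLUDE = "{http://www.w3.org/2001/XInclude}include"


def is_inclusion(elem):
    """True for xi:include itself or an element specialized from it."""
    return elem.tag == XI_INCLUDE or "xi:include" in elem.get("class", "").split()


def allowed_tags(reftype, nsmap):
    """Expand a '|'-separated list of names into ElementTree tag names."""
    tags = set()
    for name in reftype.split("|"):
        name = name.strip()
        if ":" in name:
            prefix, local = name.split(":", 1)
            tags.add("{%s}%s" % (nsmap[prefix], local))
        elif name:
            tags.add(name)
    return tags


def target_allowed(inclusion, target, nsmap):
    """Check the target element against the inclusion's reftype, if any."""
    reftype = inclusion.get("reftype")
    return reftype is None or target.tag in allowed_tags(reftype, nsmap)


# Hypothetical prefix binding for the "pub" prefix used in reftype.
NSMAP = {"pub": "urn:example:pub"}
inclusion = ET.fromstring(
    '<section_inclusion class="- xi:include " reftype="pub:section"/>'
)
section = ET.fromstring('<section xmlns="urn:example:pub"/>')
table = ET.fromstring('<table xmlns="urn:example:pub"/>')
```

With these definitions, is_inclusion(inclusion) holds, target_allowed accepts the section element, and it rejects the table element.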

You can find my paper on this subject, which goes into more detail and provides sample XSLT code, on the XML Europe 2004 site at:

http://www.idealliance.org/papers/dx_xmle04/index/author/7f4a5280a580c0707f36f54ec4.html

Cheers,

E.
--
W. Eliot Kimber
Professional Services
Innodata Isogen
9390 Research Blvd, #410
Austin, TX 78759
(512) 372-8122

ekimber@xxxxxxxxxxxxxxxxxxx
www.innodata-isogen.com
