[xsl] Isolation levels (long and technical)

Subject: [xsl] Isolation levels (long and technical)
From: Colin Paul Adams <colin@xxxxxxxxxxxxxxxxxx>
Date: 18 Dec 2005 20:08:43 +0000

Following much discussion on the Saxon list about Saxon's
discard-document() extension function, the subject of the stability of
fn:collection() was raised on the public-qt-comments list.

After some discussion, Michael Kay made the following suggestion:

"Perhaps we can solve this as follows:

(a) we specify that doc() and collection() are stable by default (in SQL terms,
the default isolation level is SERIALIZABLE)
(b) we specify that implementations may provide an option to select a different
isolation level
(c) we specify that a call on doc() or collection() may fail if the
implementation cannot provide access to the requested resource with the
requested isolation level

This is anticipating a more comprehensive treatment of transactions and
isolation levels in a future version of the spec."

I have been experimenting with an implementation of this idea in
gexslt, and I thought I'd share what I have done, to invite
discussion.

I chose to implement the options by means of a user-defined data
element. This has several advantages over an extension function, not
the least being portability (an XSLT processor that doesn't recognize
a user-defined data element must simply ignore it, whereas an
unrecognized extension function will cause an error). This seems to me
of great importance for what is essentially an optimization hint: the
meaning (i.e. the result) of the transformation is the same in either
case, and only the performance should change (an error might result
from exhaustion of resources, but that is true of any
transformation). Of course, portability would be even greater if an
EXSLT standard could be agreed.

Anyway, the user-defined data element has the following syntax, at
present:

<!-- Category: declaration -->
<gexslt:isolation-levels>
  <!-- Content: (gexslt:collection*, gexslt:document*) -->
</gexslt:isolation-levels>

<gexslt:document
        href = uri-reference
        isolation-level = ( "read-uncommitted" | "read-committed" |
                            "repeatable-read"  | "serializable" )
/>

<gexslt:collection
        href = uri-reference
        isolation-level = ( "read-uncommitted" | "read-committed" |
                            "repeatable-read"  | "serializable" )
/>

The content elements set the isolation level for each document or
collection specified. As many content elements as required can be
specified.

I took the definition of the semantics of each isolation level from
SQL-92 (well, actually from the MS SQL Server documentation, which
claims to quote it).

Any use of fn:doc(), fn:collection() or fn:document() with a URI
not specified within gexslt:isolation-levels uses the default value of
serializable, which corresponds to the standard stable behaviour for
these functions (so the implementation will have to use some strategy
such as locking the nodes in memory for the duration of the
transformation).

I spent a significant amount of time thinking about what should happen
if multiple definitions were given for the same URI (within one URI
space - there are separate URI spaces for gexslt:document and
gexslt:collection). Note that there are no syntax restrictions on
having multiple gexslt:document/collection elements for the same URI
within a single gexslt:isolation-levels element, or across the set of
gexslt:isolation-levels elements (which might occur in separate
stylesheet modules).

In the end I decided that specifying two definitions for the same URI
(within a given URI space) would be treated as a static error, unless
the isolation-level was the same in every case. Otherwise the result
seems to be semantic nonsense.
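
As a rough illustration of this rule and of the serializable default
(a Python sketch only, not taken from the gexslt sources; the names
StaticError, merge_declarations and declared_level are hypothetical),
the declarations from all stylesheet modules could be merged like this:

```python
DEFAULT_LEVEL = "serializable"


class StaticError(Exception):
    """Hypothetical error type for conflicting declarations."""


def merge_declarations(declarations):
    """Merge (uri_space, href, level) triples into one lookup table.

    uri_space is "document" or "collection"; the two spaces are kept
    apart by making the space part of the key.
    """
    table = {}
    for uri_space, href, level in declarations:
        key = (uri_space, href)
        # Repeating the same level is harmless; a different one is an error.
        previous = table.setdefault(key, level)
        if previous != level:
            raise StaticError(
                f"{uri_space} {href}: both {previous} and {level} declared"
            )
    return table


def declared_level(table, uri_space, href):
    """Return the declared level, or serializable if none was declared."""
    return table.get((uri_space, href), DEFAULT_LEVEL)
```

Here a URI declared once, or several times with the same level, is
accepted, and a lookup for an undeclared URI falls back to serializable.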

For the implementation of fn:doc() (and therefore fn:document() too),
I chose to silently promote read-uncommitted to read-committed, and
repeatable-read to serializable, as I do not think I can sensibly
distinguish within these pairs for the input URI schemes that
gexslt/gestalt supports (file, data, http and ftp), although I did
wonder if read-committed should imply no HTTP caching. In the end I
rejected this thought as too problematic.
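
The promotion amounts to a four-entry mapping. A minimal Python
sketch (the names PROMOTION and effective_doc_level are hypothetical,
and this is not how gexslt itself is written):

```python
# Requested level -> level actually enforced for fn:doc().
PROMOTION = {
    "read-uncommitted": "read-committed",
    "read-committed": "read-committed",
    "repeatable-read": "serializable",
    "serializable": "serializable",
}


def effective_doc_level(requested):
    """Collapse the four requested levels onto the two that are enforced."""
    return PROMOTION[requested]
```

So only two behaviours are ever observed for fn:doc(): read-committed
and serializable.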

I also made the same decision for fn:collection() with file URIs, this
being the only URI scheme that I currently support with
fn:collection(). It interprets a file: URI ending with a / as a local
directory, attempts to parse all files within this directory as XML
files, and ignores all failures.

Obviously a URI scheme targeted at a relational database, for
instance, would distinguish all four isolation levels. Fortunately my
architecture allows for this - the isolation-level
checking/enforcement is done at the individual URI scheme level for
fn:collection() (whereas for fn:doc() I treat all schemes identically,
but this is not part of the definition of gexslt:isolation-levels).
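
The directory interpretation of a file: collection URI can be sketched
in Python as follows. This is only an illustration under assumptions of
my choosing: the function name load_file_collection is hypothetical,
subdirectories are skipped, and each document is keyed by the
collection URI plus the percent-encoded file name.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlsplit
from urllib.request import url2pathname


def load_file_collection(collection_uri):
    """Parse every file in the directory named by a file: URI ending in /."""
    parts = urlsplit(collection_uri)
    if parts.scheme != "file" or not parts.path.endswith("/"):
        raise ValueError("expected a file: URI ending with /")
    directory = url2pathname(parts.path)
    documents = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # assumption: subdirectories are not descended into
        try:
            documents[collection_uri + quote(name)] = ET.parse(path)
        except (ET.ParseError, OSError):
            continue  # files that fail to parse are silently ignored
    return documents
```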

There remains a tricky interaction between the two URI
spaces. Although the XPath 2.0 specification does not demand such an
interpretation (but it certainly doesn't forbid it), I have chosen to
link the two URI spaces in the following manner:

For a given file: collection URI, file:///a/b/c/, fn:collection()
assigns a document-uri of file:///a/b/c/file-name to each resulting
document node.
If the resulting file: URI is also accessed via fn:doc(), then the
isolation-levels must be specified compatibly, or else a dynamic error
is raised (actually, the specification of gexslt:isolation-levels is
in terms of isolation-levels defined, rather than accessed, but the
implementation is a little more relaxed in some cases).
I believe such a restriction is necessary to avoid nonsense, but of
course it is not foolproof - you can have two different URIs
referring to the same resource, in this case via a symbolic link, for
instance.
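
A Python sketch of this cross-check follows. It assumes that
"compatibly" means the two declared levels are equal, which is a
simplification of mine, and the names DynamicError and
check_compatible are hypothetical.

```python
DEFAULT_LEVEL = "serializable"


class DynamicError(Exception):
    """Hypothetical error type for incompatible isolation levels."""


def check_compatible(levels, collection_uri, file_name):
    """Return the document-uri for a collection member, or raise.

    levels maps ("collection" | "document", uri) to a declared level;
    undeclared URIs default to serializable.
    """
    document_uri = collection_uri + file_name
    collection_level = levels.get(("collection", collection_uri), DEFAULT_LEVEL)
    document_level = levels.get(("document", document_uri), DEFAULT_LEVEL)
    # Assumption: compatible means identical levels in both URI spaces.
    if collection_level != document_level:
        raise DynamicError(
            f"{document_uri}: {document_level} as a document, "
            f"{collection_level} via its collection"
        )
    return document_uri
```

Note that a collection declared as read-committed therefore conflicts
with any of its members that are left at the serializable default.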

Finally, testing the implementation shows, as I expected, that setting
an isolation-level of read-committed results in a slower
transformation than specifying serializable.
I say as expected - of course the intention of specifying a lower
isolation level (read-committed being lower than serializable) is
that better performance should result - that is the trade-off you make
when reducing the repeatability of the results. But in my test cases
I simply didn't have a large enough set of data to stress the
performance of serializable, so the need to parse the XML files twice
for read-committed dominated the timings.
I would welcome the donation of a large collection of XML files for
testing (I have 2GB RAM on this machine, so I think a total collection
size of at least 1/2 GB would be necessary to put the garbage
collector under strain).
-- 
Colin Adams
Preston Lancashire
