Subject: [xsl] Isolation levels (long and technical)
From: Colin Paul Adams <colin@xxxxxxxxxxxxxxxxxx>
Date: 18 Dec 2005 20:08:43 +0000

Following much discussion on the Saxon list about Saxon's discard-document() extension function, the subject of the stability of fn:collection() was raised on the public-qt-comments list. After some discussion, Michael Kay made the following suggestion:

"Perhaps we can solve this as follows:

(a) we specify that doc() and collection() are stable by default (in SQL terms, the default isolation level is SERIALIZABLE)

(b) we specify that implementations may provide an option to select a different isolation level

(c) we specify that a call on doc() or collection() may fail if the implementation cannot provide access to the requested resource with the requested isolation level

This is anticipating a more comprehensive treatment of transactions and isolation levels in a future version of the spec."

I have been experimenting with an implementation of this idea in gexslt, and I thought I'd share what I have done, to invite discussion.

I chose to implement the options by means of a user-defined data element. This has several advantages over an extension function, not the least being portability (an XSLT processor that doesn't recognize a user-defined data element must simply ignore it, whereas an unrecognized extension function will cause an error). This seems to me of great importance for what is essentially an optimization hint: the meaning (i.e. the result) of the transformation is the same in either case, and only the performance should change (although an error might result due to exhaustion of resources, but this is true for any transformation). Of course, portability would be even greater if an EXSLT standard could be agreed.

Anyway, the user-defined data element has the following syntax, at present:

  <!-- Category: declaration -->
  <gexslt:isolation-levels>
    <!-- Content: (gexslt:collection*, gexslt:document*) -->
  </gexslt:isolation-levels>

  <gexslt:document
    href = uri-reference
    isolation-level = ( "read-uncommitted" | "read-committed" | "repeatable-read" | "serializable" ) />

  <gexslt:collection
    href = uri-reference
    isolation-level = ( "read-uncommitted" | "read-committed" | "repeatable-read" | "serializable" ) />

The content elements set the isolation level for each document or collection specified. As many content elements as required can be specified. I took the definition of the semantics of each isolation level from SQL-92 (well, actually from the MS SQL Server documentation, which claims to quote it).

Any use of fn:doc(), fn:collection() or fn:document() with a URI not specified within gexslt:isolation-levels uses the default value of serializable, which corresponds to the standard stable behaviour for these functions (so the implementation will have to use some strategy such as locking the nodes in memory for the duration of the transformation).

I spent a significant amount of time thinking about what should happen if multiple definitions were given for the same URI (within one URI space; there are separate URI spaces for gexslt:document and gexslt:collection). Note that there are no syntax restrictions on having multiple gexslt:document or gexslt:collection elements for the same URI within a single gexslt:isolation-levels element, or across the set of gexslt:isolation-levels elements (which might occur in separate stylesheet modules). In the end I decided that specifying two definitions for the same URI (within a given URI space) would be treated as a static error, unless the isolation-level was the same in every case. Otherwise the result seems to be semantic nonsense.

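The duplicate-definition rule and the serializable default can be sketched in a few lines of Python. This is only an illustration of the rule as described above, not gexslt's code: the namespace URI is a placeholder, and the names collect_levels, level_for and StaticError are invented for the sketch.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI for the gexslt prefix (an assumption for this sketch).
GEXSLT_NS = "urn:example:gexslt-extension"
DEFAULT_LEVEL = "serializable"
LEVELS = {"read-uncommitted", "read-committed", "repeatable-read", "serializable"}


class StaticError(Exception):
    """Stands in for the static error raised on conflicting definitions."""


def collect_levels(declarations):
    """Merge gexslt:isolation-levels elements gathered from all stylesheet modules.

    Returns {"document": {uri: level}, "collection": {uri: level}}, keeping
    the two URI spaces separate.
    """
    spaces = {"document": {}, "collection": {}}
    for decl in declarations:
        for child in decl:
            # ElementTree expands qualified names to "{namespace}local".
            ns, _, local = child.tag.rpartition("}")
            if ns != "{" + GEXSLT_NS or local not in spaces:
                continue
            uri = child.get("href")
            level = child.get("isolation-level")
            if level not in LEVELS:
                raise StaticError(f"unknown isolation level: {level!r}")
            # Repeating a definition is fine only if the level is identical.
            previous = spaces[local].setdefault(uri, level)
            if previous != level:
                raise StaticError(
                    f"conflicting levels for {local} {uri}: {previous} vs {level}"
                )
    return spaces


def level_for(spaces, space, uri):
    """URIs with no definition fall back to the default, serializable."""
    return spaces[space].get(uri, DEFAULT_LEVEL)


module = ET.fromstring(f"""
<gexslt:isolation-levels xmlns:gexslt="{GEXSLT_NS}">
  <gexslt:collection href="file:///a/b/c/" isolation-level="read-committed"/>
  <gexslt:document href="file:///a/b/d.xml" isolation-level="repeatable-read"/>
</gexslt:isolation-levels>""")

spaces = collect_levels([module])
print(level_for(spaces, "collection", "file:///a/b/c/"))  # read-committed
print(level_for(spaces, "document", "file:///x.xml"))     # serializable
```

Passing a second module that redefines file:///a/b/d.xml as read-committed in the document space would raise StaticError, while the same href in the collection space would not, since the two spaces are checked independently.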
For the implementation of fn:doc() (and therefore fn:document() too), I chose to silently promote read-uncommitted to read-committed, and repeatable-read to serializable, as I do not think I can sensibly distinguish within these pairs for the input URI schemes that gexslt/gestalt supports (file, data, http and ftp), although I did wonder if read-committed should imply no HTTP caching. In the end I rejected this thought as too problematic.

I also made the same decision for fn:collection() with file URIs, this being the only URI scheme that I currently support with fn:collection(). It interprets a file: URI ending with a / as a local directory, attempts to parse all files within this directory as XML files, and ignores all failures. Obviously a URI scheme targeted at a relational database, for instance, would distinguish all four isolation levels. Fortunately my architecture allows for this: the isolation-level checking/enforcement is done at the individual URI scheme level for fn:collection() (whereas for fn:doc() I treat all schemes identically, but this is not part of the definition of gexslt:isolation-levels).

There remains a tricky interaction between the two URI spaces. Although the XPath 2.0 specification does not demand such an interpretation (but it certainly doesn't forbid it), I have chosen to link the two URI spaces in the following manner:

For a given file: collection URI, file:///a/b/c/, fn:collection() assigns a document-uri of file:///a/b/c/file-name to each resulting document node. If the resulting file: URI is also accessed via fn:doc(), then the isolation-levels must be specified compatibly, or else a dynamic error is raised (actually, the specification of gexslt:isolation-levels is in terms of isolation-levels defined, rather than accessed, but the implementation is a little more relaxed in some cases).

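A rough Python sketch of the promotion and of the link between the two URI spaces follows. It is not gexslt's code. In particular, it assumes that "specified compatibly" means "the same effective level after promotion", which is one possible reading rather than something the description above states, and the names effective_level, member_document_uri, check_member and DynamicError are invented for the sketch.

```python
from urllib.parse import quote

# Promotion described above for schemes that cannot tell these pairs apart.
PROMOTED = {
    "read-uncommitted": "read-committed",
    "read-committed": "read-committed",
    "repeatable-read": "serializable",
    "serializable": "serializable",
}
DEFAULT_LEVEL = "serializable"


class DynamicError(Exception):
    """Stands in for the dynamic error raised on incompatible levels."""


def effective_level(declared):
    """Level actually enforced for file, data, http and ftp URIs."""
    return PROMOTED[declared]


def member_document_uri(collection_uri, file_name):
    """document-uri given to a file found in a directory collection."""
    if not collection_uri.endswith("/"):
        raise ValueError("a directory collection URI must end with '/'")
    return collection_uri + quote(file_name)


def check_member(document_levels, collection_levels, collection_uri, file_name):
    """Check one collection member that is also accessed via fn:doc().

    ASSUMPTION: "compatible" is read as "equal after promotion".
    Both arguments map URI -> declared level; missing URIs use the default.
    """
    doc_uri = member_document_uri(collection_uri, file_name)
    coll = effective_level(collection_levels.get(collection_uri, DEFAULT_LEVEL))
    doc = effective_level(document_levels.get(doc_uri, DEFAULT_LEVEL))
    if coll != doc:
        raise DynamicError(f"{doc_uri}: collection is {coll}, document is {doc}")
    return doc_uri, coll


documents = {"file:///a/b/c/x.xml": "read-uncommitted"}
collections = {"file:///a/b/c/": "read-committed"}
print(check_member(documents, collections, "file:///a/b/c/", "x.xml"))
```

Here the document's read-uncommitted is promoted to read-committed, so it matches the collection. With no document entry at all, the document would default to serializable and the same call would raise DynamicError under the assumption stated above.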
I believe such a restriction is necessary to avoid nonsense, but of course it is not foolproof: you can have two different URIs referring to the same resource, in this case via a symbolic link, for instance.

Finally, testing the implementation shows, as I expected, that setting an isolation-level of read-committed results in a slower transformation than specifying serializable. I say "as expected", although of course the intention of specifying a lower isolation level (read-committed being lower than serializable) is that better performance should result; that is the trade-off you make when reducing the repeatability of the results. But in my test cases I simply didn't have a large enough set of data to stress the performance of serializable, so the need to parse the XML files twice for read-committed dominated the timings.

If anyone would like to donate a large collection of XML files for testing, please get in touch (I have 2GB RAM on this machine, so I think a total collection size of at least 1/2 GB would be necessary to bring the garbage collector under strain).

-- 
Colin Adams
Preston Lancashire