RE: [xsl] Static Validation of XSL Transformations

Subject: RE: [xsl] Static Validation of XSL Transformations
From: Anders Møller <amoeller@xxxxxxxx>
Date: Fri, 09 Mar 2007 10:31:01 +0100
Colin Adams wrote:
From: Anders Mxller <amoeller@xxxxxxxx>
>Given an XSLT 2.0 stylesheet, S, and two schemas, D_in and D_out, the
>tool is able to check statically that all output of S at runtime is
>valid according to D_out assuming that the input is valid according to
>D_in. Additionally, the tool produces a flow graph of S. Schemas are
>written in either DTD, XML Schema, or Restricted RELAX NG.
>A research paper describing the analysis is also available from the web

That was very interesting.

Does it cope with doc/document/collection functions? I couldn't work out from your paper how it would do that.

No, multiple inputs are currently not supported, but I believe that many typical cases could be handled with little effort.

Michael Kay wrote:
An interesting (and very thorough) paper.

Like Colin, I was wondering as I read it how you deal with multiple =
documents (and schemas). And indeed, multiple output documents. One of =
problems is that when you look at a template rule in isolation, you =
don't know whether it is writing part of the final result tree, or some
intermediate temporary tree whose structure is statically unknown.

We're not just looking at template rules in isolation - a central part of the analysis is to figure out the control flow between the templates.

Compared with Saxon's static analysis and type checking, you seem to be making a lot of what one might call "95% guesses". For example, you seem = to be assuming that if you see <xsl:template match=3D"x">, and there is an element x in the schema, then the actual element is going to be a valid instance of that type. (What if there's more than one x in the schema, a global one and a local one?) Saxon doesn't make this assumption; if you = want this kind of checking, you have to write match=3D"schema-element(x)". Of course the assumption is likely to be right 95% of the time, and for a "lint" kind of tool that's fine, but I don't feel it's appropriate for a production compiler. (We had a long debate about this at a W3C meeting = under the heading of "the assumption of validity").

Yes, our approach is a pragmatic program analysis. I view this as a supplement to static type checking, not as an alternative. And I'm also a big fan of static type checking :-)
The validator (based on dk.brics.schematools - see handles overloading (your example with more than one x) without problems. I'm not sure which assumptions you mean: the analysis is *sound* (i.e. over-approximating) - at least, that's the intention.

Clearly the more complex the stylesheet becomes, in terms of handling multiple inputs and temporary trees, the more likely it is that there = will be data around that for some good reason does NOT conform to the input schema.

The other technique that you're using which Saxon doesn't currently use =
to analyze across apply-templates calls. I think there's probably quite =
lot of mileage in doing this. In fact Saxon doesn't even analyze across
call-template or function calls unless you actually declare the type of =
result of the target template or function.

One thing that's interesting about the paper, I think, is to see how =
can be done simply with knowledge of the input and output schemas, =
any other changes to the stylesheet to incorporate type declarations and =
explicitly invoke validation. There's a lot to be said for this, because
XSLT programmers take a lot of re-educating to declare types of =
and parameters and the more you can do in the absence of such =
the better. I've been moving in similar directions with Saxon in some of =
optimizations that are done, for example trying to infer when

<xsl:variable name=3D"x"><xsl:value-of select=3D"EXP"/></xsl:variable>

can be safely rewritten as

<xsl:variable name=3D"x" select=3D"EXP"/>

by analyzing the expressions in which $x is used. Of course, when you're
doing optimization a 95% guess certainly isn't good enough, you need to =
absolutely sound. (Though you can use guesses to decide between =
evaluation strategies, of course, provided that the strategy still works =
the guess is wrong.)

It's interesting to note that you ask the user for two pieces of
information: the schema for the input document and the schema for the
result. In XSLT 2.0 we don't actually provide syntax to allow the user =
declare the schema for the input document. We tried to tackle that a =
of times but never quite found a solution that worked, but I'm really =
quite sure why not! It's partly of course that a complex stylesheet can =
multiple entry points and can be designed to do more than one job =
on the entry point that you choose. But again, that's the 5% case rather
than the 95%.

Thank you for your interesting comments! :-)

Anders Moeller

Current Thread