Subject: RE: XPath's role (Was: Re: [xsl] Re: . in for) From: "Michael Kay" <michael.h.kay@xxxxxxxxxxxx> Date: Sun, 6 Jan 2002 20:12:39 -0000 |
Kevin Jones: > I tend to agree that the justification for the flow control > additions, 'for' > and to a lesser extent 'if', is somewhat suspect in an > expression languages. Attached, for information, is a position paper which was used during the WG deliberations to justify the decision. It was drafted by me under the direction of the XSL WG, and accepted by the group. To summarise the argument as a one-liner: XPath is indeed an expression language, and it needs to be a complete expression language over the data model, which means it must be able to process sequences as well as scalars. Contents: 1. Conditional expressions in XPath 1.1 Requirement 1.2 Composability 2. Sequences in XPath 2.1 Requirement 3. FOR expression in XPath 3.1 Requirement 3.2 Composability 3.3 Are range variables needed? 1. Conditional expressions in XPath 1.1 Requirement The published XPath 2.0 requirements statement [1] includes an explicit requirement for conditional expressions to be added to the language. The requirement is classified as a "must": see section 2.2 of the document for some use cases. The requirement can be traced back to Appendix G of the XSLT 1.0 specification, which lists "conditional expressions" as its first bullet point in the list of features to be considered for a future release. Why are conditional expressions needed? In XSLT 2.0, it is possible to execute instructions conditionally by using <xsl:choose>. This provides the closest equivalent to a conditional expression. It allows one to write: <xsl:variable name="x"> <xsl:choose> <xsl:when test="$a=1"><xsl:value-of select="$b"/></xsl:when> <xsl:otherwise><xsl:value-of select="$c"/></xsl:when> </xsl:choose> </xsl:variable> There are four drawbacks to this solution. (a) it is verbose, and tedious both to read and to write (b) the value of the variable is always a tree, never (say) a string or a number. Fortunately XSLT allows a tree to be implicitly converted to a string or a number, so this is not a serious drawback in practice, except when you want a data type other than a string or a number - notably, when you want a node-set. (The constructed tree can contain copies of nodes in the source tree, but of course it can't contain the original nodes) (c) this approach can't be used to calculate a value that depends on the context node, which you need to do for example in a sort key or in a predicate within a path expression (d) the code is difficult to optimize; it's not easy to recognize the situations where it is possible to avoid the overhead of actually constructing a tree in memory. For example, the semantics require that if this code is invoked twice, it creates two trees with identical content but distinct identity - in other words, the code is not a pure function, it has side-effects. Various workarounds have been invented to cope with these limitations, and users are often advised to use these techniques when they raise queries on help forums: For returning a node-set conditionally, the construct $debits[$condition] | $credits[not($condition)] is used. This selects every node in $debits if the condition is true, and every node in $credits if the condition is false. For returning a string conditionally, the construct concat( substring($x, number(not($condition))*string-length($x)+1), substring($y, number($condition)*string-length($y)+1)) is used. This returns the string $x if $condition is true, and $y if $condition is false. The fact that people have gone to the lengths of inventing such constructs and promoting them on FAQ sites is to my mind adequate evidence that there is a need for conditional expressions in XPath. 1.2 Composability Are conditional expressions needed everywhere in XPath? There has been a suggestion they should only be allowed "at the top level" (whatever that means). I can't see any justification for any restrictions. Wherever a value is needed, a conditional expression might be needed. 2. Sequences in XPath 2.1 Requirement The XPath 2.0 requirements statement [1] includes an explicit requirement for sequences to be added to the language. (4.4: should add list data type to the type system of the expression language). This appears in the section of the document dealing with support for XML Schema, and is justified primarily because schema allows elements and attributes to have a sequence as their typed-value. It's worth noting that the requirement as stated is confined to sequences of simple values, it doesn't extend to sequences of nodes. 2.1a Requirement for Sequences of Simple-Values Sequences of simple values are needed primarily because Schema supports them. We want to use the typed value of elements and attributes because this gives greater semantic knowledge than just using the string value; and the typed value may be a sequence. Therefore we need sequences in the data model. Sequences of simple values also often arise as intermediate values in calculations. The classic example is the requirement to total the aggregate value of price times quantity over a set of elements. We first need to form the sequence (actually, the bag) of values of price times quantity, then to total the values in this sequence (see XPath 2.0 requirement 2.5: "should support aggregation functions over collection-valued expressions"). Similar situations occur frequently in stylesheets attempting to generate SVG graphics or to calculate how many records to output before a page break. Sequences of strings often arise when tokenizing data formatted with spaces or commas as separators. The XSLT 1.0 solution to such problems is usually to generate the sequence of values as text nodes on a temporary tree. This is viable (given the widely-available xx:node-set() extension function), but inconvenient. Creating a tree has substantial overheads compared with a simple sequence: the strings or numbers need to be wrapped up in element nodes, each of which needs to support operations on its identity, document order, all the axes, name, namespace, base URI, and so on, none of which is actually needed for the task in hand. The availability of sequences of integers also solves another common XSLT requirement, the need to iterate round a loop a fixed number of times: <xsl:for-each select="1 to 5"><td/></xsl:for-each> In XSLT 1.0 this requires either a recursive named template, or the "trick" solution <xsl:for-each select="(//node())[position() <= 5]"><td/></xsl:for-each> 2.1b Requirement for Sequences of Nodes XSLT 1.0 can process nodes in sorted order (that is, an order other than document order), but it cannot save a sorted list of nodes in a variable. This creates a problem whenever there is a need to do further manipulation of the sorted list, for example numbering it, or selecting the top n items in sorted order. Again, a common solution is to use temporary trees. However, these cannot contain the original nodes, only copies of the nodes. This leads to overheads (creating objects is expensive), and means that the full context of the original nodes is not available when processing the copies. The need to hold a sorted list of nodes in a variable does not arise very frequently (which is why it isn't explicit in the published requirements statement), but when it does arise, the lack of this capability can currently lead to gross distortions of the stylesheet logic. 3. FOR expression in XPath 3.1 Requirement The requirement for a FOR expression in XPath is not explicit in the published requirements statement, rather it is implicit in the need to support sequences. The XSLT <xsl:for-each> and <xsl:apply-templates/> instructions provide the ability to iterate over an existing sequence. They do not allow a new sequence to be constructed. This is because, like all XSLT instructions, they create nodes on trees. The problem is thus exactly the same as with <xsl:choose>: it gives iterative processing, but not a mapping expression. Of course, many problems can be solved without a mapping expression, by iterating over the supplied sequence and performing a calculation on each item in the sequence as it is processed. But this isn't always possible: the classic example is again forming the aggregate of "price times quantity". Without the ability to construct a sequence representing the values of price times quantity, there is no way to solve this problem. Of course, it can be solved by constructing a temporary tree rather than a sequence, but we have already seen the disadvantages of that approach. So the canonical use case for a FOR expression in XPath is sum( for $i in //item return $i/@price * $i/@quantity ) Note that this particular example maps a set of nodes to a sequence of numbers (assuming that the nodes are dereferenced to numbers by virtue of being used in the sum() function). Mapping of values to values, values to nodes, or nodes to nodes can also be useful. Here are examples of each: Values to values: average( for $w in $words return string-length($w) ) Values to nodes: for $p in $params return //item[@name=$p] Nodes to nodes: for $o in $order/item/@product-code return //product[@code=$o] (Note that it this last example, we could have used a path expression, but then we would have got the products in the sequence they appear in the source document, not the sequence they are listed in the order-items within a particular order.) 3.2 Composability Do we need FOR expressions anywhere in the language, or should there be restrictions? In general, composability is a good thing. We should only introduce restrictions with very good reasons. There are, of course, expressions that don't work either for data type reasons or because operators are not defined over all input values. In a language without implicit type conversions, (3+"a") and (5 div 0) will both return errors. There are also expressions in most languages that will produce useless or surprising results. In general, I think it's bad language design to try to deal with these problems by means of grammatical restrictions on what constructs can be used where. Firstly, it's hard to write such restrictions in a way that can't be easily circumvented, e.g. by putting the offending expression in brackets or hiding it in a function call. The underlying restriction (e.g. that division by zero is an error) has to be enforced dynamically, so most languages make no attempt to detect it statically. There seem to be two reasons for wanting to restrict the composability of FOR expressions. One is grammatical, to do with difficulties in parsing: if those objections are valid, then we should fix the syntax, rather than restricting the functionality. The other is the problem that FOR generates a sequence, and using the result in a context where duplicates are eliminated may cause user surprises. The obvious such constructs are the union operator "|" and the path operator "/". Equally, one might ask whether it makes sense to use FOR in a context that expects a singleton, for example as an operand of the "+" operator, or as an argument to the name() function. Should a FOR expression be allowed to appear in such contexts? I would argue that it should. Firstly, the user might know that a particular FOR expression is going to produce a singleton. In most such cases, it would be possible to use a path expression instead: I can write //item[@code=$x] rather than for $i in //item return if ($i/@code=$x) then $i else () But since these two expressions are exactly equivalent, why should I allow one and disallow the other? (Such restrictions make life very difficult for software that is generating queries.) In any case, path expressions have many restrictions not shared by FOR expressions: they can't use range variables, which restricts their ability to do joins, and they can't operate on sequences of simple values. Therefore, it seems wrong to provide contexts in which path expressions are allowed but FOR expressions aren't. Secondly, the user might know that the FOR expression is going to produce a set of distinct nodes in document order. Again, we shouldn't force him to write a path expression rather than a FOR expression if that's what he prefers. In summary, if we restrict composability it might enable us to detect some user mistakes, but it's also going to be a great irritant to users who are trying to do perfectly legitimate things. 3.3 Are range variables needed? If we allow some kind of FOR expression (that is, an expression that maps a sequence to another sequence), do we need range variables, or can we rely on the XPath 1.0 mechanism of a single implicit range variable, "."? I've come to the conclusion that we do need to allow range variables, though there are definitely pros and cons: introducing range variables into XPath is without doubt a step increase in the complexity of the language that we need to consider very carefully. Range variables are needed to do general joins. To put it another way, they are needed to make the language "relationally complete", equivalent in power to first order predicate calculus. The inability to do general joins in XPath 1.0 has been the subject of criticism in the past, mainly from theorists. In practice, joins don't arise that often in XPath 1.0, because most processing follows the hierarchy of the source document; and where joins are needed, they are usually simple equijoins, so the implicit existential semantics of "=" solve the problem. The availability of sequences in XPath 2.0 changes this, as does the increased reliance on operators that don't have existential semantics, such as compare() which is needed to compare two strings using a named collation. With sequences, it will be much more common to process several collections (trees or sequences) at the same time and do joins between them. It's easy to find use cases where the single "." is obviously inadequate, for example you can't rewrite: for $i in 1 to 10 return $items[$i] without a range variable. In this case you could write $items[position() < 11] but what about for $i in (count($items) to count($items) - 5) return $items[$i] Here we are not returning output in document order, so we can't fall back on the boolean test against position(). There are plenty of other examples. A more obvious join example is to find duplicates in two files, where duplication is tested under a collation: for $i in document('doc1.xml')//item, $j in document('doc2.xml')//item return if (compare($i/@code, $j/@code, 'case-folded-collation")) then $i else () (This use case points towards adding WHERE as well, but I'll steer clear of that...) So I think range variables are needed; but I have an open mind on whether range variables should be mandatory or optional. Mike Kay [1]: XPath 2.0 Requirements http://www.w3.org/TR/xpath20req XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
Current Thread |
---|
|
<- Previous | Index | Next -> |
---|---|---|
Re: XPath's role (Was: Re: [xsl] Re, Jeni Tennison | Thread | Re: XPath's role (Was: Re: [xsl] Re, Jeni Tennison |
[xsl] RE: Re: . in for, Dimitre Novatchev | Date | Re: [xsl] Re: Assignment no, dynami, Terje Norderhaug |
Month |