RE: XPath's role (Was: Re: [xsl] Re: . in for)

Subject: RE: XPath's role (Was: Re: [xsl] Re: . in for)
From: "Michael Kay" <michael.h.kay@xxxxxxxxxxxx>
Date: Sun, 6 Jan 2002 20:12:39 -0000
Kevin Jones:

> I tend to agree that the justification for the flow control
> additions, 'for'
> and to a lesser extent 'if', is somewhat suspect in an
> expression languages.

Attached, for information, is a position paper which was used during the WG
deliberations to justify the decision. It was drafted by me under the
direction of the XSL WG, and accepted by the group.

To summarise the argument as a one-liner: XPath is indeed an expression
language, and it needs to be a complete expression language over the data
model, which means it must be able to process sequences as well as scalars.


Contents:
1. Conditional expressions in XPath
1.1 Requirement
1.2 Composability

2. Sequences in XPath
2.1 Requirement

3. FOR expression in XPath
3.1 Requirement
3.2 Composability
3.3 Are range variables needed?

1. Conditional expressions in XPath
1.1 Requirement

The published XPath 2.0 requirements statement [1] includes an explicit
requirement for conditional expressions to be added to the language. The
requirement is classified as a "must": see section 2.2 of the document for
some use cases.

The requirement can be traced back to Appendix G of the XSLT 1.0
specification, which lists "conditional expressions" as its first bullet
point in the list of features to be considered for a future release.

Why are conditional expressions needed?

In XSLT 2.0, it is possible to execute instructions conditionally by using
<xsl:choose>. This provides the closest equivalent to a conditional
expression. It allows one to write:

   <xsl:variable name="x">
      <xsl:choose>
      <xsl:when test="$a=1"><xsl:value-of select="$b"/></xsl:when>
      <xsl:otherwise><xsl:value-of select="$c"/></xsl:when>
      </xsl:choose>
   </xsl:variable>

There are four drawbacks to this solution.

(a) it is verbose, and tedious both to read and to write
(b) the value of the variable is always a tree, never (say) a string or a
number. Fortunately XSLT allows a tree to be implicitly converted to a
string or a number, so this is not a serious drawback in practice, except
when you want a data type other than a string or a number - notably, when
you want a node-set. (The constructed tree can contain copies of nodes in
the source tree, but of course it can't contain the original nodes)
(c) this approach can't be used to calculate a value that depends on the
context node, which you need to do for example in a sort key or in a
predicate within a path expression
(d) the code is difficult to optimize; it's not easy to recognize the
situations where it is possible to avoid the overhead of actually
constructing a tree in memory. For example, the semantics require that if
this code is invoked twice, it creates two trees with identical content but
distinct identity - in other words, the code is not a pure function, it has
side-effects.

Various workarounds have been invented to cope with these limitations, and
users are often advised to use these techniques when they raise queries on
help forums:

For returning a node-set conditionally, the construct

    $debits[$condition] | $credits[not($condition)]

is used. This selects every node in $debits if the condition is true, and
every node in $credits if the condition is false.

For returning a string conditionally, the construct

   concat( substring($x, number(not($condition))*string-length($x)+1),
               substring($y, number($condition)*string-length($y)+1))

is used. This returns the string $x if $condition is true, and $y if
$condition is false.

The fact that people have gone to the lengths of inventing such constructs
and promoting them on FAQ sites is to my mind adequate evidence that there
is a need for conditional expressions in XPath.

1.2 Composability

Are conditional expressions needed everywhere in XPath? There has been a
suggestion they should only be allowed "at the top level" (whatever that
means). I can't see any justification for any restrictions. Wherever a
value is needed, a conditional expression might be needed.

2. Sequences in XPath

2.1 Requirement

The XPath 2.0 requirements statement [1] includes an explicit requirement
for sequences to be added to the language. (4.4: should add list data type
to the type system of the expression language). This appears in the section
of the document dealing with support for XML Schema, and is justified
primarily because schema allows elements and attributes to have a sequence
as their typed-value. It's worth noting that the requirement as stated is
confined to sequences of simple values, it doesn't extend to sequences of
nodes.

2.1a Requirement for Sequences of Simple-Values

Sequences of simple values are needed primarily because Schema supports
them. We want to use the typed value of elements and attributes because
this gives greater semantic knowledge than just using the string value; and
the
typed value may be a sequence. Therefore we need sequences in the data
model.

Sequences of simple values also often arise as intermediate values in
calculations. The classic example is the requirement to total the aggregate
value of price times quantity over a set of elements. We first need to form
the sequence (actually, the bag) of values of price times quantity, then to
total the values in this sequence (see XPath 2.0 requirement 2.5: "should
support aggregation functions over collection-valued expressions"). Similar
situations occur frequently in stylesheets attempting to generate SVG
graphics or to calculate how many records to output before a page break.
Sequences of strings often arise when tokenizing data formatted with spaces
or commas as separators.

The XSLT 1.0 solution to such problems is usually to generate the sequence
of values as text nodes on a temporary tree. This is viable (given the
widely-available xx:node-set() extension function), but inconvenient.
Creating a tree has substantial overheads compared with a simple sequence:
the strings or numbers need to be wrapped up in element nodes, each of which
needs to support operations on its identity, document order, all the axes,
name, namespace, base URI, and so on, none of which is actually needed for
the task in hand.

The availability of sequences of integers also solves another common XSLT
requirement, the need to iterate round a loop a fixed number of times:

    <xsl:for-each select="1 to 5"><td/></xsl:for-each>

In XSLT 1.0 this requires either a recursive named template, or the "trick"
solution

   <xsl:for-each select="(//node())[position() &lt;=
5]"><td/></xsl:for-each>


2.1b Requirement for Sequences of Nodes

XSLT 1.0 can process nodes in sorted order (that is, an order other than
document order), but it cannot save a sorted list of nodes in a variable.
This creates a problem whenever there is a need to do further manipulation
of the sorted list, for example numbering it, or selecting the top n items
in sorted order.

Again, a common solution is to use temporary trees. However, these cannot
contain the original nodes, only copies of the nodes. This leads to
overheads (creating objects is expensive), and means that the full context
of the original nodes is not available when processing the copies.

The need to hold a sorted list of nodes in a variable does not arise very
frequently (which is why it isn't explicit in the published requirements
statement), but when it does arise, the lack of this capability can
currently lead to gross distortions of the stylesheet logic.

3. FOR expression in XPath

3.1 Requirement

The requirement for a FOR expression in XPath is not explicit in the
published requirements statement, rather it is implicit in the need to
support sequences.

The XSLT <xsl:for-each> and <xsl:apply-templates/> instructions provide the
ability to iterate over an existing sequence. They do not allow a new
sequence to be constructed. This is because, like all XSLT instructions,
they create nodes on trees. The problem is thus exactly the same as with
<xsl:choose>: it gives iterative processing, but not a mapping expression.
Of course, many problems can be solved without a mapping expression, by
iterating over the supplied sequence and performing a calculation on each
item in the sequence as it is processed. But this isn't always possible: the
classic example is again forming the aggregate of "price times quantity".
Without the ability to construct a sequence representing the values of
price times quantity, there is no way to solve this problem. Of course, it
can be
solved by constructing a temporary tree rather than a sequence, but we have
already seen the disadvantages of that approach.

So the canonical use case for a FOR expression in XPath is

   sum( for $i in //item return $i/@price * $i/@quantity )

Note that this particular example maps a set of nodes to a sequence of
numbers (assuming that the nodes are dereferenced to numbers by virtue of
being used in the sum() function). Mapping of values to values, values to
nodes, or nodes to nodes can also be useful. Here are examples of each:

Values to values:

  average( for $w in $words return string-length($w) )

Values to nodes:

  for $p in $params return //item[@name=$p]

Nodes to nodes:

  for $o in $order/item/@product-code return //product[@code=$o]

(Note that it this last example, we could have used a path expression, but
then we would have got the products in the sequence they appear in the
source document, not the sequence they are listed in the order-items within
a particular order.)

3.2 Composability

Do we need FOR expressions anywhere in the language, or should there be
restrictions?

In general, composability is a good thing. We should only introduce
restrictions with very good reasons.

There are, of course, expressions that don't work either for data type
reasons or because operators are not defined over all input values. In a
language without implicit type conversions, (3+"a") and (5 div 0) will both
return errors. There are also expressions in most languages that will
produce useless or surprising results.

In general, I think it's bad language design to try to deal with these
problems by means of grammatical restrictions on what constructs can be used
where. Firstly, it's hard to write such restrictions in a way that can't be
easily circumvented, e.g. by putting the offending expression in brackets or
hiding it in a function call. The underlying restriction (e.g. that division
by zero is an error) has to be enforced dynamically, so most languages make
no attempt to detect it statically.

There seem to be two reasons for wanting to restrict the composability of
FOR expressions. One is grammatical, to do with difficulties in parsing: if
those objections are valid, then we should fix the syntax, rather than
restricting the functionality. The other is the problem that FOR generates a
sequence, and using the result in a context where duplicates are eliminated
may cause user surprises. The obvious such constructs are the union operator
"|" and the path operator "/". Equally, one might ask whether it makes sense
to use FOR in a context that expects a singleton, for example as an operand
of the "+" operator, or as an argument to the name() function. Should a FOR
expression be allowed to appear in such contexts?

I would argue that it should.

Firstly, the user might know that a particular FOR expression is going to
produce a singleton. In most such cases, it would be possible to use a path
expression instead: I can write

   //item[@code=$x]
rather than
  for $i in //item return if ($i/@code=$x) then $i else ()

But since these two expressions are exactly equivalent, why should I allow
one and disallow the other? (Such restrictions make life very difficult for
software that is generating queries.)

In any case, path expressions have many restrictions not shared by FOR
expressions: they can't use range variables, which restricts their ability
to do joins, and they can't operate on sequences of simple values.
Therefore, it seems wrong to provide contexts in which path expressions are
allowed but FOR expressions aren't.

Secondly, the user might know that the FOR expression is going to produce a
set of distinct nodes in document order. Again, we shouldn't force him to
write a path expression rather than a FOR expression if that's what he
prefers.

In summary, if we restrict composability it might enable us to detect some
user mistakes, but it's also going to be a great irritant to users who are
trying to do perfectly legitimate things.

3.3 Are range variables needed?

If we allow some kind of FOR expression (that is, an expression that maps a
sequence to another sequence), do we need range variables, or can we rely
on the XPath 1.0 mechanism of a single implicit range variable, "."?

I've come to the conclusion that we do need to allow range variables, though
there are definitely pros and cons: introducing range variables into XPath
is without doubt a step increase in the complexity of the language that we
need to consider very carefully.

Range variables are needed to do general joins. To put it another way, they
are needed to make the language "relationally complete", equivalent in
power to first order predicate calculus. The inability to do general joins
in
XPath 1.0 has been the subject of criticism in the past, mainly from
theorists.

In practice, joins don't arise that often in XPath 1.0, because most
processing follows the hierarchy of the source document; and where joins
are needed, they are usually simple equijoins, so the implicit existential
semantics of "=" solve the problem. The availability of sequences in XPath
2.0 changes this, as does the increased reliance on operators that don't
have existential semantics, such as compare() which is needed to compare
two strings using a named collation.

With sequences, it will be much more common to process several collections
(trees or sequences) at the same time and do joins between them. It's easy
to find use cases where the single "." is obviously inadequate, for example
you can't rewrite:

   for $i in 1 to 10 return $items[$i]

without a range variable. In this case you could write

   $items[position() < 11]

but what about

   for $i in (count($items) to count($items) - 5) return $items[$i]

Here we are not returning output in document order, so we can't fall back
on the boolean test against position().

There are plenty of other examples. A more obvious join example is to find
duplicates in two files, where duplication is tested under a collation:

for $i in document('doc1.xml')//item,
    $j in document('doc2.xml')//item
    return
      if (compare($i/@code, $j/@code, 'case-folded-collation"))
      then $i else ()

(This use case points towards adding WHERE as well, but I'll steer clear of
that...)

So I think range variables are needed; but I have an open mind on whether
range variables should be mandatory or optional.

Mike Kay

[1]: XPath 2.0 Requirements http://www.w3.org/TR/xpath20req




 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list


Current Thread