[xsl] Content constructors and sequences

Subject: [xsl] Content constructors and sequences
From: Jeni Tennison <jeni@xxxxxxxxxxxxxxxx>
Date: Wed, 9 Jan 2002 08:55:24 +0000
Hi,

I'd greatly appreciate comments on the following; I'll post to
xsl-editors@xxxxxx and www-xpath-comments@xxxxxx if the comments here
don't point out a glaring flaw.

Please post if you think it's a good idea, as well as if you think
it's a bad one, particularly if you can think of ways of improving the
strength of the argument.

Thanks,

Jeni

---

Executive summary
-----------------

Rather than XPath being continuously extended to allow it to do what
XSLT can already do, XSLT should be modified to support the thing that
it can't already do: sequence construction. This could be achieved by
amending the definition of content constructors in XSLT 2.0 and
introducing a new xsl:item instruction. This change would make XSLT
more consistent and more usable.


Contents
--------

1.  Requirement
2.  Sequence constructors
3.  Producing simple typed values and existing nodes
4.  Impact on XPath
5.  Impact on function definitions
6.  Impact on variable bindings
7.  Allowing rootless nodes
8.  Impact on result tree generation
9.  Conclusions
10. References


Requirement
-----------

Yesterday, David C. posted a message to www-xpath-comments@xxxxxx that
described how XPath is restricted by the lack of a general
variable-binding expression (let clause) [1].

I think that the lack of a let clause restricts what's practical in
XPath (even if it doesn't affect what's theoretically possible). With
the for expression, for instance, you have to reconstruct any sequence
that you create within the expression each time you use it, which
probably isn't particularly efficient and leads to maintenance
headaches. For example:

  for $o in $orders
  return if (count($o/item[(@price * @quantity) > 100]) > 5)
         then do:something($o/item[(@price * @quantity) > 100])
         else do:something-else($o/item[(@price * @quantity) > 100])

The way around this is with functions, because then you can use
xsl:variable to assign the variable:

  for $o in $orders
  return do:process-items($o)

and:

<xsl:function name="do:process-items">
  <xsl:param name="order" />
  <xsl:variable name="items"
                select="$order/item[(@price * @quantity) > 100]" />
  <xsl:result select="if (count($items) > 5)
                      then do:something($items)
                      else do:something-else($items)" />
</xsl:function>

but it's hardly ideal.

The same kind of problem occurs within an if expression, when certain
variables are relevant within one branch of the if and not in the
other. For example:

  if ($string and $keyword)
  then if ((starts-with($string, $keyword) or
            ends-with(substring-before($string, $keyword), ' ')) and
           (not(substring-after($string, $keyword)) or
            starts-with(substring-after($string, $keyword), ' ')))
       then (substring-before($string, $keyword),
             $keyword,
             substring-after($string, $keyword))
       else $string
  else ()

which could be managed with:

  if ($string and $keyword)
  then (for $before in substring-before($string, $keyword),
            $after  in substring-after($string, $keyword)
        return if ((starts-with($string, $keyword) or
                    ends-with($before, ' ')) and
                   (not($after) or starts-with($after, ' ')))
               then ($before, $keyword, $after)
               else $string)
  else ()

but which would be much clearer (and more accurate, since you're not
really iterating) as:

  if ($string and $keyword)
  then (let $before := substring-before($string, $keyword),
            $after  := substring-after($string, $keyword)
        return if ((starts-with($string, $keyword) or
                    ends-with($before, ' ')) and
                   (not($after) or starts-with($after, ' ')))
               then ($before, $keyword, $after)
               else $string)
  else ()

Again, you could create a function to do the testing, but if we have
to generate new functions every time we want to bind variables, we're
going to have them coming out of our ears.
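
To make the intended behaviour of the keyword test concrete, here is a
rough model of it in Python rather than XPath. It is only a sketch: the
function names are hypothetical, and substring_before/substring_after
are stand-ins for the XPath functions of the same name. It keeps the
starts-with($string, $keyword) test from the first version, because
substring-before() also returns an empty string when the keyword does
not occur at all.

```python
# Illustrative Python model of the XPath logic; not XSLT or XPath itself.

def substring_before(s: str, k: str) -> str:
    """Stand-in for XPath substring-before(): "" when k does not occur in s."""
    i = s.find(k)
    return s[:i] if i >= 0 else ""


def substring_after(s: str, k: str) -> str:
    """Stand-in for XPath substring-after(): "" when k does not occur in s."""
    i = s.find(k)
    return s[i + len(k):] if i >= 0 else ""


def split_on_keyword(string: str, keyword: str) -> list:
    """Return [before, keyword, after] when keyword occurs as a whole word."""
    if not (string and keyword):
        return []                      # the empty sequence, ()
    # Bind each value once, as a let clause or xsl:variable would.
    before = substring_before(string, keyword)
    after = substring_after(string, keyword)
    starts_ok = string.startswith(keyword) or before.endswith(" ")
    ends_ok = (not after) or after.startswith(" ")
    if starts_ok and ends_ok:
        return [before, keyword, after]
    return [string]


print(split_on_keyword("a big cat", "big"))
print(split_on_keyword("a bigger cat", "big"))
```

The point of the sketch is only that before and after are each computed
once and then reused, which is what the let clause is for.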

It's certainly true that you could add a let clause to XPath; you
could also add a where clause... and a sortby clause... and
typeswitches... and even element constructors... but what you end up
with is a replication of all the facilities of XSLT, but using a
non-XML syntax, and stuffed inside XML attributes.


Sequence constructors
---------------------

So I'd like to suggest an alternative. Instead of modifying XPath so
that it can do all the things that XSLT can do plus construct
sequences, why not modify XSLT so that it can construct general
sequences rather than just node sequences?

Doing this is (I *think*) simpler than it sounds. In XSLT 2.0,
"content constructors" are defined as [2]:

  "a sequence of nodes in the stylesheet that, when evaluated,
   constructs and returns a sequence of new nodes suitable for adding
   to the result tree. This sequence is referred to below as the
   result sequence."

If we modify that definition, so that "content constructors" don't
necessarily return *nodes* (they should probably then be called
"sequence constructors"):

   a sequence of nodes in the stylesheet that, when evaluated,
   constructs and returns a sequence. This sequence is referred to
   below as the result sequence.

We can amend the description of XSLT instructions in line with this:

XSLT instructions then produce a sequence of zero, one, or more items
as their result. These items are added to the result sequence. Some
instructions, such as xsl:element, return a newly-constructed node
(which may have its own attributes, namespaces, children, and other
descendants); others, such as xsl:if, return items produced by their
own nested sequence constructors.

[There are a couple of incompatibility problems here that I think can
 be handled; I'll come on to those later.]


Producing simple typed values and existing nodes
------------------------------------------------
 
All we need now is an element that can add a simple typed value or an
existing node to the result sequence. This could be achieved with an
xsl:item element:

  <!-- Category: instruction -->
  <xsl:item
    select = expression
    type = datatype>
    <!-- Content: sequence-constructor -->
  </xsl:item>

The xsl:item element works similarly to variable-binding elements: it
produces a sequence of items from either its select attribute or its
content. This enables you to add simple typed values or existing nodes
to a sequence.

For example, the equivalent to the for expression that we looked at
earlier would be:

  <xsl:variable name="new-orders" type="item*">
    <xsl:for-each select="$orders">
      <xsl:variable name="items"
                    select="item[(@price * @quantity) > 100]" />
      <xsl:item select="if (count($items) > 5)
                        then do:something($items)
                        else do:something-else($items)" />
    </xsl:for-each>
  </xsl:variable>

The $new-orders variable would have a value of a sequence of items.


Impact on XPath
---------------

Enabling XSLT to generate sequences will remove the requirement for
XPath to support expressions that involve range variables. For
example:

  <xsl:variable name="join" type="xs:integer*"
                select="for $i in (1, 2),
                            $j in (3, 4)
                        return ($i, $j)" />

could be done with:

  <xsl:variable name="join" type="xs:integer*">
    <xsl:for-each select="(1, 2)">
      <xsl:variable name="i" select="." />
      <xsl:for-each select="(3, 4)">
        <xsl:variable name="j" select="." />
        <xsl:item select="($i, $j)" />
      </xsl:for-each>
    </xsl:for-each>
  </xsl:variable>

[Of course a mapping operator would still be useful for simple cases.]
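
As a sanity check that the two forms build the same flat sequence,
here is a small Python model of the nested xsl:for-each version. It is
a sketch only: item() is a hypothetical stand-in for the proposed
xsl:item instruction, and a Python list stands in for the result
sequence.

```python
# Illustrative Python model; item() stands in for the proposed xsl:item.
from typing import Iterable, Iterator


def item(select: Iterable[int]) -> Iterator[int]:
    """Add each selected item to the result sequence (sequences never nest)."""
    yield from select


def join() -> list:
    result = []                        # the result sequence being built
    for i in (1, 2):                   # outer xsl:for-each, $i bound to "."
        for j in (3, 4):               # inner xsl:for-each, $j bound to "."
            result.extend(item((i, j)))
    return result


print(join())  # [1, 3, 1, 4, 2, 3, 2, 4]
```

Each ($i, $j) pair is flattened into the surrounding sequence, which
is the same value the for expression returns.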
  
It would also remove the requirement for the sort() function in XSLT
(and indeed for named sort specifications altogether), and for the
adoption of the sortby clause from XQuery, since the existing xsl:sort
can be used.

For example, instead of:

  <xsl:sort-key name="subtotal-sort">
    <xsl:sort select="@price * @quantity" data-type="number"
              order="descending" />
    <xsl:sort select="@part-id" order="ascending" />
  </xsl:sort-key>
  <xsl:variable name="sorted-items"
                select="sort($items, 'subtotal-sort')" />

you could do:

  <xsl:variable name="sorted-items">
    <xsl:for-each select="$items">
      <xsl:sort select="@price * @quantity" data-type="number"
                order="descending" />
      <xsl:sort select="@part-id" order="ascending" />
      <xsl:item select="." />
    </xsl:for-each>
  </xsl:variable>
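
The ordering that both versions describe can be modelled in a few
lines of Python. This is a sketch under assumptions: the Item class
and its field names are hypothetical, and plain Python string
comparison is used for part-id, which need not match the collation an
XSLT processor would apply.

```python
# Illustrative Python model of the two xsl:sort keys; not XSLT.
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    part_id: str
    price: float
    quantity: int


def sorted_items(items):
    """Sort by subtotal (descending), then by part id (ascending)."""
    return sorted(
        items,
        key=lambda it: (-(it.price * it.quantity), it.part_id),
    )


items = [Item("b", 10.0, 2), Item("a", 10.0, 2), Item("c", 50.0, 1)]
print([it.part_id for it in sorted_items(items)])  # ['c', 'a', 'b']
```

The result is a sequence of the original items in a new order, which
is what adding each one to the result sequence with xsl:item gives you.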


Impact on function definitions
------------------------------

Adding the xsl:item element allows us to get rid of the xsl:result
element when defining functions. The xsl:function element's new syntax
would be:

<xsl:function
  name = qname>
  <!-- Content: (xsl:param*, sequence-constructor) -->
</xsl:function>

The xsl:function element would simply return the sequence produced by
its sequence constructor.

For example:

  <xsl:function name="my:split-string">
    <xsl:param name="string" type="xs:string" />
    <xsl:param name="keyword" type="xs:string" />
    <xsl:if test="$string and $keyword">
      <xsl:variable name="before"
                    select="substring-before($string, $keyword)" />
      <xsl:variable name="after"
                    select="substring-after($string, $keyword)" />
      <xsl:item select="if ((starts-with($string, $keyword) or
                             ends-with($before, ' ')) and
                            (not($after) or starts-with($after, ' ')))
                        then ($before, $keyword, $after)
                        else $string" />
    </xsl:if>
  </xsl:function>


Impact on variable bindings
---------------------------

The current XSLT 2.0 WD states:

  "[ERR030] Elements such as xsl:variable, xsl:param, xsl:message,
   and xsl:result-document construct a new document node, and use the
   result sequence returned by the content constructor to form the
   children of this document node. In this case it is an dynamic error
   if the result sequence contains namespace or attribute nodes. The
   processor must either signal the error, or must recover by ignoring
   the offending nodes. The elements, comments, processing
   instructions, and text nodes in the node sequence form the children
   of the newly constructed document node."

I'll concentrate on variable-binding elements here (xsl:message and
xsl:result-document are handled in the next section).

Supporting the creation of sequences means that rather than create a
new document node, variable-binding elements must bind the variable to
the result sequence produced by their sequence constructor. This
sequence must be able to contain all kinds of nodes.

There is a backwards incompatibility here - if a variable is assigned
a value through the content of the variable-binding element, then
rather than conceptually holding the "root node of the result tree
fragment" as in XSLT 1.0, the variable holds a sequence of items
(nodes, assuming you're using the variable as in XSLT 1.0).

Currently, when users get the string value of a result tree fragment,
they get the string value of the *root node* of the result tree
fragment - the concatenation of the string values of the text node
descendants in the result tree fragment.

On the other hand, when users get the string value of a sequence, they
get the string value of the first item in the sequence.

Therefore if you have:

  <xsl:variable name="foo">
    <element>A</element>
    <element>B</element>
  </xsl:variable>

then string($foo) will give "AB" in XSLT 1.0 and just "A" in XSLT 2.0
(if sequence constructors were supported).

[I don't think that people get the string values of result tree
 fragments that contain elements very often because it's rarely useful
 to create a result tree fragment with internal structure and then
 proceed to ignore that internal structure, but it does happen.]
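
The difference between the two string values can be shown with a toy
node model in Python. This is only a sketch: the Text, Element and
Document classes are hypothetical stand-ins for the data model, and
string_of_sequence() models the "string value of the first item" rule
described above.

```python
# Illustrative Python model of the two string-value rules; not XSLT.
from dataclasses import dataclass, field


@dataclass
class Text:
    value: str


@dataclass
class Element:
    name: str
    children: list = field(default_factory=list)


@dataclass
class Document:
    children: list = field(default_factory=list)


def string_value(node) -> str:
    """Concatenate the text node descendants, in document order."""
    if isinstance(node, Text):
        return node.value
    return "".join(string_value(child) for child in node.children)


def string_of_sequence(seq) -> str:
    """String value of a sequence: that of its first item, or "" if empty."""
    return string_value(seq[0]) if seq else ""


foo = [Element("element", [Text("A")]), Element("element", [Text("B")])]
as_rtf = Document(children=foo)        # XSLT 1.0: root node of the fragment

print(string_value(as_rtf))            # AB
print(string_of_sequence(foo))         # A
```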

Another difference applies if people are used to using node-set()
extension functions to convert variables to node sets. As there is no
document node, addressing the items in the sequence does not involve
stepping down to them.

For example, given the above definition of $foo, the equivalent of the
following in XSLT 1.0:

  <xsl:for-each select="exsl:node-set($foo)/element">
    ...
  </xsl:for-each>

is simply:

  <xsl:for-each select="$foo">
    ...
  </xsl:for-each>

[There's an argument that XSLT 2.0 shouldn't have to worry about
 backwards compatibility with extension functions, but the node-set()
 extension function is very widely used and is based on the
 description of result tree fragments from XSLT 1.0.]
 
These backwards compatibility issues could be resolved by having the
type attribute on the variable-binding element determine the behaviour
of the variable-binding element. If the type attribute is not present,
then the variable-binding element creates a result tree (as described
later), and the variable is bound to a new document node; if the type
attribute is specified, then the variable is bound to the sequence.

[This is similar to the role played by the separator attribute on
 xsl:value-of.]


Allowing rootless nodes
-----------------------

Section 3.1 of the XSLT 2.0 WD [3] states:

  "The data model defined in [Data Model] allows a node to be part of
   a tree whose root is a node other than a document node.

  "Although such nodes may exist transiently during the course of XSLT
   processing, every node that is processed by an XSLT stylesheet
   (that is, a node that may be returned in the result of an
   expression) will belong to a tree whose root is a document node."

This will no longer be true. It will be possible to create sequences
containing nodes that do not have a parent.

I'm not certain why this restriction applies in XSLT, especially as it
is not a restriction in the data model or in XQuery. There might be
something here that causes problems for the whole
sequence-generation-using-content-constructors idea, but I'm not sure
what it would be.

If the suggestion for retaining backwards compatibility with
variable-binding elements is used, then if XSLT 2.0 is used like XSLT
1.0 (i.e. without type attributes on variable-binding elements, and
without user-defined functions) it is still true that every node that
may be returned in the result of an expression will belong to a tree
whose root is a document node.


Impact on result tree generation
--------------------------------

The final impact of this change is on result tree generation. This
applies to the construction of the content of element nodes, the
principal result tree, secondary result trees, messages, and tree
variables (those without a type attribute). It also applies, slightly
differently, to the construction of comment, attribute, processing
instruction, text and namespace nodes (which I'll call simple nodes
so that I don't have to repeat their names constantly).

Currently, content constructors construct a sequence of nodes, and
this sequence of nodes can be made into a result tree by adding a
parent node, or converted to a string to be used as the value of a
simple node. Under certain circumstances, the presence of certain
types of nodes in the node sequence is a recoverable dynamic error
(e.g. attribute nodes when creating a document; element nodes when
getting the string value for an attribute).

If we had the more general sequence constructors, result trees would
need to be constructed from sequences containing any mixture of simple
typed values and nodes (both newly created (rootless) and pre-existing
(rooted)), rather than those containing just newly created nodes.

Pre-existing nodes can be differentiated from newly created nodes by
the fact that they already have a parent, are already part of a tree,
and are therefore not rootless. With pre-existing nodes, there are
three options:

 - the pre-existing node is (deep) copied, and replaced in the
   sequence by the newly created copy (often inappropriate when
   the sequence provides a value for a simple node)

 - the pre-existing node is ignored
   
 - the presence of a pre-existing node in a sequence that's used to
   generate a result tree is a dynamic error, with one of the two
   above options as a recovery action

Similarly, there are three options for simple typed values:

 - the string value of the simple typed value is used as the value
   for a newly created text node, and replaced in the sequence by this
   newly created text node (which would have to be concatenated with
   surrounding text nodes)

 - the simple typed value is ignored

 - the presence of a simple typed value in a sequence that's used to
   generate a result tree is a dynamic error, with one of the two
   above options as a recovery action

In both cases I think that it's reasonable to make it an error, with
the creation of a node as the recovery action. Conceptually, once
pre-existing nodes and simple typed values have been substituted, the
sequence could be treated in exactly the same way as it is currently.
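
The recovery actions could be modelled roughly as follows, in Python
rather than XSLT. This is a sketch under assumptions: the Node class
and the normalize() and detached_copy() names are hypothetical, str()
stands in for taking the string value of a simple typed value, and
signalling the error itself is left out.

```python
# Illustrative Python model of the recovery actions; not XSLT.
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    kind: str                          # "element" or "text"
    value: str = ""                    # element name or text content
    children: list = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)


def detached_copy(node: Node) -> Node:
    """Deep copy a node; the copy itself is rootless (has no parent)."""
    clone = Node(node.kind, node.value)
    for child in node.children:
        child_copy = detached_copy(child)
        child_copy.parent = clone
        clone.children.append(child_copy)
    return clone


def normalize(sequence) -> list:
    """Turn a mixed sequence into new nodes suitable for a result tree."""
    out = []
    for it in sequence:
        if isinstance(it, Node):
            if it.parent is not None:  # pre-existing (rooted) node
                it = detached_copy(it)
        else:                          # simple typed value
            it = Node("text", str(it))
        if it.kind == "text" and out and out[-1].kind == "text":
            # Concatenate with the preceding text node.
            out[-1] = Node("text", out[-1].value + it.value)
        else:
            out.append(it)
    return out
```

After this step the sequence contains only newly created nodes, so it
can be handled exactly as a node sequence is handled now.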


Conclusions
-----------

If XPath were extended to be a usable method of generating sequences,
it would end up replicating the variable assignment and flow control
features that are already available within XSLT. While there is an
argument for constructing a language that performs transformations
without using XML syntax, that niche is already filled by XQuery. In
addition, because XPaths are used within attributes in XSLT, XSLT with
extended XPath will become a lot harder to read, write, and maintain
than the equivalent XSLT instructions.

Extending the concept of 'content constructors' to more general
'sequence constructors' and introducing an xsl:item element to add
simple typed values and pre-existing nodes to this sequence gives XSLT
the power to construct sequences of all descriptions. Rather than
learning one language for constructing sequences of nodes and a
different language with similar constructs for constructing other
sequences, you will only have to learn one, unified, language.


References
----------

[1] http://lists.w3.org/Archives/Public/www-xpath-comments/2002JanMar/0026.html
[2] http://www.w3.org/TR/xslt20/#dt-content-constructor
[3] http://www.w3.org/TR/xslt20/#rootless-nodes

---
Jeni Tennison
http://www.jenitennison.com/


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

