Re: [xsl] mixed content grouping by whitespace

Subject: Re: [xsl] mixed content grouping by whitespace
From: Wendell Piez <wapiez@xxxxxxxxxxxxxxxx>
Date: Tue, 13 Apr 2010 11:47:12 -0400
Hi,

On Gerrit's excellent explanation of group-adjacent....

At 06:52 PM 4/12/2010, he wrote:
This groups the nodes in the variable you've created by the boolean
(so the truth or falsehood of whether the pattern matches? I didn't
know you could do that in a group-* pattern) of the existence of the
segs you've created on tei:seg/text() which mark the whitespace.

There are two flavours of grouping conditions: patterns and expressions. group-starting-with and group-ending-with require patterns, while group-by and group-adjacent accept any XPath expression. The latter are applied to each item of the so-called population to calculate grouping keys; the former match specific nodes in the population that will lead or terminate a group.

It's really helpful to keep this distinction in mind. One sort of grouping works with a key; @group-by or @group-adjacent calculates that key. The other sort simply applies a match criterion to each node in the population to determine whether it's the particular sort of node (group-starting or group-ending) of interest for that sort of grouping.
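
As a rough sketch of the pattern-based flavour: the stylesheet below assumes a hypothetical input in which a tei:p holds mixed content and every run of whitespace has already been wrapped in tei:seg type="sep". The output element names (p, w) are invented for illustration and are not taken from the thread.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

  <!-- Assumed input: tei:p with whitespace pre-marked as tei:seg type="sep" -->
  <xsl:template match="tei:p">
    <p>
      <!-- group-starting-with takes a PATTERN, not a key expression:
           every node matching it opens a new group -->
      <xsl:for-each-group select="node()"
          group-starting-with="tei:seg[@type='sep']">
        <w>
          <!-- drop the separator that opened the group, keep the rest -->
          <xsl:copy-of
              select="current-group()[not(self::tei:seg[@type='sep'])]"/>
        </w>
      </xsl:for-each-group>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

Note that there is no key here, so current-grouping-key() has nothing useful to report; and two separators in a row would produce an empty w, which is one reason the key-based form can be tidier for this job.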


For all but the nodes marked up as whitespace in our example, evaluating self::tei:seg[@type='sep'] yields the empty sequence. Since the empty sequence cannot be used as a grouping key for group-adjacent [1], its boolean value is calculated, which is false for empty sequences [2]. I could have used empty() instead of boolean(), which would just flip each node's true()/false() key. In this case, I would have to swap the "when current-grouping-key" and the "otherwise" actions accordingly, or use test="not(current-grouping-key())".

Indeed; and "not(self::tei:seg[@type='sep'])" would work like empty().


Similarly, "exists(self::tei:seg[@type='sep'])" would work like boolean().
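
To make the Boolean-key variants concrete, here is a minimal sketch under the same assumption as in the example being discussed (whitespace pre-marked as tei:seg type="sep" inside a tei:p). The output element names p and w are hypothetical.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="tei">

  <xsl:template match="tei:p">
    <p>
      <!-- Key is true() for separator segs, false() for everything else.
           exists(self::tei:seg[@type='sep']) computes the same key.
           not(...) or empty(...) would flip it, so the two branches
           below would then have to be swapped. -->
      <xsl:for-each-group select="node()"
          group-adjacent="boolean(self::tei:seg[@type='sep'])">
        <xsl:choose>
          <xsl:when test="current-grouping-key()">
            <!-- a run of adjacent separators collapses to one space -->
            <xsl:text> </xsl:text>
          </xsl:when>
          <xsl:otherwise>
            <!-- a run of adjacent non-separator nodes becomes one w -->
            <w>
              <xsl:copy-of select="current-group()"/>
            </w>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each-group>
    </p>
  </xsl:template>

</xsl:stylesheet>
```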

The main thing is that splitting logic is really "group-adjacent" logic in which the key is used to assign nodes to the categories for splitting. Another illustration of this principle would be group-adjacent="ceiling(position() div 5)", which splits into groups of five members (with the last group given the remainder).

Here (the most common case for splitting) those categories are two, hence the expressions returning Boolean values. Booleans are nice since we can then examine current-grouping-key() straightforwardly with a test to tell which sort of group (of the two sorts) one is in.
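
For comparison, the group-adjacent="ceiling(position() div 5)" illustration could be sketched as follows, with a key that takes more than two values. The list/item input and the chunk wrapper are hypothetical names chosen for the sketch.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Assumed input: a list element with any number of item children -->
  <xsl:template match="list">
    <list>
      <!-- position() is the item's position in the population, so the
           key is 1 for items 1 to 5, 2 for items 6 to 10, and so on -->
      <xsl:for-each-group select="item"
          group-adjacent="ceiling(position() div 5)">
        <chunk n="{current-grouping-key()}">
          <xsl:copy-of select="current-group()"/>
        </chunk>
      </xsl:for-each-group>
    </list>
  </xsl:template>

</xsl:stylesheet>
```

With twelve items this gives three chunks holding five, five, and two items.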

In the word wrap example, it's a matter of taste whether to use group-starting-with or group-adjacent. But try to tackle the group-adjacent example given in the spec [3] using group-starting-with (or group-ending-with), and you'll find yourself writing all kinds of complicated lookaheads and lookbehinds that for-each-group promised to liberate you from. The same holds for trying to solve group-starting-with problems using group-adjacent. There's a reason They created all four forms of for-each-group. And They saw it was good.

Sometimes it's a matter of taste, and sometimes it's a tough call; but group-adjacent is frequently more elegant.


Cheers,
Wendell



======================================================================
Wendell Piez                            mailto:wapiez@xxxxxxxxxxxxxxxx
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
  Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
