Subject: Re: [xsl] breaking up XML on page break element
From: "Michael Kay mike@xxxxxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 16 Jul 2014 22:17:40 -0000

As Gerrit says, there are considerable challenges in streaming this.

As the error message indicates, Saxon 9.5 is quite a way behind the W3C spec
in terms of ability to do streamed grouping. It's been one of the toughest
implementation challenges. Saxon 9.6 will give improved coverage, but still
with restrictions.

There are all sorts of streaming restrictions violated by this code, e.g.
binding a streamed node to a variable (and to a template parameter), multiple
downward selections, etc.

The call on select="descendant::node()[not(node())]" is problematic because
the analysis can't tell that this is selecting a "striding" node-set (that is,
a sequence of nodes none of which is an ancestor of any other). I think the
"spec" solution would be to write

select="outermost(descendant::node()[not(has-children(.))])"

but that's not going to work yet in Saxon (making has-children() motionless
requires a bit of lookahead which Saxon streaming doesn't yet implement).
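
To make "striding" concrete, here is a small sketch in Python rather than
XSLT. It is an illustration only: the helper names (descendants, ancestors,
outermost) are made up for this example, and xml.dom.minidom merely stands in
for the XPath data model. It suggests why the childless descendants are
already striding, so that outermost() would leave them unchanged and would
act only as a hint to the static analysis:

```python
from xml.dom import minidom


def descendants(node):
    """Yield all descendants of node in document order (text nodes included)."""
    for child in node.childNodes:
        yield child
        yield from descendants(child)


def ancestors(node):
    """Yield the ancestors of node, nearest first."""
    parent = node.parentNode
    while parent is not None:
        yield parent
        parent = parent.parentNode


def outermost(nodes):
    """Keep only the nodes that have no ancestor in the same sequence."""
    nodes = list(nodes)
    return [n for n in nodes
            if not any(a is m for a in ancestors(n) for m in nodes)]


doc = minidom.parseString(
    "<book><title>t</title><section><para>aaa<pb/>bbb</para></section></book>")
book = doc.documentElement

# Equivalent of descendant::node()[not(node())]: the childless nodes.
leaves = [n for n in descendants(book) if not n.hasChildNodes()]

# A childless node cannot be anyone's ancestor, so nothing is removed.
print(outermost(leaves) == leaves)  # True

# By contrast, the full descendant axis is not striding.
print([n.nodeName for n in outermost(descendants(book))])  # ['title', 'section']
```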

But the real problem is that the logic is going down to descendants, then up
to their ancestors, and then down again, and that's intrinsically not
processing nodes in document order, which is a precondition for streaming.

I suspect that any streaming algorithm for this is going to have to be
tag-based rather than node-based (when X occurs, output these end tags...) and
that's out of scope for the XSLT 3.0 streaming model.
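
By "tag-based" I mean something along the lines of the following sketch,
written against Python's xml.sax rather than XSLT because it works on events,
not nodes. Treat it as a rough illustration under stated assumptions: the
class and function names are hypothetical, the page break is assumed to be an
empty element called pb, and namespaces, comments and processing instructions
are ignored. Only the stack of currently open elements is held in memory:

```python
import io
import xml.sax
from xml.sax.saxutils import escape, quoteattr


def attr_text(attrs):
    """Serialize a name/value mapping as XML attributes."""
    return "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())


class PageBreakSplitter(xml.sax.ContentHandler):
    """Hoist each page break to just below the root by closing and
    reopening the elements that are open when it occurs."""

    def __init__(self, out, split_name="pb"):
        super().__init__()
        self.out = out
        self.split_name = split_name
        self.open_elements = []  # stack of (name, attributes), root first

    def startElement(self, name, attrs):
        attrs = dict(attrs.items())
        if name == self.split_name:
            inner = self.open_elements[1:]  # everything below the root
            for open_name, _ in reversed(inner):
                self.out.write(f"</{open_name}>")       # output the end tags
            self.out.write(f"<{name}{attr_text(attrs)}/>")
            for open_name, open_attrs in inner:
                self.out.write(f"<{open_name}{attr_text(open_attrs)}>")
            return
        self.open_elements.append((name, attrs))
        self.out.write(f"<{name}{attr_text(attrs)}>")

    def endElement(self, name):
        if name == self.split_name:
            return  # already written as an empty element
        self.open_elements.pop()
        self.out.write(f"</{name}>")

    def characters(self, content):
        self.out.write(escape(content))


def split_at_page_breaks(xml_text, split_name="pb"):
    out = io.StringIO()
    handler = PageBreakSplitter(out, split_name)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return out.getvalue()


print(split_at_page_breaks(
    "<book><title>t</title><section><para>aaa<pb/>bbb</para></section></book>"))
# <book><title>t</title><section><para>aaa</para></section><pb/><section><para>bbb</para></section></book>
```

Unlike the grouping stylesheet, this naive version can emit empty elements
when a pb is the first or last thing inside its parent.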

But never say never: there might be some way to do it, perhaps a
multi-phase streaming solution.

Michael Kay
Saxonica
mike@xxxxxxxxxxxx
+44 (0118) 946 5893



On 16 Jul 2014, at 17:46, Michael Müller-Hillebrand mmh@xxxxxxxxx
<xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:

> Hi Gerrit,
>
> What a great, cool solution! I am thinking of applying this to split FO
> page-sequences.
>
> Because we will have to deal with very large documents, the question of
> streaming comes to mind.
> I have not yet looked up the many, many restrictions for streamable
> expressions, but does any of you have a feeling whether streaming could be
> used here? Like: process each segment of relevant nodes separately.
>
> Thanks for your suggestions,
>
> - Michael
>
> On 04.07.2014 at 20:20, Imsieke, Gerrit, le-tex
> gerrit.imsieke@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx> wrote:
>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>> version="2.0">
>>
>> <xsl:output indent="yes"/>
>>
>> <xsl:template match="* | @*" mode="#default">
>>   <xsl:copy>
>>     <xsl:apply-templates select="@*, node()" mode="#current"/>
>>   </xsl:copy>
>> </xsl:template>
>>
>> <xsl:template match="book" mode="#default">
>>   <xsl:variable name="context" select="." as="element(book)" />
>>   <xsl:copy>
>>     <xsl:for-each-group select="descendant::node()[not(node())]"
>>         group-starting-with="pb">
>>       <xsl:copy-of select="self::pb"/>
>>       <xsl:apply-templates select="$context/*" mode="split">
>>         <xsl:with-param name="restricted-to"
>>             select="current-group()/ancestor-or-self::node()" tunnel="yes"/>
>>       </xsl:apply-templates>
>>     </xsl:for-each-group>
>>   </xsl:copy>
>> </xsl:template>
>>
>> <xsl:template match="node()" mode="split">
>>   <xsl:param name="restricted-to" as="node()+" tunnel="yes" />
>>   <xsl:if test="exists(. intersect $restricted-to)">
>>     <xsl:copy>
>>       <xsl:copy-of select="@*" />
>>       <xsl:apply-templates mode="#current" />
>>     </xsl:copy>
>>   </xsl:if>
>> </xsl:template>
>>
>> <xsl:template match="pb" mode="split"/>
>>
>> </xsl:stylesheet>
>>
>> On 04.07.2014 18:31, Geert Bormans geert@xxxxxxxxxxxxxxxxxxx wrote:
>>> Thanks Gerrit,
>>> (I admit I need to read this twice to get it, but that might be caused
>>> by the 0-1 and me trying not to miss all of the fun in Rio)
>>> I will look into it after the match
>>>
>>>
>>> At 17:18 4/07/2014, you wrote:
>>>> I tackle it by what I call "upward projection":
>>>>
>>>> When processing the top-level element, do a for-each-group of all
>>>> descendants that are terminal nodes (those without children), with a
>>>> group-starting-with at the splitting points.
>>>>
>>>> For each group, process the book (or the HTML body, or whatever common
>>>> ancestor there is) once in another mode, with a tunneled parameter
>>>> 'restricted-to' that contains, for each group, the terminal nodes and
>>>> their ancestors.
>>>>
>>>> When processing each group, for each node that you encounter, test
>>>> whether the node is contained in the tunneled variable (using
>>>> intersect). If it is, reproduce the node and continue in this mode; if
>>>> it isn't contained, do nothing.
>>>>
>>>> There may be an option to discard or to reproduce the splitting
>>>> elements.
>>>>
>>>> Examples for this technique are in
>>>> https://subversion.le-tex.de/common/evolve-hub/evolve-hub.xsl, modes
>>>> hub:split-at-tab and hub:split-at-br
>>>>
>>>> They are a bit more complex than your case because they split
>>>> paragraphs that may contain tables or footnotes that in turn can
>>>> contain other paragraphs. I introduced the function
>>>> hub:same-scope($splitting-element, $containing-element) to split only
>>>> at splitting elements that are contained within the paragraph that
>>>> should be split, rather than in a paragraph that is contained in a
>>>> footnote or table cell that is somehow contained in the given paragraph.
>>>>
>>>> I might prepare a synthetic standalone example if anyone is
>>>> interested, and furthermore on the condition that interested parties
>>>> root for Germany instead of France today.
>>>>
>>>> Gerrit
>>>>
>>>> On 04.07.2014 16:43, Geert Bormans geert@xxxxxxxxxxxxxxxxxxx wrote:
>>>>> Hi all,
>>>>>
>>>>> Here is a fun one I thought I could share
>>>>>
>>>>> I have a nicely nested XML (a bit TEI like)
>>>>> and markers for page breaks can happen everywhere in the document (as
>>>>> empty elements)
>>>>>
>>>>> Now I want to break the document per page, reconstructing the structure
>>>>> So in a first step, I want to isolate the pagebreak to the highest
>>>>> level
>>>>>
>>>>> <book>
>>>>> <title>...</title>
>>>>> <section>
>>>>> <para>aaa<pb/>bbb</para>
>>>>> </section>
>>>>> </book>
>>>>>
>>>>> to become
>>>>>
>>>>> <book>
>>>>> <title>...</title>
>>>>> <section>
>>>>> <para>aaa</para>
>>>>> </section>
>>>>> <pb/>
>>>>> <section>
>>>>> <para>bbb</para>
>>>>> </section>
>>>>> </book>
>>>>>
>>>>> Bearing in mind I need a generic solution
>>>>> and pagebreaks can happen at every level
>>>>>
>>>>> Any thoughts?
>>>>> I am not looking for code, just curious about how people would attack this
>>>>>
>>>>> Thanks
>>>>>
>>>>> Geert
