Re: [xsl] breaking up XML on page break element

Subject: Re: [xsl] breaking up XML on page break element
From: "Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx" <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>
Date: Wed, 16 Jul 2014 21:49:54 -0000
Thanks for the positive feedback, Michael.

I don't have a clue as to how to make this technique streamable. The primary issue seems to be that the select expression of the for-each-group is free-ranging. Saxon's streamability analysis objects to the expression in two ways. First, descendant::node() is deemed free-ranging (while */*/node(), for example, isn't). Second, the predicate [not(node())] would have to be motionless (it may not look around in the input stream), and a test for child nodes is not. So Saxon bails out early.

When I was looking for a workaround, Saxon told me this:

select="*/pb | */*/pb | */*/*/pb | */text() | */*/text() | */*/*/text()"
-> SXST0070: Template rule cannot be streamed, although it is guaranteed streamable according to W3C rules. In a streamable for-each-group, if the body reads the streamed input then the select expression must select nodes from the streamed input


I'm not an expert on streaming either, but I guess that Saxon is right to refuse this approach, because gathering the population to be grouped requires an unbounded amount of lookahead into the stream. One needs to adopt a different approach; I don't know which.

Anyway, I think that splitting large input files at milestone elements anywhere down the tree is a valid use case for streaming.

Corollary: If someone comes up with a streaming XSLT solution to this problem, I henceforth shall believe in streaming XSLT.

Corollary 2: How would people tackle this with streaming-by-default languages, for example while they see a SAX stream rolling by?

Gerrit

On 16.07.2014 18:46, Michael Müller-Hillebrand mmh@xxxxxxxxx wrote:
Hi Gerrit,

What a great, cool solution! I am thinking of applying this to split FO page-sequences.

Because we will have to deal with very large documents, the question of streaming comes to mind.
I have not yet looked up the many restrictions on streamable expressions, but does any of you have a feeling for whether streaming could be used here? For example: process each segment of relevant nodes separately.

Thanks for your suggestions,

- Michael

Am 04.07.2014 um 20:20 schrieb Imsieke, Gerrit, le-tex gerrit.imsieke@xxxxxxxxx <xsl-list-service@xxxxxxxxxxxxxxxxxxxxxx>:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  version="2.0">

<xsl:output indent="yes"/>

  <xsl:template match="* | @*" mode="#default">
    <xsl:copy>
      <xsl:apply-templates select="@*, node()" mode="#current"/>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="book" mode="#default">
    <xsl:variable name="context" select="." as="element(book)" />
    <xsl:copy>
      <xsl:for-each-group select="descendant::node()[not(node())]" group-starting-with="pb">
        <xsl:copy-of select="self::pb"/>
        <xsl:apply-templates select="$context/*" mode="split">
          <xsl:with-param name="restricted-to" select="current-group()/ancestor-or-self::node()" tunnel="yes"/>
        </xsl:apply-templates>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <xsl:template match="node()" mode="split">
    <xsl:param name="restricted-to" as="node()+" tunnel="yes" />
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy>
        <xsl:copy-of select="@*" />
        <xsl:apply-templates mode="#current" />
      </xsl:copy>
    </xsl:if>
  </xsl:template>

<xsl:template match="pb" mode="split"/>

</xsl:stylesheet>

On 04.07.2014 18:31, Geert Bormans geert@xxxxxxxxxxxxxxxxxxx wrote:
Thanks Gerrit,
(I admit I need to read this twice to get it, but that might be caused
by the 0-1 and me trying not to miss all of the fun in Rio.)
I will look into it after the match.


At 17:18 4/07/2014, you wrote:
I tackle it by what I call "upward projection":

When processing the top-level element, do a for-each-group of all
descendants that are terminal nodes (those without children), with a
group-starting-with at the splitting points.

For each group, process the book (or the HTML body, or whatever common
ancestor there is) once in another mode, with a tunneled parameter
'restricted-to' that contains, for each group, the terminal nodes and
their ancestors.

When processing each group, for each node that you encounter, test
whether the node is contained in the tunneled parameter (using
intersect). If it is, reproduce the node and continue in this mode; if
it isn't contained, do nothing.

There may be an option to discard or to reproduce the splitting elements.
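
One way such an option could look is sketched below. This is only an illustration under stated assumptions: the stylesheet parameter keep-splitters is a hypothetical name that does not appear in evolve-hub, and the element names book and pb are taken from the example in this thread.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  exclude-result-prefixes="xs">

  <!-- Hypothetical switch (assumed name): true() reproduces each pb
       between the pieces, false() discards it -->
  <xsl:param name="keep-splitters" as="xs:boolean" select="true()"/>

  <xsl:template match="book">
    <xsl:variable name="context" select="." as="element(book)"/>
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <!-- Group all terminal nodes; each pb starts a new group -->
      <xsl:for-each-group select="descendant::node()[not(node())]"
                          group-starting-with="pb">
        <!-- The context item is the first node of the group, which is
             the pb that started it (except possibly in the first group) -->
        <xsl:copy-of select="self::pb[$keep-splitters]"/>
        <xsl:apply-templates select="$context/*" mode="split">
          <xsl:with-param name="restricted-to" tunnel="yes"
                          select="current-group()/ancestor-or-self::node()"/>
        </xsl:apply-templates>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

  <!-- Reproduce a node only if it belongs to the current group
       or is an ancestor of one of its members -->
  <xsl:template match="node()" mode="split">
    <xsl:param name="restricted-to" as="node()+" tunnel="yes"/>
    <xsl:if test="exists(. intersect $restricted-to)">
      <xsl:copy>
        <xsl:copy-of select="@*"/>
        <xsl:apply-templates mode="#current"/>
      </xsl:copy>
    </xsl:if>
  </xsl:template>

  <!-- pb is emitted (or dropped) in the book template, never inside a piece -->
  <xsl:template match="pb" mode="split"/>

</xsl:stylesheet>
```

With keep-splitters set to false(), the pieces are still separated into sibling copies of the enclosing structure, only without a pb between them.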

Examples for this technique are in
https://subversion.le-tex.de/common/evolve-hub/evolve-hub.xsl, modes
hub:split-at-tab and hub:split-at-br

They are a bit more complex than your case because they split
paragraphs that may contain tables or footnotes that in turn can
contain other paragraphs. I introduced the function
hub:same-scope($splitting-element, $containing-element) to split only
at splitting elements that are contained within the paragraph that
should be split, rather than in a paragraph that is contained in a
footnote or table cell that is somehow contained in the given paragraph.
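
To illustrate the idea behind such a scope test, here is a minimal sketch. It is not the actual hub:same-scope code from evolve-hub: the function name my:same-scope, its namespace URI, the mode name list-splitters and the element names para, br, footnote and td are all assumptions made for this example.

```xml
<xsl:stylesheet version="2.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:xs="http://www.w3.org/2001/XMLSchema"
  xmlns:my="urn:x-example:split"
  exclude-result-prefixes="xs my">

  <!-- Hypothetical scope test. Returns true if no footnote or table cell
       (assumed element names) lies between $splitter and $container,
       i.e. if $splitter is not nested in a scope-establishing element
       that is itself a descendant of $container. -->
  <xsl:function name="my:same-scope" as="xs:boolean">
    <xsl:param name="splitter" as="node()"/>
    <xsl:param name="container" as="element()"/>
    <xsl:sequence select="
      not($splitter/ancestor::*[self::footnote or self::td]
                               [some $a in ancestor::* satisfies $a is $container])"/>
  </xsl:function>

  <!-- Usage: select only those br descendants of a paragraph at which
       this paragraph itself should be split -->
  <xsl:template match="para" mode="list-splitters">
    <xsl:sequence select=".//br[my:same-scope(., current())]"/>
  </xsl:template>

</xsl:stylesheet>
```

The function assumes that the caller only passes splitting elements that are descendants of the container, as the .//br selection above does.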

I might prepare a synthetic standalone example if anyone is
interested, and furthermore on the condition that interested parties
root for Germany instead of France today.

Gerrit

On 04.07.2014 16:43, Geert Bormans geert@xxxxxxxxxxxxxxxxxxx wrote:
Hi all,

Here is a fun one I thought I could share

I have a nicely nested XML document (a bit TEI-like),
and markers for page breaks can occur anywhere in the document (as
empty elements).

Now I want to break the document up per page, reconstructing the structure.
So in a first step, I want to pull each page break up to the highest level:

<book>
<title>...</title>
<section>
<para>aaa<pb/>bbb</para>
</section>
</book>

to become

<book>
<title>...</title>
<section>
<para>aaa</para>
</section>
<pb/>
<section>
<para>bbb</para>
</section>
</book>

Bear in mind that I need a generic solution
and that page breaks can occur at every level.

Any thoughts?
I am not looking for code, just curious about how people would attack this.

Thanks

Geert



--
Gerrit Imsieke
Geschäftsführer / Managing Director
le-tex publishing services GmbH
Weissenfelser Str. 84, 04229 Leipzig, Germany
Phone +49 341 355356 110, Fax +49 341 355356 510
gerrit.imsieke@xxxxxxxxx, http://www.le-tex.de

Registergericht / Commercial Register: Amtsgericht Leipzig
Registernummer / Registration Number: HRB 24930

Geschäftsführer / Managing Directors: Gerrit Imsieke, Svea Jelonek,
Thomas Schmidt, Dr. Reinhard Vöckler
