RE: [xsl] Dividing documents based on size of contents

Subject: RE: [xsl] Dividing documents based on size of contents
From: Emmanuel Begue <eb@xxxxxxxxxx>
Date: Wed, 27 May 2009 09:23:57 +0200

Hello,

This is a grouping problem.

Given this source document:

<document>
	<documentDivision>
		<pagebreak num="50"/>
		</documentDivision>
	<documentDivision>
		<pagebreak num="50"/>
		</documentDivision>
	<documentDivision>
		<pagebreak num="127"/>
		</documentDivision>
	<documentDivision>
		<pagebreak num="5"/>
		</documentDivision>
	<documentDivision>
		<pagebreak num="23"/>
		</documentDivision>
	<documentDivision>
		<pagebreak num="78"/>
		</documentDivision>
	</document>

where @num stands in for the number of pagebreaks in each
division (in your real document you would compute it by
counting the pagebreak elements), you can get what you
want with this XSLT 2.0 template:

<xsl:template match="/document">
  <xsl:copy>
    <xsl:for-each-group select="documentDivision"
      group-adjacent="floor(sum(pagebreak/@num) div 100) = 1">
      <doc>
        <xsl:attribute name="pagebreaks"
          select="sum(current-group()/pagebreak/@num)"/>
        <xsl:copy-of select="current-group()"/>
        </doc>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

which results in:

<document>
   <doc pagebreaks="100">
      <documentDivision>
         <pagebreak num="50"/>
      </documentDivision>
      <documentDivision>
         <pagebreak num="50"/>
      </documentDivision>
   </doc>
   <doc pagebreaks="127">
      <documentDivision>
         <pagebreak num="127"/>
      </documentDivision>
   </doc>
   <doc pagebreaks="106">
      <documentDivision>
         <pagebreak num="5"/>
      </documentDivision>
      <documentDivision>
         <pagebreak num="23"/>
      </documentDivision>
      <documentDivision>
         <pagebreak num="78"/>
      </documentDivision>
   </doc>
</document>

Of course, as stated above, you need to adjust the "group-adjacent"
attribute so that it uses a proper method to count pagebreaks according
to your actual source document.
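
As a rough sketch of that adjustment, the stylesheet below counts the
pagebreak elements themselves instead of reading @num. I am assuming
here that pagebreaks may sit at any depth inside a documentDivision
(hence ".//pagebreak"); if they are always direct children, plain
"pagebreak" would do.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <xsl:template match="/document">
    <xsl:copy>
      <!-- Same boolean key as above, but the page count of a division
           is now the number of pagebreak elements it contains.
           Assumption: pagebreaks can be nested at any depth. -->
      <xsl:for-each-group select="documentDivision"
        group-adjacent="floor(count(.//pagebreak) div 100) = 1">
        <doc pagebreaks="{count(current-group()//pagebreak)}">
          <xsl:copy-of select="current-group()"/>
        </doc>
      </xsl:for-each-group>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```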

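Also, note that group-adjacent does not keep a running total: it
simply puts consecutive elements with the same key value into one
group, and the key here only says whether a single division has
between 100 and 199 pages. So if you have this sequence
of pagebreaks:
1
97
250

the key is false for all three (floor(250 div 100) is 2, not 1) and
you will get one doc with all those 348 pages, whereas you might
have preferred to have one doc with 98 pages and another with 250.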
But you can tweak that, maybe with multiple passes.
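
One possible tweak, as a sketch only: instead of for-each-group, a
recursive named template can carry the divisions collected so far as a
parameter and close the current doc when adding the next division would
move it further away from the target. The names "split", "remaining",
"pending" and "target", and the "closest to 100" rule itself, are my
own assumptions; page counts are still read from @num as in the sample
source above.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output indent="yes"/>

  <!-- Assumed target size of each output doc, in pages. -->
  <xsl:variable name="target" select="100"/>

  <xsl:template match="/document">
    <xsl:copy>
      <xsl:call-template name="split">
        <xsl:with-param name="remaining" select="documentDivision"/>
        <xsl:with-param name="pending" select="()"/>
      </xsl:call-template>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical helper: $remaining holds the divisions not yet
       placed, $pending the divisions collected for the current doc. -->
  <xsl:template name="split">
    <xsl:param name="remaining" as="element(documentDivision)*"/>
    <xsl:param name="pending" as="element(documentDivision)*"/>
    <!-- Pages collected so far, and pages if the next division is added. -->
    <xsl:variable name="have" select="sum($pending/pagebreak/@num)"/>
    <xsl:variable name="with-next"
      select="$have + sum($remaining[1]/pagebreak/@num)"/>
    <xsl:choose>
      <!-- Nothing left: flush whatever is still pending. -->
      <xsl:when test="empty($remaining)">
        <xsl:if test="exists($pending)">
          <doc pagebreaks="{$have}">
            <xsl:copy-of select="$pending"/>
          </doc>
        </xsl:if>
      </xsl:when>
      <!-- The current doc is already closer to the target than it
           would be with the next division: close it and start over. -->
      <xsl:when test="exists($pending) and
                      abs($have - $target) lt abs($with-next - $target)">
        <doc pagebreaks="{$have}">
          <xsl:copy-of select="$pending"/>
        </doc>
        <xsl:call-template name="split">
          <xsl:with-param name="remaining" select="$remaining"/>
          <xsl:with-param name="pending" select="()"/>
        </xsl:call-template>
      </xsl:when>
      <!-- Otherwise move the next division into the current doc. -->
      <xsl:otherwise>
        <xsl:call-template name="split">
          <xsl:with-param name="remaining"
            select="subsequence($remaining, 2)"/>
          <xsl:with-param name="pending"
            select="$pending, $remaining[1]"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

On the sample source above this should give the same three docs (100,
127 and 106 pages), and for the 1 / 97 / 250 sequence it should close
the first doc at 98 pages and put the 250-page division in a doc of
its own.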

Hope this helps,
Regards,
EB

> -----Original Message-----
> From: Chris von See [mailto:chris@xxxxxxxxxxxxx]
> Sent: Wednesday, May 27, 2009 3:54 AM
> To: xsl-list
> Subject: [xsl] Dividing documents based on size of contents
>
>
> Hi all -
>
> I have what I think is a fairly simple problem, but I'm having trouble
> with the implementation in XSLT.  Any help you could give would be
> greatly appreciated.
>
> I have a document which is subdivided into multiple sections, with
> each section, in turn, divided into pages as shown below:
>
> <document>
> 	<documentDivision>
> 		... arbitrary content ...
> 		<pagebreak />
> 		... arbitrary content ...
> 		<pagebreak />
> 	</documentDivision>
>
> 	... arbitrary number of <documentDivision> elements ...
>
> </document>
>
> Each <documentDivision> section of the document can have an arbitrary
> number of <pagebreak> elements, and an arbitrary amount of content
> between <pagebreak>s.
>
> I'd like to be able to break the input <document> into multiple
> <document>s, each of which has the minimum number of
> <documentDivision> sections that give it a <pagebreak> count ~100
> pages.  I'd like to break the input at <documentDivision> boundaries,
> but I don't need the output documents to be equally sized or to be
> exactly 100 pages long - just as close to that size as I can
> reasonably get while maintaining the <documentDivision> boundaries.
>
> So for example if I have an input document that looks like this:
>
> <document>
> 	<documentDivision>
> 		... content containing 50 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 50 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 127 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 5 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 23 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 78 <pagebreak /> elements ...
> 	</documentDivision>
> </document>
>
> the output documents should look like this, with each output document
> being "close" to 100 pages in length:
>
> <!-- This doc has enough <documentDivision> elements to give exactly
> 100 pages. -->
> <document>
> 	<documentDivision>
> 		... content containing 50 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 50 <pagebreak /> elements ...
> 	</documentDivision>
> </document>
>
> <!-- This doc has a single <documentDivision> element with 127 pages -
> close enough! -->
> <document>
> 	<documentDivision>
> 		... content containing 127 <pagebreak /> elements ...
> 	</documentDivision>
> </document>
>
> <!-- This doc has three <documentDivision> elements of 5, 23 and 78
> pages each - close enough! -->
> <document>
> 	<documentDivision>
> 		... content containing 5 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 23 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 78 <pagebreak /> elements ...
> 	</documentDivision>
> </document>
>
> I've been able to figure out how to get the number of <pagebreak>s per
> <documentDivision> and how to calculate the number of <pagebreak>s in
> any given group of <documentDivision> sections, but what I'm not sure
> of is how to maintain information about the point at which I last
> created a new output document so that I can determine what group of
> <documentDivision> elements has a page count around 100 and should
> therefore be used to create a new output document.  It seems that the
> best way to carry this context would be via params to
> xsl:apply-templates, but I'm not clear on how to set up the XSLT code so that
> the state gets maintained as I iterate through <documentDivision>
> elements.  It also seems like there should be some XPath expression
> that I can use with xsl:for-each-group, but I can't quite figure out
> how to write that such that each group has only the minimum number of
> <documentDivision> elements needed to accumulate 100-ish pages.
>
> Do you have any guidance on ways to do this?  I think I'm just having
> a mental block, and a swift kick in the right direction should do the
> trick.
>
>
> Thanks
> Chris
