RE: [xsl] Dividing documents based on size of contents

Subject: RE: [xsl] Dividing documents based on size of contents
From: "Michael Kay" <mike@xxxxxxxxxxxx>
Date: Wed, 27 May 2009 09:12:51 +0100
I think this is a case for "sibling recursion" - in fact, it's the example I
use on training courses, if I think the group is capable of tackling the
problem (it tends to cause significant headaches, and people are typically
amazed that, after three hours of head-scratching, the answer turns out to
be about ten lines of code).

It's probably easiest to do this in two phases: the first phase copies the
documentDivision elements, inserting a <documentBreak/> element where
appropriate, and the second phase uses for-each-group
starting-with="documentBreak" to create the document elements.

The sibling recursion works like this:

 <xsl:template match="documentDivision">
   <!-- Running pagebreak count for the output document being built;
        defaults to 0 so the first documentDivision needs no parameter -->
   <xsl:param name="size-so-far" as="xs:integer" select="0"/>
   <xsl:variable name="new-size-so-far" as="xs:integer"
                 select="$size-so-far + count(pagebreak)"/>
   <!-- ge rather than gt, so that 50 + 50 pages closes a document at 100 -->
   <xsl:variable name="start-new-document" as="xs:boolean"
                 select="$new-size-so-far ge 100"/>
   <xsl:copy-of select="."/>
   <xsl:if test="$start-new-document">
     <documentBreak/>
   </xsl:if>
   <!-- Sibling recursion: pass the running total to the next division -->
   <xsl:apply-templates select="following-sibling::documentDivision[1]">
     <xsl:with-param name="size-so-far"
          select="if ($start-new-document) then 0 else $new-size-so-far"/>
   </xsl:apply-templates>
 </xsl:template>


and then you start the process off with:

 <xsl:template match="document">
   <xsl:apply-templates select="documentDivision[1]"/>
 </xsl:template>
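
To show how the two phases might fit together, here is a minimal sketch of
a complete stylesheet. It is one possible wiring, not the only one. The
<documents> wrapper element and the variable name "marked" are my own
assumptions for illustration; the "ge 100" threshold and the select="0"
default on the parameter are likewise assumptions, chosen so that the
example input in the original question splits as requested.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <xsl:output indent="yes"/>

  <xsl:template match="document">
    <!-- Phase 1: build a temporary tree holding copies of the
         documentDivision elements with documentBreak markers between them -->
    <xsl:variable name="marked">
      <xsl:apply-templates select="documentDivision[1]"/>
    </xsl:variable>
    <!-- Hypothetical wrapper so the result is a single well-formed tree -->
    <documents>
      <!-- Phase 2: each documentBreak starts a new group -->
      <xsl:for-each-group select="$marked/*"
                          group-starting-with="documentBreak">
        <!-- A documentBreak after the last division gives a group with
             no divisions in it; skip that group -->
        <xsl:if test="current-group()[self::documentDivision]">
          <document>
            <xsl:copy-of select="current-group()[self::documentDivision]"/>
          </document>
        </xsl:if>
      </xsl:for-each-group>
    </documents>
  </xsl:template>

  <!-- Phase 1 template: sibling recursion carrying the running page count -->
  <xsl:template match="documentDivision">
    <xsl:param name="size-so-far" as="xs:integer" select="0"/>
    <xsl:variable name="new-size-so-far" as="xs:integer"
                  select="$size-so-far + count(pagebreak)"/>
    <!-- Assumed threshold: close the document once it reaches 100 pages -->
    <xsl:variable name="start-new-document" as="xs:boolean"
                  select="$new-size-so-far ge 100"/>
    <xsl:copy-of select="."/>
    <xsl:if test="$start-new-document">
      <documentBreak/>
    </xsl:if>
    <xsl:apply-templates select="following-sibling::documentDivision[1]">
      <xsl:with-param name="size-so-far"
           select="if ($start-new-document) then 0 else $new-size-so-far"/>
    </xsl:apply-templates>
  </xsl:template>

</xsl:stylesheet>
```

If each output document should go to its own file rather than into one
wrapper, the <document> literal could be placed inside an
xsl:result-document instruction instead. Note also that count(pagebreak)
counts only pagebreak children of each documentDivision; if the pagebreaks
can be nested more deeply, count(.//pagebreak) would be needed.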

Regards,

Michael Kay
http://www.saxonica.com/
http://twitter.com/michaelhkay 


> -----Original Message-----
> From: Chris von See [mailto:chris@xxxxxxxxxxxxx] 
> Sent: 27 May 2009 02:54
> To: xsl-list
> Subject: [xsl] Dividing documents based on size of contents
> 
> Hi all -
> 
> I have what I think is a fairly simple problem, but I'm 
> having trouble with the implementation in XSLT.  Any help you 
> could give would be greatly appreciated.
> 
> I have a document which is subdivided into multiple sections, 
> with each section, in turn, divided into pages as shown below:
> 
> <document>
> 	<documentDivision>
> 		... arbitrary content ...
> 		<pagebreak />
> 		... arbitrary content ...
> 		<pagebreak />
> 	</documentDivision>
> 
> 	... arbitrary number of <documentDivision> elements ...
> 
> </document>
> 
> Each <documentDivision> section of the document can have an 
> arbitrary number of <pagebreak> elements, and an arbitrary 
> amount of content between <pagebreak>s.
> 
> I'd like to be able to break the input <document> into 
> multiple <document>s, each of which has the minimum number of 
> <documentDivision> sections that give it a <pagebreak> count 
> ~100 pages.  I'd like to break the input at 
> <documentDivision> boundaries, but I don't need the output 
> documents to be equally sized or to be exactly 100 pages long 
> - just as close to that size as I can reasonably get while 
> maintaining the <documentDivision> boundaries.
> 
> So for example if I have an input document that looks like this:
> 
> <document>
> 	<documentDivision>
> 		... content containing 50 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 50 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 127 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 5 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 23 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 78 <pagebreak /> elements ...
> 	</documentDivision>
> </document>
> 
> the output documents should look like this, with each output 
> document being "close" to 100 pages in length:
> 
> <!-- This doc has enough <documentDivision> elements to give 
> exactly 100 pages. -->
> <document>
> 	<documentDivision>
> 		... content containing 50 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 50 <pagebreak /> elements ...
> 	</documentDivision>
> </document>
> 
> <!-- This doc has a single <documentDivision> element with 
> 127 pages - close enough! -->
> <document>
> 	<documentDivision>
> 		... content containing 127 <pagebreak /> elements ...
> 	</documentDivision>
> </document>
> 
> <!-- This doc has three <documentDivision> elements of 5, 
> 23 and 78 pages each - close enough! -->
> <document>
> 	<documentDivision>
> 		... content containing 5 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 23 <pagebreak /> elements ...
> 	</documentDivision>
> 	<documentDivision>
> 		... content containing 78 <pagebreak /> elements ...
> 	</documentDivision>
> </document>
> 
> I've been able to figure out how to get the number of 
> <pagebreak>s per <documentDivision> and how to calculate the 
> number of <pagebreak>s in any given group of 
> <documentDivision> sections, but what I'm not sure of is how 
> to maintain information about the point at which I last 
> created a new output document so that I can determine what 
> group of <documentDivision> elements has a page count around 
> 100 and should therefore be used to create a new output 
> document.  It seems that the best way to carry this context 
> would be via params to xsl:apply-templates, but I'm not 
> clear on how to set up the XSLT code so that the state gets 
> maintained as I iterate through <documentDivision> elements.  
> It also seems like there should be some XPath expression that 
> I can use with xsl:for-each-group, but I can't quite figure 
> out how to write that such that each group has only the 
> minimum number of <documentDivision> elements needed to 
> accumulate 100-ish pages.
> 
> Do you have any guidance on ways to do this?  I think I'm 
> just having a mental block, and a swift kick in the right 
> direction should do the trick.
> 
> 
> Thanks
> Chris
