[xsl] Dividing documents based on size of contents

Subject: [xsl] Dividing documents based on size of contents
From: Chris von See <chris@xxxxxxxxxxxxx>
Date: Tue, 26 May 2009 18:54:15 -0700

Hi all -

I have what I think is a fairly simple problem, but I'm having trouble with the implementation in XSLT. Any help you could give would be greatly appreciated.

I have a document which is subdivided into multiple sections, with each section, in turn, divided into pages as shown below:

<document>
	<documentDivision>
		... arbitrary content ...
		<pagebreak />
		... arbitrary content ...
		<pagebreak />
	</documentDivision>

... arbitrary number of <documentDivision> elements ...

</document>

Each <documentDivision> section of the document can have an arbitrary number of <pagebreak> elements, and an arbitrary amount of content between <pagebreak>s.

I'd like to be able to break the input <document> into multiple <document>s, each containing the minimum number of <documentDivision> sections needed to give it a <pagebreak> count of roughly 100. I'd like to break the input at <documentDivision> boundaries, but I don't need the output documents to be equally sized or to be exactly 100 pages long - just as close to that size as I can reasonably get while maintaining the <documentDivision> boundaries.

So for example if I have an input document that looks like this:

<document>
	<documentDivision>
		... content containing 50 <pagebreak /> elements ...
	</documentDivision>
	<documentDivision>
		... content containing 50 <pagebreak /> elements ...
	</documentDivision>
	<documentDivision>
		... content containing 127 <pagebreak /> elements ...
	</documentDivision>
	<documentDivision>
		... content containing 5 <pagebreak /> elements ...
	</documentDivision>
	<documentDivision>
		... content containing 23 <pagebreak /> elements ...
	</documentDivision>
	<documentDivision>
		... content containing 78 <pagebreak /> elements ...
	</documentDivision>
</document>

the output documents should look like this, with each output document being "close" to 100 pages in length:

<!-- This doc has enough <documentDivision> elements to give exactly 100 pages. -->
<document>
<documentDivision>
... content containing 50 <pagebreak /> elements ...
</documentDivision>
<documentDivision>
... content containing 50 <pagebreak /> elements ...
</documentDivision>
</document>


<!-- This doc has a single <documentDivision> element with 127 pages - close enough! -->
<document>
<documentDivision>
... content containing 127 <pagebreak /> elements ...
</documentDivision>
</document>


<!-- This doc has three <documentDivision> elements of 5, 23 and 78 pages each - close enough! -->
<document>
<documentDivision>
... content containing 5 <pagebreak /> elements ...
</documentDivision>
<documentDivision>
... content containing 23 <pagebreak /> elements ...
</documentDivision>
<documentDivision>
... content containing 78 <pagebreak /> elements ...
</documentDivision>
</document>


I've been able to figure out how to get the number of <pagebreak>s per <documentDivision> and how to calculate the number of <pagebreak>s in any given group of <documentDivision> sections. What I'm not sure of is how to keep track of the point at which I last created a new output document, so that I can determine which group of <documentDivision> elements has a page count around 100 and should therefore become a new output document. It seems that the best way to carry this context would be via params to xsl:apply-templates, but I'm not clear on how to set up the XSLT code so that the state gets maintained as I iterate through <documentDivision> elements. It also seems like there should be some XPath expression that I can use with xsl:for-each-group, but I can't quite figure out how to write it so that each group has only the minimum number of <documentDivision> elements needed to accumulate 100-ish pages.
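
For concreteness, here is a rough sketch of the param-carrying idea, assuming XSLT 2.0 (implied by the mention of xsl:for-each-group) and a greedy rule that closes an output document as soon as its running <pagebreak> count reaches the target. The template name "chunk", the params "pending" and "current", the "target" parameter and the <documents> wrapper element are all hypothetical names chosen for illustration. It uses a named template with xsl:call-template rather than xsl:apply-templates, and is only one possible shape for a solution:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs">

  <!-- Assumed target size: a chunk is closed once it reaches this many pages. -->
  <xsl:param name="target" as="xs:integer" select="100"/>

  <xsl:template match="/document">
    <!-- Hypothetical wrapper so that the result is a single well-formed tree. -->
    <documents>
      <xsl:call-template name="chunk">
        <xsl:with-param name="pending" select="documentDivision"/>
      </xsl:call-template>
    </documents>
  </xsl:template>

  <!-- $pending: divisions not yet placed in any output document.
       $current: divisions accumulated for the chunk being built. -->
  <xsl:template name="chunk">
    <xsl:param name="pending" as="element(documentDivision)*"/>
    <xsl:param name="current" as="element(documentDivision)*" select="()"/>
    <!-- The current chunk plus the next pending division (if any). -->
    <xsl:variable name="grown" select="$current, $pending[1]"/>
    <xsl:choose>
      <!-- Nothing left: flush whatever is accumulated as a final, short chunk. -->
      <xsl:when test="empty($pending)">
        <xsl:if test="exists($current)">
          <document>
            <xsl:copy-of select="$current"/>
          </document>
        </xsl:if>
      </xsl:when>
      <!-- Target reached: emit the chunk and start again with an empty one. -->
      <xsl:when test="count($grown//pagebreak) ge $target">
        <document>
          <xsl:copy-of select="$grown"/>
        </document>
        <xsl:call-template name="chunk">
          <xsl:with-param name="pending" select="$pending[position() gt 1]"/>
        </xsl:call-template>
      </xsl:when>
      <!-- Still under target: keep accumulating. -->
      <xsl:otherwise>
        <xsl:call-template name="chunk">
          <xsl:with-param name="pending" select="$pending[position() gt 1]"/>
          <xsl:with-param name="current" select="$grown"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Traced by hand against the example input above, this rule closes the first chunk at 50 + 50 = 100 pages, the second at 127, and the third at 5 + 23 + 78 = 106, which matches the three output documents shown. Any trailing divisions that never reach the target come out as a final, shorter document. To write separate files instead of one wrapped result, each <document> could presumably be placed inside an xsl:result-document instruction.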

Do you have any guidance on ways to do this? I think I'm just having a mental block, and a swift kick in the right direction should do the trick.


Thanks,
Chris
