Re: [xsl] Splitting an XML file based on size

Subject: Re: [xsl] Splitting an XML file based on size
From: dan mason <dmason@xxxxxxxxxxxxxxxx>
Date: Wed, 4 Apr 2001 10:30:24 -0400
Date: Tue, 3 Apr 2001 15:50:04 -0700
From: Adam Van Den Hoven <Adam.Hoven@xxxxxxxxxxxx>
Subject: [xsl] Splitting an XML file based on size

Hey guys,

I'm processing an NITF file into HTML. NITF is very much like HTML in that
it has a body with paragraph tags that has mixed content. The HTML that I am
creating from my transforms can quickly become several tens of KB in size.
Since I'm transferring this over a wireless modem to a PocketPC at a maximum
of 14.4 kbps, an HTML file that is 15 KB is entirely too big.

I need some way to keep track of the number of characters I've processed and
stop when I reach a specific size, stopping at the end of the paragraph. I
understand that counting characters is not very precise but I am only
interested in getting the transfer size to be less than 2K or so.

I used to work on the development of a mobile applications platform (NetMorf SiteMorfer) that had to deal with byte-size pagination (that's what we called this problem) in a flexible, automagic way for n applications and n devices, all of which had different digest sizes (some mandatory, others suggested, as for the Pocket PC, Palm, RIM, etc.), numbers of rows, numbers of accesskeys, and so on. The short answer is that it's not easy in general, and especially not in XSLT. Before I get flamed, let me try to explain why :) and invite people to produce a pure XSLT solution, because I know it's possible, but I also know that it's a royal pain in the behind (at least, the way I was trying to do it).

Solution 1 would be the pure XSLT solution. Like I said, I think it's possible, and your code snippet down below is a start. But I think it's going to be extremely hard to make a solution like that extensible (you may end up writing the same code for <p>, <table> and any other tags, just slightly different). Also, I'll go out on a limb here and make a blanket statement: XSLT (this version, anyway) is not supposed to be the end point of a delivery architecture. XSLT is designed for document transformation, so going from unpaginated NITF to unpaginated HTML is almost trivial, as you know. But it has no clue what device it's talking to, which a delivery architecture has to know and take into account. You could make your stylesheet aware of the device and its capabilities, although the colossal pain of keeping variables for byte size, number of rows, number of accesskeys (for phones), and linking to the data you didn't have room for will keep you up nights.

You could probably use extension functions or calls out to Java classes to give you more power and a cleaner stylesheet, but it's still a pain (and I have no idea what the performance implications are). I don't know much about that stuff; it's possible that a few extension functions would be able to keep track of where you are and short-circuit the transformation when you overflow, but I don't remember whether they can be stateful. If not, Java calls would work; I ended up writing a Java class to catch and paginate tags as I wrote them, with varying levels of success.
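To make the "keep track of where you are and short-circuit" idea concrete, here is a rough sketch in Python rather than Java. It only shows the stateful part: an object that remembers how many bytes have been written and refuses further blocks once one does not fit. The names PageBudget, admit and first_page are made up for illustration, and wiring something like this into a particular XSLT processor as an extension is processor-specific and not shown.

```python
class PageBudget:
    """Running byte count that closes once a block no longer fits."""

    def __init__(self, budget):
        self.budget = budget   # maximum bytes allowed on the page
        self.used = 0          # bytes accepted so far
        self.closed = False    # True once any block has been refused

    def admit(self, markup):
        """Return True if this block should still be written to the page."""
        size = len(markup.encode("utf-8"))
        # The first block is always accepted so a page is never empty.
        if self.closed or (self.used > 0 and self.used + size > self.budget):
            self.closed = True
            return False
        self.used += size
        return True


def first_page(blocks, budget):
    """Keep leading blocks until the budget is hit; return (page, leftovers)."""
    tracker = PageBudget(budget)
    kept = [b for b in blocks if tracker.admit(b)]
    # Once the tracker closes it rejects everything, so `kept` is a prefix.
    return "".join(kept), blocks[len(kept):]
```

Each block here is an already-serialized element (a whole paragraph, table or list), so the cut always lands on a block boundary; the leftovers are what you would link to as the next page.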

Solution 2 would be to use XSLT for the transformation and build a separate pagination engine that takes in the output and chops it down to size. This makes a lot more sense to me: all you have to do is make sure you're spitting out XHTML, parse it, and go through and count bytes. You still have to decide what to do with the data you chop off, and you have to make sure you never chop off a valid end tag, things like that, but it's doable. I worked on a prototype of a system like this, but for n devices; instead of spitting out XHTML, we used our own XML to preserve structure, and then embedded markup inside it (WML, HDML, HTML, whatever). So, based on universal rules for how to paginate our XML (in your case, NITF), we could chop markup for any device down to size using one component. It was spiffy.
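As a rough illustration of Solution 2 (not the engine described above, which was written in C++ and Java), here is a minimal Python sketch using only the standard library. It assumes the stylesheet output is well-formed XHTML without a namespace declaration and that the blocks to paginate are the direct children of <body>. The function name paginate and the reserved parameter (bytes held back for fixed-size extras such as navigation links) are made up for this example.

```python
import xml.etree.ElementTree as ET


def paginate(xhtml, budget, reserved=0):
    """Split the children of <body> into pages of at most `budget` bytes.

    `reserved` bytes are subtracted from the budget for fixed-size page
    furniture. Elements are never cut in half: a block that is larger than
    the remaining limit simply gets a page of its own.
    """
    body = ET.fromstring(xhtml).find("body")
    limit = budget - reserved
    pages, current, used = [], [], 0
    for block in body:
        # Serialize the whole block (start tag, content, end tag) and
        # measure it in bytes, since bytes are what go over the wire.
        chunk = ET.tostring(block, encoding="unicode").strip()
        size = len(chunk.encode("utf-8"))
        if current and used + size > limit:
            # The block does not fit: close the page at the block boundary.
            pages.append("".join(current))
            current, used = [], 0
        current.append(chunk)
        used += size
    if current:
        pages.append("".join(current))
    return pages
```

Calling paginate(html_text, 2048, reserved=200) would return a list of markup fragments, each of which still has to be wrapped in its own <html>/<body> shell and linked to the next page. Because whole elements are moved between pages, an end tag is never separated from its start tag.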

If you can pull off solution 2, it has a bunch of advantages: 1) you can reuse your pagination engine for multiple apps, and not have to write it all into each stylesheet (I know you can simplify this by inheriting XSLT templates, but I dare anyone to do it :) ), 2) the stylesheet author (if it's not you) doesn't have to know how to paginate anything; they can just write XSLT and not worry about it, and 3) your stylesheets are cleaner, and probably don't take as long to execute (although splitting the job like this has performance implications too, as we have to reparse the XHTML, etc.). I did all this in C++ and a coworker did the same thing in Java; I don't know how easy it would be to do in a scripting environment.

Good luck, I hope this is useful, and more than that, I would love to hear about experiences other people have had with paginating in XSLT. I know that at least for mobile apps, this was concern #1, and everybody had a story on how to do it. Not being an XSLT guru, I didn't know the answer, but I figure somebody on this list might...


I can't be so coarse as to count paragraphs, since I might also have a
table (essentially an HTML table) or lists or something. Some paragraphs
will be as short as a single sentence, others will be much longer.

I also need to do some additional processing after I reach the end of the
NITF text (but the size of those will be much more rigid and simply
subtracted from the target filesize).

I had thought about doing something approximately like:

<xsl:template match="p" mode="block">
	<xsl:param name="cursize" select="0" />
	<xsl:variable name="size" select="$cursize" />
	<xsl:apply-templates select="child::node()" mode="inline">
		<!-- +7 characters for the tags -->
		<xsl:with-param name="cursize" select="$size + 7" />
	</xsl:apply-templates>
	<xsl:if test="$size &lt;= 400">
		<xsl:apply-templates select="following-sibling::p[1]" mode="block">
			<xsl:with-param name="cursize" select="$size" />
		</xsl:apply-templates>
	</xsl:if>
</xsl:template>

but clearly that isn't going to work. I also assume that making a global
variable called $size wouldn't work either.

I am getting the feeling that this isn't strictly possible with XSL. I am
using MSXML 3 so scripting might be a solution but I am loath to use it
unless I have to.

Adam van den Hoven
Internet Application Developer
Blue Zone
tel. 604.685.4310
fax. 604.685.4391
Blue Zone makes you interactive.(tm)
