[xsl] Splitting an XML file based on size

Subject: [xsl] Splitting an XML file based on size
From: Adam Van Den Hoven <Adam.Hoven@xxxxxxxxxxxx>
Date: Tue, 3 Apr 2001 15:50:04 -0700
Hey guys, 

I'm processing an NITF file into HTML. NITF is very much like HTML in that
it has a body with paragraph tags that have mixed content. The HTML that I am
creating from my transforms can quickly become several tens of KB in size.
Since I'm transferring this over a wireless modem to a PocketPC at a maximum
of 14.4 kbps, an HTML file that is 15 KB is entirely too big. 

I need some way to keep track of the number of characters I've processed and
stop once I reach a specific size, finishing at the end of the paragraph I am
in at that point. I understand that counting characters is not very precise,
but I am only interested in getting the transfer size below 2 KB or so. 

As an example, I might have the following NITF code:

<nitf baselang="en.ca">
   <head><!-- Header Metadata here --></head>
   <body>
      <body.head><!-- Body head stuff here --></body.head>
      <body.content>
         <p>
            Lorem ipsum dolor sit amet, 
            <em>consectetuer adipiscing elit, sed diem</em>
             nonummy nibh euismod tincidunt ut lacreet dolore magna aliguam
erat volutpat. 
         </p>
         <p>
            Lorem ipsum 
            <q>dolor sit amet, consectetuer adipiscing elit,</q>
             sed diem nonummy nibh euismod tincidunt ut lacreet dolore magna
aliguam erat volutpat. 
         </p>
         <p>
            Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed
diem 
            <em>nonummy nibh euismod </em>
            tincidunt ut lacreet dolore magna aliguam erat volutpat. 
         </p>
         <p>
            Lorem ipsum dolor sit amet, 
            <em>consectetuer adipiscing elit, </em>
            sed diem nonummy nibh euismod tincidunt ut lacreet dolore magna
aliguam erat volutpat. 
         </p>
         <p>
            Lorem ipsum dolor sit amet, 
            <q>consectetuer adipiscing elit,</q>
             sed diem nonummy nibh euismod tincidunt ut lacreet dolore magna
aliguam erat volutpat. 
         </p>
         <p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed
diem nonummy nibh euismod tincidunt ut lacreet dolore magna aliguam erat
volutpat. </p>
      </body.content>
      <body.end><!-- tagline here --></body.end>
   </body>
</nitf>

The text there happens to be nearly 500 characters. Let's say that my target
size is 375 characters. That should be the "o" in "euismod" in the third <p>
tag. Normally I would create:
<html>
   <head><!-- Header Metadata here --></head>
   <body>
         <p>
            Lorem ipsum dolor sit amet, 
            <em>consectetuer adipiscing elit, sed diem</em>
             nonummy nibh euismod tincidunt ut lacreet dolore magna aliguam
erat volutpat. 
         </p>
         <p>
            Lorem ipsum 
            <q>dolor sit amet, consectetuer adipiscing elit,</q>
             sed diem nonummy nibh euismod tincidunt ut lacreet dolore magna
aliguam erat volutpat. 
         </p>
         <p>
            Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed
diem 
            <em>nonummy nibh euismod </em>
            tincidunt ut lacreet dolore magna aliguam erat volutpat. 
         </p>
         <p>
            Lorem ipsum dolor sit amet, 
            <em>consectetuer adipiscing elit, </em>
            sed diem nonummy nibh euismod tincidunt ut lacreet dolore magna
aliguam erat volutpat. 
         </p>
         <p>
            Lorem ipsum dolor sit amet, 
            <q>consectetuer adipiscing elit,</q>
             sed diem nonummy nibh euismod tincidunt ut lacreet dolore magna
aliguam erat volutpat. 
         </p>
         <p>Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed
diem nonummy nibh euismod tincidunt ut lacreet dolore magna aliguam erat
volutpat. </p>
   </body>
</html>

but what I want to create is:

<html>
   <head><!-- Header Metadata here --></head>
   <body>
         <p>
            Lorem ipsum dolor sit amet, 
            <em>consectetuer adipiscing elit, sed diem</em>
             nonummy nibh euismod tincidunt ut lacreet dolore magna aliguam
erat volutpat. 
         </p>
         <p>
            Lorem ipsum 
            <q>dolor sit amet, consectetuer adipiscing elit,</q>
             sed diem nonummy nibh euismod tincidunt ut lacreet dolore magna
aliguam erat volutpat. 
         </p>
         <p>
            Lorem ipsum dolor sit amet, consectetuer adipiscing elit, sed
diem 
            <em>nonummy nibh euismod </em>
            tincidunt ut lacreet dolore magna aliguam erat volutpat. 
         </p>
         <p><a href="someURL">View Entire story</a></p>
   </body>
</html>

> I can't be so coarse as counting paragraphs, since I might also have a
> table (essentially an HTML table) or lists or something. Some paragraphs
> will be as short as a single sentence, others will be much longer. 
> 
> I also need to do some additional processing after I reach the end of the
> NITF text (but the size of those will be much more rigid and simply
> subtracted from the target filesize). 
> 
> I had thought about doing something approximately like:
> 
> <xsl:template match="p" mode="block">
> 	<xsl:param name="cursize" select="0" />
> 	<xsl:variable name="size" select="$cursize" />
> 	<p>
> 		<xsl:apply-templates select="child::node()" mode="inline">
> 			<!-- +7 characters for the tags -->
> 			<xsl:with-param name="cursize" select="$size + 7" />
> 		</xsl:apply-templates>
> 	</p>
> 	<xsl:if test="$size &lt;= 400">
> 		<xsl:apply-templates select="following-sibling::p[1]" mode="block">
> 			<xsl:with-param name="cursize" select="$size" />
> 		</xsl:apply-templates>
> 	</xsl:if>
> </xsl:template>
> 
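> but clearly that isn't going to work: $size never picks up the characters
> emitted by the inline templates, so the test always sees the starting count.
> I also assume that making a global variable called $size wouldn't work
> either, since XSLT variables can't be changed once they are set.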
> 
> I am getting the feeling that this isn't strictly possible with XSL. I am
> using MSXML 3 so scripting might be a solution but I am loath to use it
> unless I have to. 
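
For what it's worth, a running count can be carried forward in plain XSLT 1.0
without scripting, by walking the blocks one sibling at a time and passing the
total along as a parameter. The following is only a minimal sketch under some
assumptions that are not in the post above: the budget is a stylesheet
parameter with the hypothetical name "limit", size is measured as the
whitespace-collapsed text length of each block (markup is not counted), and
xsl:copy-of stands in for the real NITF-to-HTML inline templates.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: the character budget. 375 is the target
       used in the example in this post. -->
  <xsl:param name="limit" select="375"/>

  <xsl:template match="/nitf">
    <html>
      <head/>
      <body>
        <!-- Start the walk at the first block only; every later block is
             reached through the recursion in the template below. -->
        <xsl:apply-templates select="body/body.content/*[1]" mode="block"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="*" mode="block">
    <xsl:param name="cursize" select="0"/>
    <!-- Running total after this block: text only, whitespace collapsed. -->
    <xsl:variable name="size"
                  select="$cursize + string-length(normalize-space(.))"/>
    <!-- Assumption: a plain copy stands in for the real block and inline
         templates. The block is always emitted whole. -->
    <xsl:copy-of select="."/>
    <xsl:choose>
      <!-- Last block: nothing was cut off, so no link is needed. -->
      <xsl:when test="not(following-sibling::*)"/>
      <!-- Still under budget: move on to the next block, handing it the
           updated total. -->
      <xsl:when test="$size &lt; $limit">
        <xsl:apply-templates select="following-sibling::*[1]" mode="block">
          <xsl:with-param name="cursize" select="$size"/>
        </xsl:apply-templates>
      </xsl:when>
      <!-- Budget reached and blocks remain: stop and link to the rest. -->
      <xsl:otherwise>
        <p><a href="someURL">View Entire story</a></p>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Because each block decides whether its following sibling is processed, the
total never has to be stored in a mutable variable. Matching "*" rather than
"p" lets tables and lists take part in the same count. With the sample
document above and a limit of 375, this should stop after the third paragraph,
since that is the block in which the total crosses the limit. Any fixed-size
trailing content can be allowed for by lowering the limit accordingly.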
> 
> Adam van den Hoven
> Internet Application Developer
> Blue Zone
> tel. 604.685.4310
> fax. 604.685.4391
> Blue Zone makes you interactive.(tm) http://www.bluezone.net/
> 

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

