Summary of DocBook HTML speedup improvements

Subject: Summary of DocBook HTML speedup improvements
From: Norman Walsh <norm@xxxxxxxxxxxxx>
Date: Wed, 1 Apr 1998 07:13:34 -0500

Hello world,

James suggested that I post an explanation of the changes I made
to the DocBook HTML stylesheet to improve its performance, so
here's a quick summary.

One of the major differences between the print and HTML
stylesheets is that the HTML stylesheet chunks the content up
into pieces.  The stylesheet then has to construct links between
all the chunks.

The slowest part of this process is calculating the filename
that will be used for any given chunk.  

The algorithm used to be that the name of a chunk was a mnemonic
for the kind of chunk it was ("c" for chapter, "a" for
appendix, etc.) followed by the (element-number) of the chunk.
In the case of SECT1 chunks, the (element-number) of the sect1
was appended to the filename calculated for its parent.

So, for example, the base filename for the second preface was
"f02".  The base filename for the third section of the second
chapter was "c0203".  The fourth section in the first appendix
was "a0104", etc.
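
The old naming scheme can be sketched in Python as follows.  This is
only a model of the logic described above, not the stylesheet's DSSSL
code.  The Node class and the sibling_number helper are hypothetical;
sibling_number counts same-named siblings purely so that the output
matches the examples in this post, and it is not meant as a definition
of DSSSL's (element-number).

```python
class Node:
    """Toy stand-in for an element in the grove (hypothetical)."""
    def __init__(self, gi):
        self.gi = gi            # element name, e.g. "chapter"
        self.parent = None
        self.children = []

    def add(self, gi):
        child = Node(gi)
        child.parent = self
        self.children.append(child)
        return child

# Mnemonics taken from the examples in the post.
MNEMONIC = {"preface": "f", "chapter": "c", "appendix": "a"}

def sibling_number(node):
    """1-based position of node among siblings with the same name.

    Assumption: a simplified stand-in for the number the old
    algorithm used; note that it has to walk earlier siblings.
    """
    count = 0
    for sib in node.parent.children:
        if sib.gi == node.gi:
            count += 1
        if sib is node:
            return count
    raise ValueError("node is not a child of its parent")

def old_base_filename(node):
    """Mnemonic plus two-digit number; sect1 appends to its parent's name."""
    number = "%02d" % sibling_number(node)
    if node.gi == "sect1":
        return old_base_filename(node.parent) + number
    return MNEMONIC[node.gi] + number

# Build a small book: two prefaces, two chapters, one appendix.
book = Node("book")
book.add("preface")
preface2 = book.add("preface")
book.add("chapter")
chapter2 = book.add("chapter")
chapter2_sections = [chapter2.add("sect1") for _ in range(3)]
appendix1 = book.add("appendix")
appendix1_sections = [appendix1.add("sect1") for _ in range(4)]

print(old_base_filename(preface2))               # f02
print(old_base_filename(chapter2_sections[2]))   # c0203
print(old_base_filename(appendix1_sections[3]))  # a0104
```

Every call recounts earlier elements, which hints at why repeating
this for every link target adds up on a large document.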

It turns out that calculating (element-number) is, relatively
speaking, very slow.  This problem is exacerbated by the fact
that filenames have to be calculated not only for navigation,
but also for every xref or link.  

The solution was to use the (all-element-number) function
instead of (element-number).  (all-element-number) is an
extension supported by Jade; it efficiently returns the number of
the node among all of the elements in the grove.

From the point of view of filenames, this has the disadvantage
that there's no longer any way to make the filenames meaningful.
But that seems like a small price to pay for a performance
improvement of at least a factor of five.

The new algorithm used to calculate filenames is to append the
(all-element-number) of the element to a mnemonic for the type
of chunk that it is.
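
A rough Python model of the new scheme is shown below.  It assumes
that every element is numbered once, in document order, and that the
filename is then a simple lookup; how Jade actually implements
(all-element-number) is not described here.  The Node class, the "s"
mnemonic for sect1, and the unpadded number format are all assumptions
made for the sketch.

```python
class Node:
    """Toy stand-in for an element in the grove (hypothetical)."""
    def __init__(self, gi, children=()):
        self.gi = gi
        self.children = list(children)

# "c" and "a" come from the post; "s" for sect1 is an assumption.
MNEMONIC = {"book": "book", "preface": "f", "chapter": "c",
            "appendix": "a", "sect1": "s"}

def number_all_elements(root):
    """Number every element once, in document (pre-order) order."""
    numbers = {}
    stack = [root]
    while stack:
        node = stack.pop()
        numbers[id(node)] = len(numbers) + 1
        # Push children reversed so the first child is visited next.
        stack.extend(reversed(node.children))
    return numbers

def new_base_filename(node, numbers):
    """Mnemonic for the chunk type plus the element's global number."""
    return MNEMONIC[node.gi] + str(numbers[id(node)])

sect_a = Node("sect1")
sect_b = Node("sect1")
chapter = Node("chapter", [sect_a, sect_b])
appendix = Node("appendix")
book = Node("book", [Node("preface"), chapter, appendix])

numbers = number_all_elements(book)
print(new_base_filename(chapter, numbers))   # c3
print(new_base_filename(sect_b, numbers))    # s5
print(new_base_filename(appendix, numbers))  # a6
```

The names no longer say "second chapter, third section", which is
the loss of meaning mentioned above.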

Another wrinkle in filename calculation is that the stylesheet
supports PIs to specify the desired filename.  This way you can
change the filename of the root element from "book01.htm" to
"index.html".

In order to find the PIs, I was effectively doing a loop over
every child node of the component-level elements.  James pointed
out that if this loop ran over mixed content, it would be very
inefficient.  (While DSSSL specifies that every character is a
node in the grove, Jade treats the character data in a much more
efficient manner _unless_ a DSSSL expression requires it to
access individual characters.)  In the case of DocBook, I don't 
think that my loop ever actually ran over mixed content, but the
solution is worth noting:

To find PIs, rather than using a loop over the children of a
node, loop over (select-by-class (children node) 'pi).  The
select-by-class function efficiently filters out the PIs.
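
The shape of that change can be sketched in Python.  The Child class,
the node-class strings, and the "html-filename" PI name are
hypothetical and chosen only for illustration; the stylesheet's real
PI syntax is not given in this post.  A Python list comprehension
still visits every child, so this shows only the structure of the code
(filter by class first, then inspect the PIs), not Jade's performance.

```python
class Child:
    """Toy child node with a node class and optional PI data."""
    def __init__(self, node_class, data=""):
        self.node_class = node_class   # e.g. "element", "pi"
        self.data = data               # system data of a PI

def select_by_class(nodes, node_class):
    """Model of select-by-class: keep only nodes of the given class."""
    return [n for n in nodes if n.node_class == node_class]

def pi_filename(children, target="html-filename"):
    """Return the filename requested by a PI, or None if there is none.

    Assumption: the PI data looks like "<target> <filename>".
    """
    for pi in select_by_class(children, "pi"):
        name, _, value = pi.data.partition(" ")
        if name == target and value.strip():
            return value.strip()
    return None

children = [
    Child("element"),
    Child("pi", "html-filename index.html"),
    Child("element"),
]

print(pi_filename(children))            # index.html
print(pi_filename([Child("element")]))  # None
```

When no PI is found, a caller would fall back to the computed name.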

Those two changes, particularly the use of all-element-number, have
made the DocBook HTML stylesheet much more useful for large documents.

Thanks, James!

--norm


 DSSSList info and archive:  http://www.mulberrytech.com/dsssl/dssslist

