RE: sgml-parse and GC

Subject: RE: sgml-parse and GC
From: "Didier PH Martin" <martind@xxxxxxxxxxxxx>
Date: Thu, 22 Jul 1999 06:07:55 -0400
Hi Peter,

> Didier says:
> About grove caching, I am not so sure that keeping a grove is a good
> thing.
> For example, the main grove (i.e. the grove created for the processed
> document) could be released as soon as the (process-children) procedure is
> finished on the root. Same thing for a grove returned from an sgml-parse

Peter said:
By then the whole FOT is constructed and all DSSSL processing is done, so
this won't buy you much.

Didier says:
On the contrary, this does give you something. Again, take the example where
you want to process a collection of documents. All the documents are listed
in an SGML/XML document, as below:

<collection>
<document href="c:/mydir/mydoc1.sgm">
<document href="http://www.netfolder.com/mydoc2.xml">
</collection>

Then a DSSSL script processes this source document and contains a rule for
the document element such as:

(element document
	; parse the referenced document and process the resulting grove
	(process-node-list
		(sgml-parse (attribute-string "href" (current-node)))))

Then a thread can be set to process this new grove (as an autonomous entity)
until the current-root and all its children are processed. For each document
element we would have a separate grove, and each grove would be processed in
a separate thread. This way, a batch job to process a collection of documents
could be expressed as an SGML document itself instead of a platform-dependent
batch file. For example, if I create a "bat" file on Windows, it won't work
on Linux. But if the batch processing file is expressed in SGML or XML, it is
portable to any platform running OpenJade.

If the target machine does not have a lot of memory, then instead of
processing each grove in its own thread, the groves could be processed in
the main thread and so, de facto, one at a time. This implies that only two
groves are present in memory at the same time:
a) the source grove
b) the independent grove

If the source grove is relatively small (it contains only the list of
documents to be processed), then most of the memory is left for the
independent grove. So yes, we gained something: a platform-independent
expression language for DSSSL batch processing. Better than that, the
expression language for the batch processing is an SGML application!
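
The one-at-a-time strategy can be sketched outside DSSSL. The following is
only an illustration in Python, not OpenJade code: parse_grove and process
are hypothetical stand-ins for sgml-parse and process-node-list, and the
collection is written as well-formed XML so the standard library can read it.

```python
import xml.etree.ElementTree as ET

# The small "source grove": a collection document listing the real inputs.
COLLECTION = """<collection>
  <document href="c:/mydir/mydoc1.sgm"/>
  <document href="http://www.netfolder.com/mydoc2.xml"/>
</collection>"""


def parse_grove(href):
    # Hypothetical stand-in for sgml-parse: a real implementation would
    # parse the document at href and return its (possibly large) grove.
    return {"href": href, "nodes": ["root"]}


def process(grove):
    # Hypothetical stand-in for process-node-list on the grove's root.
    return "processed " + grove["href"]


def run_batch(collection_xml):
    source = ET.fromstring(collection_xml)  # stays in memory throughout
    results = []
    for doc in source.iter("document"):
        grove = parse_grove(doc.get("href"))  # the one independent grove
        results.append(process(grove))
        del grove  # drop the only reference before loading the next one
    return results
```

At any moment only the source tree and at most one independent grove are
referenced, which is the memory profile described above.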

> where the process is finished when the (process-node-list) on the grove's
> root is finished. In both cases, the grove could be released because in
> both ways, the FOT is completed because the processing on the root element
> is completed (and therefore for all its children). Then a default condition

Peter said:
Since the values for the characteristics of the resulting sosofo are not
evaluated immediately, this is not quite correct. The FOT has to be built
so that all characteristics expressions can be evaluated before removing
the grove from memory (because current-node might be used in a
characteristic specification). Since flow objects might "bubble up" in the
resulting FOT (if they are labeled and the label doesn't correspond to any
port in the content-map for the constructed sosofo), a characteristic
might be evaluated above the constructed flow object in the end. I think
the refcounters will solve these problems, but then if you don't cache the
nodes you might get two (or more) versions of the same grove in memory at
the same time if you call sgml-parse in more than one place.

There is another problem if you remove the groves. What should
(node-list=? (sgml-parse foo) (sgml-parse foo))
return? If it has to return true, isn't that hard to implement if you don't
keep the groves?

Didier says:
Good points. Then we probably need to introduce an explicit construct that
states that the grove is released as soon as process-children is done on
the whole grove. The script writer knows the processing context. So maybe a
construct like (process-and-release-node-list) or something similar would do
the job. I agree that the current implementation has limitations on this
side, and that platform-independent batch processing expressed as an SGML
application cannot realistically be done in the current implementation. A
new construct like (process-and-release-node-list) would avoid the situation
you described.

Thus the DSSSL expression stated earlier would now be:

(element document
	; proposed (hypothetical) construct: process the grove, then release it
	(process-and-release-node-list
		(sgml-parse (attribute-string "href" (current-node)))))

In this case, as soon as the node-list is processed, it is released. This
is, naturally, only one of the ways to implement it. By discussing it, we
may find a better construct (this one has, at least, the advantage of being
explicit and self-explanatory). I know, the original language conception is
based on infinite resources, infinite time, infinite... But we, as mortals,
have to live in limited worlds, and sometimes our expression tools should
reflect this world full of limitations. Usually, when that is the case,
something useful emerges.


> Peter said:
> BTW, the nodes in the grove have to stay accessible until the FOT is
> built. This I think is true for all nodes resulting in something in the
> FOT. See FOTBuilder::startNode()/endNode().
>
> So my conclusion is that you'll need a lot of virtual memory (or other
> storage for the groves) to process large documents. I don't see how to
> make this different. (Of course you may have the groves in a database.)
>
> Didier says:
> The grove has to be present as long as the processing is not completed for
> the root node and therefore not until all its children are processed.
> Thus, at least two groves could be present at a time:
> a) the source document grove
> b) the sgml-parse resultant grove.
>
> Of course, some scripts may lead to a situation where more than two
> groves are simultaneously present and then would require a lot of virtual
> memory (and then cause swapping). Speaking of swapping, it depends a lot
> on how we

Peter said:
If you want to process large documents and want to be able to navigate
arbitrarily through them (which DSSSL requires), then you will need a lot
of memory. How else would it be? Maybe the grove implementation could be
optimized better for memory, but I don't think so since James Clark
probably spent a lot of effort in this critical area.

Didier says:
Don't forget, Peter, that we are now defining the future of DSSSL. With this
in mind, we can add new useful constructs and then bring them to ISO as a
draft, but this time with practical implementation and experience behind us.
A new construct could then be included, enabled with the -2 flag (for
DSSSL-2), and used as an experimental future DSSSL standard construct. Yes,
James put in a lot of effort, but so do we (the OpenJade team), and in our
case we want a future for OpenJade and, better than that, a new DSSSL-2
international standard that includes what we learned from the praxis.

regards
Didier PH Martin
mailto:martind@xxxxxxxxxxxxx
http://www.netfolder.com


 DSSSList info and archive:  http://www.mulberrytech.com/dsssl/dssslist

