RE: jade with multiple input files / ESIS data revisited

Subject: RE: jade with multiple input files / ESIS data revisited
From: "James.W Wilson" <James.W.Wilson@xxxxxxxxxxxxx>
Date: Thu, 18 Dec 1997 10:57:00 -0600
     
          I've gotten a number of (very helpful and quick, btw) responses 
     to my email asking about multiple input files and the problem of the 
     dtd size. Rather than wasting bandwidth by quoting them, let me just 
     summarize what we're trying to do and we can go from there.
     
     We have gigantic (300 meg) chunks of sgml with all sorts of complex 
     coding in them. They conform to a big dtd (which, as someone 
     suggested, is the union of many specializations of a single generic 
     model) which our particular group has no control over. However, we 
     don't want to convert everything to rtf, just certain pieces which 
     occur within certain tags. These pieces are usually small (average 
     size is < 2k, max size is maybe 300k) and are extracted from the main 
     chunk and placed into umpteen thousand files (44,000 for one 
     particular chunk) by our current process.
     
     Since dsssl is side-effect-free, I presume that we can't parse the one 
     gigantic chunk and have it output one rtf file for each little piece 
     we're interested in (this would be ideal). We could generate one 
     gigantic rtf file, of course, but how to split that into many little 
     rtf files?
     
     Failing that, we can run jade on each little piece; however, we need 
     entities and presumably element declarations from the dtd so jade can 
     correctly parse the input. The problem with including the dtd is, as 
     mentioned before, its size; per-file overhead swamps everything else 
     in our situation. 
     
     The data has a lot of end-tag minimization, and without the dtd jade 
     apparently has to guess where tags begin and end, and is often wrong. 
     I say this because if we go back and put in explicit end tags, 
     everything works fine.
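
     One way to get explicit end tags without editing by hand might be 
     to regenerate tagged text from already-parsed output. The following 
     is only a rough sketch in Python, assuming the parsed data is in the 
     line-oriented format that nsgmls writes ("(" starts an element, ")" 
     ends it, "A" gives an attribute of the next element, "-" is data). 
     The function name esis_to_tagged is made up for illustration, and 
     entity references, processing instructions and SDATA are ignored.

```python
import re
from typing import Iterable

# Escapes used in nsgmls data lines: \n (record end), \\ (backslash),
# \| (SDATA bracket, dropped here), \nnn (octal character code).
_ESC = re.compile(r"\\(n|\\|\||[0-7]{3})")


def unescape(data: str) -> str:
    def repl(match: "re.Match[str]") -> str:
        tok = match.group(1)
        if tok == "n":
            return "\n"
        if tok == "\\":
            return "\\"
        if tok == "|":
            return ""
        return chr(int(tok, 8))

    return _ESC.sub(repl, data)


def escape(text: str) -> str:
    # Numeric character references need no entity declarations.
    return text.replace("&", "&#38;").replace("<", "&#60;")


def esis_to_tagged(lines: Iterable[str]) -> str:
    """Hypothetical helper: rebuild fully tagged text from ESIS lines."""
    out = []
    attrs = []  # attributes waiting for the next start tag
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            name, kind, *value = rest.split(" ", 2)
            if kind != "IMPLIED":
                val = unescape(value[0]) if value else ""
                attrs.append((name, escape(val).replace('"', "&#34;")))
        elif code == "(":
            attr_text = "".join(f' {n}="{v}"' for n, v in attrs)
            out.append(f"<{rest}{attr_text}>")
            attrs = []
        elif code == ")":
            out.append(f"</{rest}>")  # every end tag is written out
        elif code == "-":
            out.append(escape(unescape(rest)))
        # all other line types are skipped in this sketch
    return "".join(out)
```

     Whether jade is happy with the result without any dtd is something 
     we would still have to test on real data.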
     
     I think we might be able to work around the guessing problem by 
     writing the style sheet carefully. If we just include the entities 
     file and the style sheet doesn't demand 100% accuracy from jade's 
     hierarchy guesses, maybe things will work out.
     
     Another option would be to write our own mini-dtd which would contain 
     only the tags we need. However, since the chunks in question are not 
     really a totally discrete part of the main dtd, it's not clear that 
     this mini-dtd would actually be so mini. Also, since the main dtd is 
     constantly under revision we'd be constantly playing catch-up to keep 
     the mini-dtd in sync.
     
     As for ESIS input, we already convert the big chunks to ESIS in the 
     course of our process; if jade accepted ESIS data, we could just split 
     out the chunks we need and they'd already be fully parsed so the dtd 
     would be irrelevant. Presumably this would be pretty fast. How much 
     work would it be to add this functionality? If it's not too 
     unreasonable, we could do it ourselves.
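
     To make the "split out the chunks we need" step concrete, here is a 
     small Python sketch. It assumes the ESIS is in the nsgmls line 
     format, where the "A" attribute lines for an element come just 
     before its "(" line. The function name split_esis and the element 
     name PIECE used below are placeholders, not names from our dtd.

```python
from typing import Iterable, Iterator, List


def split_esis(lines: Iterable[str], target_gi: str) -> Iterator[List[str]]:
    """Yield one list of ESIS lines per outermost target_gi element."""
    target = target_gi.upper()
    depth = 0                      # nesting depth of target elements
    chunk: List[str] = []
    pending_attrs: List[str] = []  # "A" lines seen before the next "("
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if depth == 0:
            if code == "A":
                pending_attrs.append(line)
            elif code == "(":
                if rest.upper() == target:
                    # Start a new chunk, keeping the element's attributes.
                    chunk = pending_attrs + [line]
                    depth = 1
                pending_attrs = []
            continue
        chunk.append(line)
        if code == "(" and rest.upper() == target:
            depth += 1
        elif code == ")" and rest.upper() == target:
            depth -= 1
            if depth == 0:
                yield chunk
                chunk = []
```

     Each yielded chunk could then be written to its own file, or piped 
     straight into whatever ends up reading ESIS, so the 300 meg input 
     is only read once and never held in memory as a whole.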

     James


 DSSSList info and archive:  http://www.mulberrytech.com/dsssl/dssslist

