Subject: RE: index generation
From: "Steffen Heinrich" <heinrich@xxxxxxxxxxxx>
Date: Mon, 5 Jul 1999 14:56:09 +0100
> From: "Didier PH Martin" <martind@xxxxxxxxxxxxx>
> Subject: RE: (Fwd) RE:jade batch mode, index generation
... 
> indexes you are talking about. Is this relational database indexes
> (seems to if you refer to mbd files but I am not so sure).
> 
> Is the goal to produce data base entries from SGML/XML elements?
> 
> Can you clear that up, please, I am in limbo
....

Hi Didier, 

sorry that I failed to express more clearly what I meant by 'index'. 
(I have been too consumed by this stuff lately.)

No, these are not database indexes (as are generated for general 
performance gain in database environments) but word pointers to 
documents in a compact binary format, the use of which I will 
describe in the following. 
(I'm rather sure there are some people on the list who have delved 
deeper into this subject than I have and probably know much more 
about the background. Please correct me where I'm wrong.) 

I do not know the conversion process that Sean writes about, but I 
understand that their applications write the 'word info' (_which_ 
words are to be found _where_ in _which documents_) to a binary file 
in MS Access (.mdb) format. This is not compatible with, but 
comparable to, the Berkeley DB format that I am using.  
If you are doing it all yourself, you may think up your own 
information format (the way that occurrences are coded for each word 
contained in the index) and write it to a file organized in any way 
you choose (this affects how efficiently the word info for a specific 
word can be retrieved). The latter need not be reinvented, 
since efficient lookup from persistent data files is an ever-recurring 
task in data processing and there are dozens of formats 
with hundreds of APIs for any programming language. 
In my case I decided on a modified 'word-info' format, 
which I deciphered from Oracle's retrieval cartridge (Context), 
and on the Berkeley DB format for data storage (source code 
available from sleepycat.com).
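
To make the idea of a 'word-info' record concrete, here is a minimal 
sketch in Python. It is not the Oracle format and not my own modified 
one: the fixed-width encoding (two 32-bit integers per occurrence), 
the naive tokenizer and the names build_index and lookup are all 
assumptions made for illustration. A plain dict stands in for the 
keyed data file.

```python
import struct
from collections import defaultdict


def tokenize(text):
    """Split text into upper-cased words; a deliberately naive tokenizer."""
    return [w.upper() for w in text.split()]


def build_index(docs):
    """Map each word to its packed (doc_id, position) occurrences.

    docs: dict of doc_id (int) -> text. The returned dict of bytes keys
    and bytes values stands in for the key/value pairs one would write
    to a keyed file (assumption: any keyed store would do here).
    """
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            postings[word].append((doc_id, pos))
    # Pack every occurrence as two unsigned 32-bit big-endian integers.
    return {
        word.encode("utf-8"): b"".join(struct.pack(">II", d, p) for d, p in plist)
        for word, plist in postings.items()
    }


def lookup(index, word):
    """Decode the packed occurrences for one word; empty list if absent."""
    blob = index.get(word.upper().encode("utf-8"), b"")
    return [struct.unpack_from(">II", blob, off) for off in range(0, len(blob), 8)]


docs = {1: "maxillofacial surgery review", 2: "oral and maxillofacial care"}
index = build_index(docs)
```

Calling lookup(index, "maxillofacial") decodes the stored record back 
into (document, word position) pairs. A real format would compress 
these numbers much harder; the point is only that one key fetch yields 
all occurrences of a word.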

An example of how the need for a search application may arise: 
My former company (I quit recently :-))) re-published formerly 
printed journals on CD-ROM and on the web, both versions in HTML. 
(The transformation of the articles went smoothly once they had been 
manually converted to our SGML.)

Now, we needed to provide document retrieval facilities for 
distribution with these archive products. Based on Tim Kientzle's Java 
search for 'Dr. Dobb's Journal' (http://www.ddj.com/), I decided to 
adopt for it the way Oracle's full-text retrieval cartridge stores 
word information, which is very compact and efficient. This format 
also allows retaining so-called 'user defined sections' information, 
that is, the positions of selected start- and end-tags within the 
parsed SGML documents. 
All this information is then written to a binary-tree DB_file. This 
data is commonly called an 'index', although it has nothing to do with 
MS Access or with the concepts of RDBMSes. It's just a structured 
binary file that stores information which applications can access 
very efficiently. For example, you can read very quickly 
whether, and which, information has been stored for the word 
'MAXILLOFACIAL', or get the start and end of all 
<PRODUCT>...</PRODUCT> sections in every document. So the application 
can easily figure out those 'hits' that fall into any of the sections 
that a user has chosen to restrict the search to. 
The index file, once it is generated, can be accessed by users via 
a Perl script (to be used in a web server environment) and via Java 
classes (from within a browser in a local environment on PC, Mac, 
Unix). My search frontends offer the choice of AND/OR combination 
for multiple search words as well as end truncation of search 
words ('maxillo*'). 
I will be happy to answer any further questions concerning the 
search concepts. 
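
The lookup side described above (end truncation, AND/OR combination, 
restriction to sections) can be sketched as follows. This is a rough 
illustration in Python, not the code of my Perl or Java frontends. The 
in-memory layout is an assumption: words maps a word to a list of 
(doc_id, position) pairs, and spans is a list of (doc_id, start, end) 
entries for one section type, with positions counted in words.

```python
from bisect import bisect_left


def match_terms(words, term):
    """Return the index words matching a term; a trailing '*' is a prefix match."""
    term = term.upper()
    if not term.endswith("*"):
        return [term] if term in words else []
    prefix = term[:-1]
    keys = sorted(words)  # a tree-ordered file would already keep keys sorted
    matched = []
    # Jump to the first key >= prefix, then walk forward while keys still match.
    for key in keys[bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break
        matched.append(key)
    return matched


def hits_for(words, term, spans=None):
    """Collect doc ids with a hit for term, optionally only inside given spans."""
    found = set()
    for word in match_terms(words, term):
        for doc_id, pos in words[word]:
            if spans is None or any(
                d == doc_id and start <= pos <= end for d, start, end in spans
            ):
                found.add(doc_id)
    return found


def search(words, terms, mode="AND", spans=None):
    """Combine per-term document sets: AND is intersection, anything else is union."""
    sets = [hits_for(words, t, spans) for t in terms]
    if not sets:
        return set()
    return set.intersection(*sets) if mode == "AND" else set.union(*sets)


WORDS = {
    "MAXILLOFACIAL": [(1, 0), (2, 5)],
    "MAXILLA": [(3, 1)],
    "SURGERY": [(1, 1), (3, 2)],
}
PRODUCT_SPANS = [(2, 4, 9)]  # hypothetical: one <PRODUCT> section in document 2
```

With this data, search(WORDS, ["maxillo*", "surgery"]) finds only 
document 1, while passing spans=PRODUCT_SPANS with the single term 
'maxillo*' keeps only the hit in document 2. Because the keys are kept 
in sorted order, end truncation costs one seek plus a short forward 
scan, which is why a tree-organized file suits this kind of search.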

The reason for bringing the subject up in this forum was the 
following thoughts: 
- Jade is often used to produce output that will be delivered 
on electronic media. 
- In this case, or in addition to the print format, a means of 
convenient document retrieval is desirable and should be based on 
the underlying structured SGML information for maximum usefulness. 
- Jade users presumably take very different approaches and 
consequently have to follow significant detours to accomplish a 
similar task. (Read the message from Sean Hennesy.) 
- Jade could possibly boost its popularity if a simple switch 
provided an index database which can be used by free and 
easy-to-configure frontends. 

I am fully aware that the standard's query language follows its own 
ideas for the implementation of search facilities (let alone HyTime). 
Yet I haven't seen a practical implementation and hence can't help 
feeling that a simple and less academic search on static document 
pools (no changes to the documents after they have been indexed) could 
advertise the virtues of structured information beyond its current 
heralds and following. 

Steffen
 
---------
steffen heinrich, berlin, germany
"When you're chewing on life's gristle 
Don't grumble, give a whistle
And DSSSL helps things turn out for the best..."
(Monty Python overheard)


 DSSSList info and archive:  http://www.mulberrytech.com/dsssl/dssslist