RE: index generation

Subject: RE: index generation
From: "Didier PH Martin" <martind@xxxxxxxxxxxxx>
Date: Sat, 3 Jul 1999 21:21:35 -0400
Hi Steffen

This time I got it. Basically, you need an inverted index file in which
keywords point to documents or document fragments.
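
As a minimal illustration of the inverted index idea, the sketch below
maps each word to the documents that contain it. The function name,
the document ids and the sample texts are made up for this example;
splitting on whitespace stands in for real SGML-aware tokenizing.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased word to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


# Hypothetical sample documents, keyed by an arbitrary document id.
docs = {
    "doc1": "maxillofacial surgery journal",
    "doc2": "journal of structured documents",
}
index = build_inverted_index(docs)
```

Looking up index["journal"] then yields both document ids, while a word
that occurs nowhere is simply absent from the mapping.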

My guess is that, if we could encapsulate a set of procedures and flow
objects in an external library to index documents or provide access to an
index, this could indeed be very useful. I'll include this requirement in
my notes.

Thank you Steffen for your suggestion and comments
regards
Didier PH Martin
mailto:martind@xxxxxxxxxxxxx
http://www.netfolder.com
-----Original Message-----
From: owner-dssslist@xxxxxxxxxxxxxxxx
[mailto:owner-dssslist@xxxxxxxxxxxxxxxx]On Behalf Of Steffen Heinrich
Sent: Monday, July 05, 1999 9:56 AM
To: dssslist@xxxxxxxxxxxxxxxx
Subject: RE: index generation


> From: "Didier PH Martin" <martind@xxxxxxxxxxxxx>
> Subject: RE: (Fwd) RE:jade batch mode, index generation
...
> indexes you are talking about. Is this relational database indexes
> (seems to if you refer to mbd files but I am not so sure).
>
> Is the goal to produce data base entries from SGML/XML elements?
>
> Can you clear that up, please, I am in limbo
....

Hi Didier,

sorry that I failed to express more clearly what I meant by 'index'.
(I have been too consumed by this stuff lately.)

No, I don't mean database indexes (as generated for general
performance gains in database environments) but word pointers to
documents, stored in a compact binary format whose use I describe
below.
(I'm fairly sure there are people on the list who have delved deeper
into this subject than I have and probably know much more about the
background. Please correct me where I'm wrong.)

I do not know the conversion process that Sean writes about, but I
understand that their applications write the 'word info' (_which_
words are to be found _where_ in _which documents_) to a binary file
in MS Access (.mdb) format. This is not compatible with, but
comparable to, the Berkeley DB format that I am using.
If you are doing it all yourself, you may think up your own
information format (the way that occurrences are coded for each word
contained in the index) and write it to a file organized in any way
you choose (this affects how, and how efficiently, the word info for
a specific word can be retrieved). The latter need not be reinvented,
since efficient lookup from persistent data files is an ever-recurring
task in data processing and there are dozens of formats
with hundreds of APIs for any programming language.
In my case I decided on the use of a modified 'word-info' format,
which I deciphered from Oracle's retrieval cartridge (Context),
and on the Berkeley DB format for data storage (source code
available from sleepycat.com).
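
To make the split between 'word-info' format and storage format more
concrete, here is a rough sketch. Everything in it is an assumption
made for illustration: the record layout (pairs of document number and
word position as 32-bit integers) is NOT the format deciphered from
the Oracle cartridge, and Python's dbm.dumb module merely stands in
for a Berkeley DB file as a persistent key/value store.

```python
import dbm.dumb
import os
import struct
import tempfile

# Hypothetical record layout: (document number, word position),
# two unsigned 32-bit big-endian integers per occurrence.
PAIR = struct.Struct(">II")


def encode_postings(postings):
    """Pack the occurrences of one word into a compact byte string."""
    return b"".join(PAIR.pack(doc, pos) for doc, pos in sorted(postings))


def decode_postings(blob):
    """Unpack a byte string written by encode_postings."""
    return [PAIR.unpack_from(blob, off) for off in range(0, len(blob), PAIR.size)]


def write_index(path, word_info):
    """Write one key/value pair per word to a new persistent store."""
    with dbm.dumb.open(path, "n") as db:
        for word, postings in word_info.items():
            db[word.encode("utf-8")] = encode_postings(postings)


def lookup(path, word):
    """Return the stored occurrences of a word, or an empty list."""
    with dbm.dumb.open(path, "r") as db:
        key = word.encode("utf-8")
        return decode_postings(db[key]) if key in db else []


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wordinfo")
    write_index(path, {"MAXILLOFACIAL": [(2, 17), (1, 4)]})
    hits = lookup(path, "MAXILLOFACIAL")
    misses = lookup(path, "DSSSL")
```

The point of the separation is that the encoding of occurrences can be
changed without touching the storage layer, and vice versa.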

An example of how the need for a search application may arise:
My former company (I quit recently :-))) re-published formerly
printed journals on CD-ROM and on the web, both versions in HTML.
(The transformation of the articles went smoothly once they had been
manually converted to our SGML.)

Now, we needed to provide document retrieval facilities for
distribution with these archive products. Based on Tim Kientzle's Java
search for 'Dr. Dobb's Journal' (http://www.ddj.com/), I decided to
adopt for it the way Oracle's full-text retrieval cartridge stores
word information, which is very compact and efficient. This format
also makes it possible to retain so-called 'user defined sections'
information, that is, the positions of selected start- and end-tags
within the parsed SGML documents.
All this information is then written to a binary-tree DB_file. This
data is commonly called an 'index', although it has nothing to do with
MS Access or with the concepts of RDBMSes. It's just a structured
binary file that stores information which applications can access
very efficiently. For example, you can find out very quickly whether,
and what, information has been stored for the word 'MAXILLOFACIAL',
or get the start and end of all <PRODUCT>...</PRODUCT> sections in
every document. The application can thus easily pick out those 'hits'
that fall into any of the sections to which a user has chosen to
restrict the search.
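
Restricting hits to sections can be sketched roughly as follows. The
data shapes are assumptions for the sake of the example: hits are
(document number, word position) pairs, and the section table lists
(start, end) positions per document for one tag such as PRODUCT.

```python
def hits_in_sections(hits, sections):
    """Keep only hits whose position lies inside one of the document's sections."""
    return [
        (doc, pos)
        for doc, pos in hits
        if any(start <= pos <= end for start, end in sections.get(doc, []))
    ]


# Hypothetical section table: document number -> (start, end) positions
# of every <PRODUCT>...</PRODUCT> section in that document.
product_sections = {1: [(10, 20)], 2: [(0, 5), (40, 60)]}

# Hypothetical hits for one search word.
word_hits = [(1, 4), (1, 15), (2, 50), (3, 7)]

restricted = hits_in_sections(word_hits, product_sections)
```

Here only the hits at position 15 in document 1 and position 50 in
document 2 survive; document 3 has no PRODUCT section at all.
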
The index file, once it is generated, can be accessed by users via
a Perl script (to be used in a web server environment) and via Java
classes (from within a browser in a local environment on PC, Mac,
Unix). My search frontends offer the choice of AND/OR combination
for multiple search words as well as the end truncation of search
words ('maxillo*').
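
A rough sketch of how AND/OR combination and end truncation can work
on top of such an index follows. It is not the Perl or Java frontend
code; the function names and the in-memory dictionary are assumptions.
The sorted word list mimics the ordered keys of a tree-structured
file, which is what makes prefix lookups cheap.

```python
from bisect import bisect_left


def expand_term(term, sorted_words):
    """Return the index words matching a term; a trailing '*' means prefix match."""
    if term.endswith("*"):
        prefix = term[:-1]
        matches = []
        # Jump to the first candidate, then walk while the prefix still matches.
        for word in sorted_words[bisect_left(sorted_words, prefix):]:
            if not word.startswith(prefix):
                break
            matches.append(word)
        return matches
    i = bisect_left(sorted_words, term)
    return [term] if i < len(sorted_words) and sorted_words[i] == term else []


def search(terms, index, mode="AND"):
    """Combine the document sets of all terms with AND (intersection) or OR (union)."""
    sorted_words = sorted(index)
    per_term = []
    for term in terms:
        docs = set()
        for word in expand_term(term.lower(), sorted_words):
            docs.update(index[word])
        per_term.append(docs)
    if not per_term:
        return set()
    combine = set.intersection if mode == "AND" else set.union
    return combine(*per_term)


# Hypothetical index: word -> set of document numbers.
sample_index = {"maxillary": {1}, "maxillofacial": {1, 2}, "surgery": {2, 3}}
and_result = search(["maxillo*", "surgery"], sample_index, "AND")
or_result = search(["maxillo*", "surgery"], sample_index, "OR")
```

With this sample data the AND search returns only document 2, while
the OR search returns documents 1, 2 and 3.
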
I will be happy to answer any further questions concerning the
search concepts.

The reason for bringing the subject up in this forum was the
following line of thought:
- Jade is often used to produce output that will be delivered
on electronic media.
- In this case, or in addition to the print format, a means of
convenient document retrieval is desirable and should be based on
the underlying structured SGML information for maximum usefulness.
- Jade users presumably take very different approaches and
consequently have to follow significant detours to accomplish a
similar task. (Read the message from Sean Hennesy.)
- Jade could possibly boost its popularity if a simple switch
provided an index database that can be used by free and easy to
configure frontends.

I am fully aware that the standard's query language follows its own
ideas for the implementation of search facilities (let alone HyTime).
Yet, I haven't seen a practical implementation and hence can't help
feeling that a simple and less academic search on static document
pools (no changes of the documents after being indexed) could
advertise the virtues of structured information beyond its current
heralds and following.

Steffen

---------
steffen heinrich, berlin, germany
"When you're chewing on life's gristle
Don't grumble, give a whistle
And DSSSL helps things turn out for the best..."
(Monty Python overheard)


 DSSSList info and archive:  http://www.mulberrytech.com/dsssl/dssslist

