Subject: RE: index generation
From: "Steffen Heinrich" <heinrich@xxxxxxxxxxxx>
Date: Mon, 5 Jul 1999 14:56:09 +0100
> From: "Didier PH Martin" <martind@xxxxxxxxxxxxx>
> Subject: RE: (Fwd) RE:jade batch mode, index generation
> ...
> indexes you are talking about. Is this relational database indexes
> (seems to if you refer to mbd files but I am not so sure).
>
> Is the goal to produce data base entries from SGML/XML elements?
>
> Can you clear that up, please, I am in limbo ....

Hi Didier,

sorry that I failed to express more clearly what I meant by 'index'. (I have been too consumed by this stuff lately.) No, I do not mean database indexes (as are generated for general performance gain in database environments) but word pointers to documents in a compact binary format, the use of which I describe below. (I'm rather sure there are some people on the list who have delved deeper into this subject than I have and probably know much more about the background. Please correct me where I'm wrong.)

I do not know the conversion process that Sean writes about, but I understand that their applications write the 'word info' (_which_ words are to be found _where_ in _which documents_) to a binary file in MS Access (.mdb) format. This is not compatible with, but comparable to, the Berkeley DB format that I am using.

If you are doing it all yourself, you may think up your own information format (the way that occurrences are coded for each word contained in the index) and write it to a file organized in any way you choose (the organization affects how, and how efficiently, the word info for a specific word can be retrieved). The file organization need not be reinvented, since efficient lookup in persistent data files is an ever-recurring task in data processing and there are dozens of formats with hundreds of APIs for any programming language. In my case I decided on a modified 'word-info' format, which I deciphered from Oracle's retrieval cartridge (Context), and on the Berkeley DB format for data storage (source code available from sleepycat.com).
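To make the idea of 'word info' concrete, here is a minimal in-memory sketch in Python. It is not the Context-derived format or the Berkeley DB file described above; the function names (build_index, lookup, lookup_prefix), the tokenizing rule and the two sample documents are all assumptions made up for illustration. It only shows the principle: each word maps to a list of (document, position) entries, and keeping the words in sorted order, as a B-tree file would, lets a lookup jump straight to the right place.

```python
import bisect
import re
from collections import defaultdict


def build_index(documents):
    """Map each word to a list of (doc_id, word_position) entries."""
    postings = defaultdict(list)
    for doc_id, text in documents.items():
        words = re.findall(r"[a-z0-9]+", text.lower())
        for position, word in enumerate(words):
            postings[word].append((doc_id, position))
    # Keep the vocabulary sorted, as a B-tree file would, so that
    # lookups do not have to scan every stored word.
    keys = sorted(postings)
    return keys, dict(postings)


def lookup(index, word):
    """Return all (doc_id, position) entries stored for one exact word."""
    _keys, postings = index
    return postings.get(word.lower(), [])


def lookup_prefix(index, prefix):
    """Return the indexed words that start with the given prefix."""
    keys, _postings = index
    prefix = prefix.lower()
    # Binary search for the first key that is >= prefix, then walk
    # forward while the keys still share the prefix.
    start = bisect.bisect_left(keys, prefix)
    matches = []
    for key in keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches


# Hypothetical sample data, not taken from any real archive.
documents = {
    "article1": "Maxillofacial surgery in review",
    "article2": "A review of maxillary and maxillofacial imaging",
}
index = build_index(documents)
```

With this sketch, lookup(index, "review") returns the document and word position of every occurrence, and lookup_prefix(index, "maxill") lists the stored words beginning with that prefix. A persistent version would write the same key/value pairs to a keyed file instead of holding them in a dictionary.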
An example of how the need for a search application may arise: My former company (I quit recently :-))) re-published formerly printed journals on CD-ROM and on the web, both versions in HTML. (The transformation of the articles went smoothly once they had been manually converted to our SGML.) Now we needed to provide document retrieval facilities for distribution with these archive products.

Based on Tim Kientzle's Java search for 'Dr. Dobb's Journal' (http://www.ddj.com/), I decided to adopt for it the very compact and efficient way in which Oracle's fulltext retrieval cartridge stores word information. This format also allows retaining so-called 'user defined sections' information, that is, the positions of selected start- and end-tags within the parsed SGML documents. All this information is then written to a binary-tree DB_file.

This data is commonly called an 'index', although it has nothing to do with MS Access or with the concepts of RDBMSes. It's just a structured binary file that stores information which applications can access very efficiently. For example, you can read very quickly whether, and which, information has been stored for the word 'MAXILLOFACIAL', or get the start and end of all <PRODUCT>...</PRODUCT> sections in every document. So the application can easily figure out those 'hits' that fall into any of the sections to which a user has chosen to restrict his search.

The index file, once it is generated, can be accessed by users via a Perl script (to be used in a web server environment) and via Java classes (from within a browser in a local environment on PC, Mac, Unix). My search frontends offer the choice of AND/OR combination of multiple search words as well as end truncation of search words ('maxillo*'). I will be happy to answer any further questions concerning the search concepts.

The reason to bring the subject up in this forum followed these thoughts:

- Jade is often used to produce output that will be delivered on electronic media.
- In this case, or in addition to the print format, a means of convenient document retrieval is desirable and should be based on the underlying structured SGML information for maximum usefulness.
- Jade users supposedly take very different approaches and consequently have to follow significant detours to accomplish a similar task. (Read the message from Sean Hennesy.)
- Jade could possibly boost its popularity if a simple switch provided an index database that can be used by free and easy-to-configure frontends.

I am fully aware that the standard's query language follows its own ideas for the implementation of search facilities (let alone HyTime). Yet I haven't seen a practical implementation and hence can't help feeling that a simple and less academic search on static document pools (no changes to the documents after they have been indexed) could advertise the virtues of structured information beyond its current heralds and following.

Steffen
---------
steffen heinrich, berlin, germany

"When you're chewing on life's gristle
Don't grumble, give a whistle
And DSSSL helps things turn out for the best..."
(Monty Python overheard)

DSSSList info and archive: http://www.mulberrytech.com/dsssl/dssslist