Re: (dsssl) Practical Bibliography question

Subject: Re: (dsssl) Practical Bibliography question
From: "Markus Hoenicka" <hoenicka_markus@xxxxxxxxxxxxxx>
Date: Sat, 13 Oct 2001 23:22:49 -0500

Trent Shipley writes:
 > Furthermore, it should be cognizant of existing practices and standards in 
 > library science and records management.  You would want to look at several 
 > XML (and SGML)  projects including DocBook and TEI, but also Dublin Core (a 
 > project by and for Librarians) and the activities of the Semantic Web working 
 > group (that in part build on the Dublin Core).  In addition, you would want to 
 > familiarize yourself with older document representation and storage formats 
 > like MARC.

RefDB currently lacks these capabilities. It is not meant to be a
system used by librarians. It is rather limited to the scope of what
Reference Manager and EndNote do: let a scientist manage references
and create bibliographies.

 >       In the end, you expect to wind up with some XML document type for 
 > document and media management.  It might be sufficient to just borrow some 
 > existing bibliography standard.  At worst the project's XML DTD will be an 
 > extension of some existing bibliography base.

RefDB is based on RIS, which is a tagged (non-SGML-like) format used by
essentially all end-user reference managers. One of the design
constraints was to be compatible with existing commercial reference
managers. SGML/XML-based input could easily be added if it is designed
as a superset of what RIS offers. This might make the librarian happy.
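
For readers who have not seen RIS: each line is a two-character tag,
two spaces, a hyphen, a space, and then the value, and a record ends
with an ER line. A minimal Python sketch of a reader follows. The
function name and the sample record are invented for illustration;
this is not RefDB's import code, and it skips continuation lines and
error handling.

```python
from collections import defaultdict

def parse_ris(text):
    """Parse RIS text into a list of records (tag -> list of values)."""
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        # A RIS line is: two-character tag, two spaces, a hyphen, the value.
        if len(line) < 5 or line[2:5] != "  -":
            continue  # this sketch simply skips blank or malformed lines
        tag, value = line[:2], line[5:].strip()
        if tag == "ER":  # end-of-record marker
            records.append(dict(current))
            current = defaultdict(list)
        else:
            current[tag].append(value)  # tags such as AU may repeat
    return records

# Invented sample record, for illustration only.
SAMPLE = """TY  - JOUR
AU  - Doe, J.
AU  - Roe, R.M.
PY  - 2001
TI  - An invented title for illustration
ER  - 
"""

records = parse_ris(SAMPLE)
```

Keeping every tag as a list of values is one way to cope with
repeatable tags like AU without special cases.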

 > Phase two is to design a storage, search, retrieval and maintenance schema for 
 > the data entered in phase zero and put into a canonical representation in 
 > phase one.
 >     Here is where the OO database comes into play.  Even more than an OO 
 > database, what I would love would be what I call a "document-base."  This is 
 > a type of automated knowledge base with OO functions that uses the structure 
 > of a markup language to store, search, retrieve and manage marked-up 
 > documents. 

I'm not aware of such a tool yet. Existing XML databases are not OO
afaik, and the search/retrieve capabilities are far less advanced than
even the lamest SQL implementation.

 > > While I admire your guts to implement this in DSSSL, I still think
 > > DSSSL plus external preformatting is more suitable for this task. This
 > > is not beautiful in any sense, but it appears to work. The strategy in
 > > my RefDB package is like this (I use DocBook tag names, but I assume
 > > TEI is not too different):
 > Yes this will work.  But it is *not* necessary.  For example, the commercial 
 > product EndNote does not store external formatting, but it can return 
 > formatted data for inclusion in a Word or WordPerfect document.

Maybe I don't get your point here. RefDB does not store any external
formatting; the datasets are as raw as can be. The RefDB bibliography
tool does do preformatting, though: it creates the proper character
sequence for each element (e.g. author name formatting: F.M. Last or
Last, F. M. or Last,F.M. or Last FM or whatever), and it creates the
proper element sequence with the proper punctuation in between. This
preformatting is performed on the fly whenever a bibliography is
requested, based on the requested reference style.
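
To illustrate what preformatting of a single element means, here is a
minimal Python sketch that renders one raw author name in the four
variants mentioned above. The style names and function names are
hypothetical labels chosen for this example; RefDB's actual style
definitions look different.

```python
def format_author(last, initials, style):
    """Render one author name; `initials` is a list such as ["F", "M"]."""
    if style == "initials-first":       # F.M. Last
        return "".join(i + "." for i in initials) + " " + last
    if style == "last-comma-spaced":    # Last, F. M.
        return last + ", " + " ".join(i + "." for i in initials)
    if style == "last-comma-tight":     # Last,F.M.
        return last + "," + "".join(i + "." for i in initials)
    if style == "last-bare":            # Last FM
        return last + " " + "".join(initials)
    raise ValueError("unknown style: " + style)

def format_author_list(authors, style, separator=", "):
    """Join formatted names with the separator the style prescribes."""
    return separator.join(format_author(last, ini, style) for last, ini in authors)
```

The raw data stays the same for every style; only the requested style
decides the character sequence that ends up in the bibliography.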

 > the reference is a mnemonic primary key (usually author, date, and part of a 
 > title).
 > In both cases if you imagine that the users work off a mammoth shared 
 > knowledge base then use of abstract IDs becomes cumbersome.  It would be much 
 > better to use some natural primary key (or approximate primary key), like 
 > authors + date + title.  
 > [ [
 > In fact authors + date + title will be an alternate primary key.  The 
 > knowledge base will actually use an id number (probably an accession number) 
 > as its internal primary key.
 > ] ]
 > This is canonical database engineering.  Never force the end-user to use 
 > non-meaningful primary keys (like ID numbers) to access the database.

This could be implemented, although it raises a few questions:
e.g. how do you know the key in advance? If it uses a part of the
title, which part? How is capitalization handled? What happens if you
know only one of several authors? etc. My experience with citing is
that you have to look it up in the database anyway. In that case, I
prefer to enter three or four digits into my xref element instead of
author, date, title. If you really need a hint as to which publication
it is, why not add this in an SGML comment?
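
To make these questions concrete, here is a minimal Python sketch of
one hypothetical key convention: first author, year, and the first
title word longer than three letters, all folded to lower case. Both
the convention and the function name are assumptions made up for this
example, not something RefDB implements.

```python
import re

def natural_key(first_author_last, year, title):
    """Build a key such as 'doe2001practical' (a hypothetical convention)."""
    # Decision 1: which part of the title? Here, the first word that is
    # longer than three letters, so short words like "A" or "The" are skipped.
    words = re.findall(r"[A-Za-z]+", title)
    word = next((w for w in words if len(w) > 3), "")
    # Decision 2: capitalization is folded to lower case.
    return (first_author_last + str(year) + word).lower()

# Two different (invented) titles that collapse to the same key.
key_a = natural_key("Doe", 2001, "A Practical Bibliography Question")
key_b = natural_key("Doe", 2001, "The practical guide")
```

Every such rule has to be known to the person typing the citation, and
two publications can still end up with the same key.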

 > Up to this point I do not think I have over-simplified the problem too much.
 > The part where I did oversimplify is in describing the application or 
 > applications that use the bibliographic database to create in-text citations 
 > and reference lists that conform to the style manual of a given journal.  
 > (Any number of given journals, really)
 > > We have to write an XML document for each bibliography style (i.e. for
 > > each supported journal) that contains all formatting and punctuation
 > > rules for the in-text citations and the bibliography. These styles are
 > > stored in a SQL database for easy access.
 > If we have a universal citation formatting tool (and that *is* the goal), 
 > then it needs to know what style manual we are using (and the rules for that 
 > style).  It will also need to be told or need to infer the type of each 
 > citation.  We assume it already knows what base document it will be working 
 > on.

The type of the citation must be in the reference dataset in the
database (this is how RefDB handles it). Nothing else would reasonably

 > It is reasonable to store the style source code in a database or document 
 > base.
 > > The references themselves are stored in another SQL database. They can
 > > contain any additional information like keywords, notes, abstracts to
 > > retrieve them easily.
 > Agreed.  (Except for the SQL part.  But SQL and full Relational competence is 
 > a big plus.)

SQL is for practical reasons only. I don't claim any theoretical
advantage here. SQL implementations are widely available, and the
software could be made implementation-independent. RefDB currently
handles only MySQL, but support for other databases will be added.

 > I envision a somewhat different sequence.  First I consider auto generating 
 > non-interactive text for printing.  I describe a two pass process.  Purists 
 > can merge the two passes if they want.
 > ---
 > Use an appropriate query and transform tool (eg OpenJade) for a first pass to 
 > convert Pre-Press marked up document A[raw citations] to A[cooked citations].
 > Extract the xrefs from the text, whether or not they are real xrefs or 
 > logical primary keys.  
 >     Some references may be 1) dangling with no referent. 2) be ambiguous with 
 > more than one referent.  Note these in the exception log(s).  [This is 
 > synchronization]
 > [Begin pre-formatting]
 > Pull the structured bibliography data from the knowledge base.  Pull the 
 > collation data from the designated style sheet.  
 >      Internal sort authors, editors, etc. for each entry
 >      'External' sort the entries.
 >      Log errors and warnings.
 > [End pre-formatting.  Begin transform]
 > Pull the reference style data from the stylesheet.
 > Transform the references to cooked references.
 >     Log errors and warnings.
 > Cook the in-text citations.
 >     Log errors and warnings.
 >     Log summary statistics.
 > [End transform]
 > Phase two: Use a styling tool to make the next step to hardcopy.  (If we use 
 > OJ and DSSSL then obviously we have TeX --> DVI --> PS | PDF) 
 > --------
 > For HTML you replace the to-text styling tool with another transform phase.

If I understand you correctly, RefDB does pretty much what you suggest.
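
The synchronization and 'external' sort steps from the quoted outline
can be sketched roughly like this in Python. The data layout (a dict
mapping each key to the list of matching records) and all names are
hypothetical; this is an illustration of the outline, not RefDB's code.

```python
def synchronize(cited_keys, knowledge_base):
    """Split cited keys into resolved entries and logged exceptions.

    `knowledge_base` maps a key to the list of records matching it.
    """
    resolved, log = {}, []
    for key in cited_keys:
        matches = knowledge_base.get(key, [])
        if not matches:
            log.append(("dangling", key))    # no referent at all
        elif len(matches) > 1:
            log.append(("ambiguous", key))   # more than one referent
        else:
            resolved[key] = matches[0]
    return resolved, log

def sort_entries(resolved):
    """'External' sort: order entries by first author, then by year."""
    return sorted(resolved.values(), key=lambda r: (r["authors"][0], r["year"]))
```

In a real style the sort key would come from the stylesheet's
collation data rather than being hard-wired as it is here.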

 > Instead of hyper-linking the in-text citations to entries in the master 
 > database I would make them internal links to the long citation in the 
 > bibliography.  (If the bibliography knowledge base is a public or corporate 
 > resource sophisticated links might go from there to the bibliography 
 > knowledge base browser ... or whatever.)

I'm afraid you took me all wrong here. The hyperlinks go from the
in-text citation to the corresponding entry in the bibliography,
i.e. to another location in the same document. The final document is
self-contained: you can walk away from the SQL database and all RefDB
tools and process the document like any other SGML
document. The printable or HTML output is also self-contained with
respect to the citation/reference stuff as no hyperlinks to locations
outside of the current document are created.

 > How much of a practical advantage is it to trade in style manual transform 
 > programs for that many variables?

What exactly does "in style manual transform programs" mean in this
context? I'm afraid I don't understand.

 > What do you mean by helper script?
 > Does it just put the values for the variables in the OJ command line?

Exactly. This is one of two solutions to get the variable values into
the stylesheet. The other solution is to create a customized
stylesheet on the fly with the appropriate values. Neither of these
solutions is particularly elegant, so I used the one which is
easier to implement. The downside is that this does not work with good
ol' Jade (you can only set variables to "true" but not to a specific
value), so I'll probably have to implement the other solution as well.
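
As a rough illustration of the first solution, a helper could assemble
the OpenJade command line as sketched below in Python. This assumes
OpenJade's -V name=value form; the variable names, file names and the
function itself are hypothetical and are not taken from the actual
RefDB helper script. The sketch only builds the argument list and does
not run anything.

```python
def build_openjade_command(stylesheet, document, variables, backend="sgml"):
    """Return an argv list that passes each variable as -V name=value."""
    argv = ["openjade", "-t", backend, "-d", stylesheet]
    # Sort the names so the resulting command line is reproducible.
    for name, value in sorted(variables.items()):
        argv += ["-V", f"{name}={value}"]
    argv.append(document)
    return argv

# Hypothetical variable names and file names, for illustration only.
command = build_openjade_command(
    "bibliography.dsl",
    "article.sgml",
    {"author-separator": ", ", "year-in-parens": "yes"},
)
```

Passing the list to a process launcher without a shell avoids quoting
problems when a value contains spaces or punctuation.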


Markus Hoenicka
