Re: (dsssl) Practical Bibliography question

Subject: Re: (dsssl) Practical Bibliography question
From: Trent Shipley <tcshipley@xxxxxxxxxxxxx>
Date: Sat, 13 Oct 2001 13:31:00 -0700
On Sunday 07 October 2001 22:31, you wrote:
> I don't want to appear too intrusive, but I think the model that you
> outline is actually too simplistic to be suitable for general use. It
> might work for your current task (which is a good enough reason to
> pursue this path), but if you look closer the problems creep in. I
> have a background in biology/biochemistry/pharmacology, and the
> formatting requirements for bibliographies in this field are far more
> demanding than what you outlined. In most journals, citing a paper
> needs authors, title, publication year, journal name, volume number,
> issue number, start and end page. Citing a conference proceeding or a
> chapter in a book adds editors, series editors, series title,
> publishers and whatnot. Every journal has adopted its own rules for
> sequence, formatting and punctuation. In-text citations can be
> numerical in square brackets, numerical in angle brackets, numerical
> in superscripts, or author-date (with varying number of authors cited,
> of course). Multiple citations with adjacent numbers can be cited
> explicitly or be folded into a range. First and subsequent citations
> of a reference may be treated differently, e.g. first lists all
> authors, subsequent list only the first author et al. All this is no
> fun business, unfortunately.

I am aware that bibliography/long_citation formats are many and varied.  
Furthermore, from a programmer's point of view they are type dependent.

One solution would be to store them in a fully object-oriented database.  
(Problem number one: I don't know of a *fully* OO database.  The one that 
seems to come closest is Postgres.)

Phase zero would be a user friendly interface for adding entries.  I say we 
do this "last" in the hypothetical project.

Phase one would be the design of some portable, intermediate format.  This 
should be accessible by text editing tools like vi, emacs, and Notepad(tm).  
Furthermore, it should be cognizant of existing practices and standards in 
library science and records management.  You would want to look at several 
XML (and SGML) projects including DocBook and TEI, but also Dublin Core (a 
project by and for librarians) and the activities of the Semantic Web working 
group (that in part build on the Dublin Core).  In addition, you would want to 
familiarize yourself with older document representation and storage formats 
like MARC.
      In the end, you expect to wind up with some XML document type for 
document and media management.  It might be sufficient to just borrow some 
existing bibliography standard.  At worst the project's XML DTD will be an 
extension of some existing bibliography base.

Phase two is to design a storage, search, retrieval and maintenance schema for 
the data entered in phase zero and put into a canonical representation in 
phase one.
    Here is where the OO database comes into play.  Even more than an OO 
database, what I would love would be what I call a "document-base."  This is 
a type of automated knowledge base with OO functions that uses the structure 
of a markup language to store, search, retrieve and manage marked-up 
documents. 


> While I admire your guts to implement this in DSSSL, I still think
> DSSSL plus external preformatting is more suitable for this task. This
> is not beautiful in any sense, but it appears to work. The strategy in
> my RefDB package is like this (I use DocBook tag names, but I assume
> TEI is not too different):

Yes, this will work.  But it is *not* necessary.  For example, the commercial 
product EndNote does not store external formatting, but it can return 
formatted data for inclusion in a Word or WordPerfect document.

> In-text citations use a citation element with one to many xref
> elements. The latter specify the ID of the reference in an SQL
> database. An additional xref element with a special attribute is used
> in citations with more than one xref.

This too works.  It is not *too* cumbersome for hard science where you have 
at most dozens of references and the in-text citation tends to be a 
non-mnemonic or semi-mnemonic abstract reference.

It is a bit frustrating for social science and even more for humanities where 
the reference is a mnemonic primary key (usually author, date, and part of a 
title).

In both cases, if you imagine that the users work off a mammoth shared 
knowledge base, then use of abstract IDs becomes cumbersome.  It would be much 
better to use some natural primary key (or approximate primary key), like 
authors + date + title.

[ [
In fact authors + date + title will be an alternate primary key.  The 
knowledge base will actually use an id number (probably an accession number) 
as its internal primary key.
] ]

This is canonical database engineering.  Never force the end-user to use 
non-meaningful primary keys (like ID numbers) to access the database.
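
The key arrangement described above can be sketched roughly as follows.  This 
is only a minimal illustration using Python's sqlite3 module; the table name, 
column names and sample rows are invented for the example and are not part of 
any existing tool.  The accession number is the internal primary key, and 
authors + date + title is declared as an alternate key that the user searches 
on.

```python
import sqlite3

# In-memory database; the schema and all names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reference (
        accession_id INTEGER PRIMARY KEY,  -- internal key, never typed by users
        authors      TEXT    NOT NULL,
        year         INTEGER NOT NULL,
        title        TEXT    NOT NULL,
        UNIQUE (authors, year, title)      -- alternate (natural) key
    )
""")
conn.executemany(
    "INSERT INTO reference (authors, year, title) VALUES (?, ?, ?)",
    [
        ("Doe, J.", 1999, "An Invented Title"),          # invented sample data
        ("Roe, R.", 2000, "Another Invented Title"),
    ],
)


def find_by_natural_key(conn, authors, year, title_prefix):
    """Look an entry up by author, year and the first part of its title."""
    return conn.execute(
        "SELECT accession_id, authors, year, title FROM reference "
        "WHERE authors = ? AND year = ? AND title LIKE ? || '%'",
        (authors, year, title_prefix),
    ).fetchall()


matches = find_by_natural_key(conn, "Doe, J.", 1999, "An Inv")
```

The lookup returns a list because a partial title is only an approximate key: 
zero rows means a dangling reference and more than one row means an ambiguous 
one.  A real tool would also escape LIKE wildcards in the user's input.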

Up to this point I do not think I have over-simplified the problem too much.

The part where I did oversimplify is in describing the application or 
applications that use the bibliographic database to create in-text citations 
and reference lists that conform to the style manual of a given journal.  
(Any number of given journals, really.)

> We have to write an XML document for each bibliography style (i.e. for
> each supported journal) that contains all formatting and punctuation
> rules for the in-text citations and the bibliography. These styles are
> stored in a SQL database for easy access.

If we have a universal citation formatting tool (and that *is* the goal), 
then it needs to know what style manual we are using (and the rules for that 
style).  It will also need to be told or need to infer the type of each 
citation.  We assume it already knows what base document it will be working 
on.

It is reasonable to store the style source code in a database or document 
base.
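
To make the "style plus citation type" point concrete, here is a minimal 
sketch in Python.  It assumes a style can be reduced to one template per 
reference type, which real style manuals certainly exceed; the style table, 
the field names and the sample entry are all invented for illustration.

```python
# A hypothetical style: one template per reference type.
EXAMPLE_STYLE = {
    "journal": "{authors} ({year}). {title}. {journal} {volume}: {pages}.",
    "book": "{authors} ({year}). {title}. {publisher}.",
}


def format_reference(entry, style):
    """Format one entry using the rule the style gives for the entry's type."""
    template = style.get(entry["type"])
    if template is None:
        # The tool cannot guess: the style has no rule for this type.
        raise KeyError(f"style has no rule for type {entry['type']!r}")
    return template.format(**entry)


entry = {  # invented sample data
    "type": "journal",
    "authors": "Doe J",
    "year": 1999,
    "title": "An Invented Title",
    "journal": "J. Hypothetical Res.",
    "volume": 12,
    "pages": "34-56",
}
formatted = format_reference(entry, EXAMPLE_STYLE)
```

The same entry formatted under a different style only needs a different 
table, which is why storing the style source as data is attractive.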

> The references themselves are stored in another SQL database. They can
> contain any additional information like keywords, notes, abstracts to
> retrieve them easily.

Agreed.  (Except for the SQL part.  But SQL and full Relational competence is 
a big plus.)

> First we use OpenJade to extract a list of all citation-related
> xrefs. Their relation (sequence of the citations, sequence of xrefs
> inside the same citation) is preserved. The resulting XML document is
> fed to the bibliography tool which pulls the necessary references from
> the SQL database, using the proper bibliography style.
> The tool
> creates "cooked" bibliography entries containing bibliomset elements
> with the bibliography data proper ("cooked" means it contains all
> punctuation and similar characters which need to be
> generated). Additional bibliomset elements are provided for multiple
> citations. This way, multiple in-text citations can be displayed
> either according to the bibliography style (e.g. as [1-3,5,7-10]) or
> as individual citations ([1,2,3,5,7,8,9,10]). The latter case may be
> wrong from the viewpoint of the bibliography style, but it preserves
> the hyperlinks from the citation to the reference in a suitable output
> format (HTML or PDF). The bibliography entries themselves (bibliomixed
> elements) carry attributes to identify the database ID, the reference
> type (journal, book, abstract, chapter etc), and a label for use as
> the in-text citation.

I envision a somewhat different sequence.  First I consider auto generating 
non-interactive text for printing.  I describe a two pass process.  Purists 
can merge the two passes if they want.

---

Use an appropriate query and transform tool (e.g. OpenJade) for a first pass to 
convert Pre-Press marked up document A[raw citations] to A[cooked citations].

Extract the xrefs from the text, whether or not they are real xrefs or 
logical primary keys.  
    Some references may be 1) dangling, with no referent, or 2) ambiguous, with 
more than one referent.  Note these in the exception log(s).  [This is 
synchronization]
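
A rough sketch of this synchronization step in Python, assuming for 
illustration that a raw citation key is an (author, year) pair and that the 
knowledge base is a plain list of records.  The function name, the record 
layout and the sample data are invented.

```python
import logging

log = logging.getLogger("citation-sync")


def synchronize(raw_keys, knowledge_base):
    """Resolve each (author, year) key; collect dangling and ambiguous ones."""
    resolved, dangling, ambiguous = {}, [], []
    for key in raw_keys:
        author, year = key
        matches = [e for e in knowledge_base
                   if e["author"] == author and e["year"] == year]
        if not matches:
            dangling.append(key)                      # no referent
            log.warning("dangling citation: %s", key)
        elif len(matches) > 1:
            ambiguous.append(key)                     # more than one referent
            log.warning("ambiguous citation: %s matches %d entries",
                        key, len(matches))
        else:
            resolved[key] = matches[0]["id"]
    return resolved, dangling, ambiguous


KB = [  # invented sample data
    {"id": 1, "author": "Doe", "year": 1999, "title": "First Invented Title"},
    {"id": 2, "author": "Roe", "year": 2000, "title": "Second Invented Title"},
    {"id": 3, "author": "Roe", "year": 2000, "title": "Third Invented Title"},
]
resolved, dangling, ambiguous = synchronize(
    [("Doe", 1999), ("Roe", 2000), ("Poe", 1845)], KB)
```

Here ("Roe", 2000) is ambiguous until the user adds part of the title, and 
("Poe", 1845) is dangling.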

[Begin pre-formatting]

Pull the structured bibliography data from the knowledge base.  Pull the 
collation data from the designated style sheet.  
     Internal sort authors, editors, etc. for each entry
     'External' sort the entries.
     Log errors and warnings.

[End pre-formatting.  Begin transform]

Pull the reference style data from the stylesheet.
Transform the references to cooked references.
    Log errors and warnings.

Cook the in-text citations.
    Log errors and warnings.
    Log summary statistics.

[End transform]
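
The pre-formatting and transform steps above might look like this in Python.  
It is a sketch under simplifying assumptions: the collation rule is "sort 
contributors alphabetically, then sort entries by contributors, year and 
title", and the cooked form is a numbered list.  A real style sheet would 
supply both; all names and data here are invented.

```python
def preformat(entries, sort_contributors=True):
    """Internal sort within each entry, then 'external' sort of the entries."""
    prepared = []
    for entry in entries:
        entry = dict(entry)  # leave the caller's records untouched
        if sort_contributors:
            entry["authors"] = sorted(entry["authors"], key=str.casefold)
        prepared.append(entry)
    prepared.sort(key=lambda e: ([a.casefold() for a in e["authors"]],
                                 e["year"], e["title"].casefold()))
    return prepared


def cook(entries):
    """Assign numeric labels in sorted order and build the cooked references."""
    labels = {e["id"]: n for n, e in enumerate(entries, start=1)}
    cooked = [
        f"[{labels[e['id']]}] {'; '.join(e['authors'])} ({e['year']}). {e['title']}."
        for e in entries
    ]
    return labels, cooked


raw = [  # invented sample data
    {"id": 7, "authors": ["Roe R", "Doe J"], "year": 2000,
     "title": "Second Invented Title"},
    {"id": 3, "authors": ["Poe E"], "year": 1999,
     "title": "First Invented Title"},
]
labels, cooked = cook(preformat(raw))
```

The labels mapping is what the later step uses to cook the in-text 
citations: each resolved reference id is replaced by its number.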

Pass two: Use a styling tool to make the next step to hardcopy.  (If we use 
OJ and DSSSL then obviously we have TeX --> DVI --> PS | PDF) 

--------

For HTML you replace the to-text styling tool with another transform phase.

Instead of hyper-linking the in-text citations to entries in the master 
database I would make them internal links to the long citation in the 
bibliography.  (If the bibliography knowledge base is a public or corporate 
resource sophisticated links might go from there to the bibliography 
knowledge base browser ... or whatever.)
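
As a small illustration of the internal-link idea, a Python sketch that 
emits an in-text citation pointing at an anchor on the long citation in the 
same HTML document.  The "ref-N" id scheme and the function names are 
assumptions made for the example.

```python
from html import escape


def cite_link(number):
    """In-text citation linking to the bibliography entry in the same page."""
    return f'<a href="#ref-{number}">[{number}]</a>'


def bib_entry(number, text):
    """Bibliography list item carrying the anchor the citation points at."""
    return f'<li id="ref-{number}">{escape(text)}</li>'


link = cite_link(3)
item = bib_entry(3, "Doe & Roe (2000). An Invented Title.")
```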

In version _n_ of the software I would want to replace the 
[1-5] --> [1,2,3,4,5] with something more sophisticated

[1-5] -link-> reveal a new window with options for 1,2,..,5 -link-> internal 
long citation.
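
For reference, folding and unfolding numeric citation ranges is easy to 
sketch in Python.  The rule that only runs of three or more numbers are 
folded is an assumption; a given journal style may fold runs of two as well.

```python
def fold(numbers, min_run=3):
    """Fold sorted citation numbers into ranges, e.g. [1-3,5,7-10]."""
    nums = sorted(set(numbers))
    parts = []
    i = 0
    while i < len(nums):
        j = i
        # Extend j to the end of the run of consecutive numbers starting at i.
        while j + 1 < len(nums) and nums[j + 1] == nums[j] + 1:
            j += 1
        if j - i + 1 >= min_run:
            parts.append(f"{nums[i]}-{nums[j]}")
        else:
            parts.extend(str(n) for n in nums[i:j + 1])
        i = j + 1
    return "[" + ",".join(parts) + "]"


def expand(label):
    """Expand a folded label back into the individual citation numbers."""
    numbers = []
    for part in label.strip("[]").split(","):
        if "-" in part:
            low, high = part.split("-")
            numbers.extend(range(int(low), int(high) + 1))
        else:
            numbers.append(int(part))
    return numbers


folded = fold([1, 2, 3, 5, 7, 8, 9, 10])
unfolded = expand("[1-5]")
```

The expanded list is what a link chooser would present: one target per 
number, each pointing at its own long citation.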

The long references by release _n_ might also have complex linking options.

long reference [1] -link-> 1st reference to me, 2nd, .. ,n -link-> goto 
selected reference.

long reference [1] -link-> goto entry for this reference in the bibliographic 
database.

long reference [1].author-x -link-> return works with contributor-x in the 
bibliographic database (e.g. search web for contributor-x).

ETC.

> The whole bibliography is written to a valid SGML document which can
> be incorporated as an external entity into the original document.

Ok
 
> The original document is then processed with the tweaked DocBook
> stylesheets. They take care to specially format the RefDB
> bibliographies. The in-text citations are pulled from the bibliography
> entry labels via the xref mechanism. The bibliography itself is
> formatted according to the values of up to 600 variables (in real
> life, rarely more than a dozen are used at a time). These values, which
> are based on the bibliography style, are exported by the bibliography
> tool and are fed into OpenJade by a helper script. This dirty trick
> frees us from having to provide one customized stylesheet per journal.

Cool.

How much of a practical advantage is it to trade in style manual transform 
programs for that many variables?

What do you mean by helper script?

Does it just put the values for the variables in the OJ command line?


 DSSSList info and archive:  http://www.mulberrytech.com/dsssl/dssslist
