Subject: Re: (dsssl) Practical Bibliography question
From: Trent Shipley <tcshipley@xxxxxxxxxxxxx>
Date: Sat, 13 Oct 2001 13:31:00 -0700
On Sunday 07 October 2001 22:31, you wrote:
> I don't want to appear too intrusive, but I think the model that you
> outline is actually too simplistic to be suitable for general use. It
> might work for your current task (which is a good enough reason to
> pursue this path), but if you look closer the problems creep in. I
> have a background in biology/biochemistry/pharmacology, and the
> formatting requirements for bibliographies in this field are far more
> demanding than what you outlined. In most journals, citing a paper
> needs authors, title, publication year, journal name, volume number,
> issue number, start and end page. Citing a conference proceeding or a
> chapter in a book adds editors, series editors, series title,
> publishers and whatnot. Every journal has adopted its own rules for
> sequence, formatting and punctuation. In-text citations can be
> numerical in square brackets, numerical in angle brackets, numerical
> in superscripts, or author-date (with varying number of authors cited,
> of course). Multiple citations with adjacent numbers can be cited
> explicitly or be folded into a range. First and subsequent citations
> of a reference may be treated differently, e.g. first lists all
> authors, subsequent list only the first author et al. All this is no
> fun business, unfortunately.

I am aware that bibliography/long_citation formats are many and varied. Furthermore, from a programmer's point of view they are type dependent. One solution would be to store them in a fully object-oriented database. (Problem number one: I don't know of a *fully* OO database. The one that seems to come closest is Postgres.)

Phase zero would be a user-friendly interface for adding entries. I say we do this "last" in the hypothetical project.

Phase one would be the design of some portable, intermediate format. This should be accessible by text editing tools like vi, emacs, and Notepad(tm).
Furthermore, it should be cognizant of existing practices and standards in library science and records management. You would want to look at several XML (and SGML) projects including DocBook and TEI, but also Dublin Core (a project by and for librarians) and the activities of the Semantic Web working group (that in part build on the Dublin Core). In addition, you would want to familiarize yourself with older document representation and storage formats like MARC.

In the end, you expect to wind up with some XML document type for document and media management. It might be sufficient to just borrow some existing bibliography standard. At worst the project's XML DTD will be an extension of some existing bibliography base.

Phase two is to design a storage, search, retrieval and maintenance schema for the data entered in phase zero and put into a canonical representation in phase one. Here is where the OO database comes into play. Even more than an OO database, what I would love would be what I call a "document-base." This is a type of automated knowledge base with OO functions that uses the structure of a markup language to store, search, retrieve and manage marked-up documents.

> While I admire your guts to implement this in DSSSL, I still think
> DSSSL plus external preformatting is more suitable for this task. This
> is not beautiful in any sense, but it appears to work. The strategy in
> my RefDB package is like this (I use DocBook tag names, but I assume
> TEI is not too different):

Yes, this will work. But it is *not* necessary. For example, the commercial product EndNote does not store external formatting, but it can return formatted data for inclusion in a Word or WordPerfect document.

> In-text citations use a citation element with one to many xref
> elements. The latter specify the ID of the reference in an SQL
> database. An additional xref element with a special attribute is used
> in citations with more than one xref.

This too works.
It is not *too* cumbersome for hard science, where you have at most dozens of references and the in-text citation tends to be a non-mnemonic or semi-mnemonic abstract reference. It is a bit frustrating for social science and even more for humanities, where the reference is a mnemonic primary key (usually author, date, and part of a title). In both cases, if you imagine that the users work off a mammoth shared knowledge base, then use of abstract IDs becomes cumbersome. It would be much better to use some natural primary key (or approximate primary key), like authors + date + title.

[ [ In fact authors + date + title will be an alternate primary key. The knowledge base will actually use an id number (probably an accession number) as its internal primary key. ] ]

This is canonical database engineering. Never force the end-user to use non-meaningful primary keys (like ID numbers) to access the database.

Up to this point I do not think I have over-simplified the problem too much. The part where I did oversimplify is in describing the application or applications that use the bibliographic database to create in-text citations and reference lists that conform to the style manual of a given journal. (Any number of given journals, really.)

> We have to write an XML document for each bibliography style (i.e. for
> each supported journal) that contains all formatting and punctuation
> rules for the in-text citations and the bibliography. These styles are
> stored in a SQL database for easy access.

If we have a universal citation formatting tool (and that *is* the goal), then it needs to know what style manual we are using (and the rules for that style). It will also need to be told or need to infer the type of each citation. We assume it already knows what base document it will be working on.

It is reasonable to store the style source code in a database or document base.

> The references themselves are stored in another SQL database.
> They can contain any additional information like keywords, notes,
> abstracts to retrieve them easily.

Agreed. (Except for the SQL part. But SQL and full relational competence is a big plus.)

> First we use OpenJade to extract a list of all citation-related
> xrefs. Their relation (sequence of the citations, sequence of xrefs
> inside the same citation) is preserved. The resulting XML document is
> fed to the bibliography tool which pulls the necessary references from
> the SQL database, using the proper bibliography style.

> The tool creates "cooked" bibliography entries containing bibliomset
> elements with the bibliography data proper ("cooked" means it contains
> all punctuation and similar characters which need to be
> generated). Additional bibliomset elements are provided for multiple
> citations. This way, multiple in-text citations can be displayed
> either according to the bibliography style (e.g. as [1-3,5,7-10]) or
> as individual citations ([1,2,3,5,7,8,9,10]). The latter case may be
> wrong from the viewpoint of the bibliography style, but it preserves
> the hyperlinks from the citation to the reference in a suitable output
> format (HTML or PDF). The bibliography entries themselves (bibliomixed
> elements) carry attributes to identify the database ID, the reference
> type (journal, book, abstract, chapter etc), and a label for use as
> the in-text citation.

I envision a somewhat different sequence. First I consider auto-generating non-interactive text for printing. I describe a two-pass process. Purists can merge the two passes if they want.

---

Use an appropriate query and transform tool (e.g. OpenJade) for a first pass to convert pre-press marked-up document A[raw citations] to A[cooked citations].

Extract the xrefs from the text, whether they are real xrefs or logical primary keys. Some references may be
1) dangling, with no referent, or
2) ambiguous, with more than one referent.
Note these in the exception log(s).
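
As a rough illustration of the synchronization step just described, here is a minimal Python sketch, not an actual implementation. The record fields (accession, author, year, title), the key scheme and the function names are all hypothetical. It assumes the cited keys have already been extracted from the document, and that a logical key is (first author, year, first word of the title), with the accession number as the internal primary key.

```python
from collections import defaultdict


def build_index(records):
    """Map each logical key to the accession numbers that share it.

    The key scheme (author, year, first title word) is an assumption
    made for this sketch only.
    """
    index = defaultdict(list)
    for rec in records:
        key = (rec["author"].lower(), rec["year"], rec["title"].split()[0].lower())
        index[key].append(rec["accession"])
    return index


def synchronize(cited_keys, index):
    """Resolve cited keys to accession numbers and collect exceptions.

    Returns (resolved, exceptions). Keys with no referent are reported
    as "dangling", keys with more than one referent as "ambiguous".
    """
    resolved = {}
    exceptions = []
    for key in cited_keys:
        matches = index.get(key, [])
        if len(matches) == 1:
            resolved[key] = matches[0]
        elif not matches:
            exceptions.append(("dangling", key))
        else:
            exceptions.append(("ambiguous", key))
    return resolved, exceptions


# Hypothetical knowledge-base contents and in-text citation keys.
records = [
    {"accession": 101, "author": "Knuth", "year": 1984, "title": "The TeXbook"},
    {"accession": 102, "author": "Smith", "year": 1999, "title": "Markup Languages"},
    {"accession": 103, "author": "Smith", "year": 1999, "title": "Markup in Practice"},
]
cited = [("knuth", 1984, "the"), ("smith", 1999, "markup"), ("jones", 2001, "styles")]

resolved, exceptions = synchronize(cited, build_index(records))
for kind, key in exceptions:
    print(kind, key)  # stand-in for writing to the exception log
```

The point of the sketch is only that the end user cites by a meaningful key while the accession number stays internal; a real tool would need a far more forgiving key match than this one.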
[This is synchronization]

[Begin pre-formatting]
Pull the structured bibliography data from the knowledge base.
Pull the collation data from the designated style sheet.
Internal sort authors, editors, etc. for each entry.
'External' sort the entries.
Log errors and warnings.
[End pre-formatting. Begin transform]
Pull the reference style data from the stylesheet.
Transform the references to cooked references.
Log errors and warnings.
Cook the in-text citations.
Log errors and warnings.
Log summary statistics.
[End transform]

Phase two: Use a styling tool to make the next step to hardcopy. (If we use OJ and DSSSL then obviously we have TeX --> DVI --> PS | PDF)

--------

For HTML you replace the to-text styling tool with another transform phase. Instead of hyper-linking the in-text citations to entries in the master database, I would make them internal links to the long citation in the bibliography. (If the bibliography knowledge base is a public or corporate resource, sophisticated links might go from there to the bibliography knowledge base browser ... or whatever.)

In version _n_ of the software I would want to replace the [1-5] --> [1,2,3,4,5] with something more sophisticated:

[1-5] -link-> reveal a new window with options for 1,2,..,5 -link-> internal long citation.

The long references by release _n_ might also have complex linking options.

long reference [1] -link-> 1st reference to me, 2nd, .. ,n -link-> goto selected reference.
long reference [1] -link-> goto entry for this reference in the bibliographic database.
long reference [1].author-x -link-> return works with contributor-x in the bibliographic database (e.g. search web for contributor-x).
ETC.

> The whole bibliography is written to a valid SGML document which can
> be incorporated as an external entity into the original document.

Ok

> The original document is then processed with the tweaked DocBook
> stylesheets. They take care to specially format the RefDB
> bibliographies.
> The in-text citations are pulled from the bibliography
> entry labels via the xref mechanism. The bibliography itself is
> formatted according to the values of up to 600 variables (in real
> life, rarely more than a dozen are used at a time). These values, which
> are based on the bibliography style, are exported by the bibliography
> tool and are fed into OpenJade by a helper script. This dirty trick
> frees us from having to provide one customized stylesheet per journal.

Cool. How much of a practical advantage is it to trade in style manual transform programs for that many variables?

What do you mean by helper script? Does it just put the values for the variables in the OJ command line?

DSSSList info and archive: http://www.mulberrytech.com/dsssl/dssslist