Analysis of Usage Patterns (was Re: In The News)

Subject: Analysis of Usage Patterns (was Re: In The News)
From: Richard Trott <richard.trott@xxxxxxxxxxxxxxxx>
Date: Wed, 26 Jun 2002 14:34:44 -0700 (PDT)
On 26 Jun 2002 digital-copyright-digest-help@xxxxxxxxxxxxxx wrote:

> Evaluation of Digital Library Impact and User Communities by Analysis of
> Usage Patterns
> By Johan Bollen and Rick Luce, D-Lib Magazine, June 2002 Volume 8 Number
> 6
> ISSN 1082-9873
> http://www.dlib.org/dlib/june02/bollen/06bollen.html
>
> "At present, digital library (DL) policy is largely informed by
> management intuition and coarse measures of user satisfaction. Most DLs,
> however, maintain extensive server logs of user retrieval requests that
> contain a wealth of information on user preferences and the structure of
> user retrieval patterns. We propose a quantitative approach to DL
> evaluation that analyzes the retrieval habits of users to assess the
> impact of a collection of documents and to determine the structure of a
> given DL user community. We discuss a system that we have developed to
> automatically generate extensive journal and document networks from an
> efficient and simple analysis of user retrieval sequences registered in
> a particular DL's server logs."
> ------------

Did anyone else read this and find the central assumption problematic?
Specifically, it is declared that "when a user retrieves two documents
within a short period of time, it adds support to the claim that some
level of similarity exists between these documents."  No evidence is given
for this statement.  It is offered simply as common sense, and the rest of
the paper appears dependent upon this premise.  Personally, I don't
believe the premise to be true.

Let's say one is looking for information on Charles Mackay's 1859 writings
about those whom he termed "the slow poisoners".  One might (for example)
fire up www.altavista.com and enter the words "slow poisoners" (without
the quotation marks).  The top site returned is www.slowpoisoners.com
which sounds very promising indeed.  The user follows that link.  There,
they discover that the Web site is for a San Francisco band called the
Slow Poisoners.  The Web page they retrieve contains no information
whatsoever about Charles Mackay or his writings on the subject.  The user
hits the back button, getting a cached version of the altavista search
results.  They go to the next site on the list, which is
www.bootlegbooks.com and the link is to the text of chapter 11 of Mackay's
1859 book.  Chapter 11 is entitled "The Slow Poisoners" and is all about
exactly what the user is looking for.  Bingo.

If the user is using a proxy server, then the logs will show a visit to
altavista, followed quickly by a visit to slowpoisoners.com, followed
quickly by a visit to the specific information at bootlegbooks.com.
Analysis using the techniques described in the paper will result in a
similarity between slowpoisoners.com and the page at bootlegbooks.com
being falsely ascribed.

On the other hand, if the top pages returned were two links that were in
fact similar and of interest to the reader, then the user might very well
spend a lot of time at the first link before moving on to the second link.
The techniques described in the paper would result in a false conclusion
that the two links are not as similar as slowpoisoners.com and the
bootlegbooks.com page.
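
To make the two scenarios concrete, here is a rough sketch of the kind of
rule I am objecting to.  It is not the paper's actual algorithm, only my
simplified reading of its premise: consecutive retrievals by one user that
fall within a short window add weight to the pair.  The 60-second window,
the function name, the timestamps, and the two "relevant" host names are
all made up for illustration.

```python
from collections import defaultdict

# Hypothetical threshold: consecutive retrievals closer together than this
# are treated as evidence that the two documents are similar.
WINDOW_SECONDS = 60


def co_retrieval_weights(session, window=WINDOW_SECONDS):
    """Count short-gap consecutive retrievals for one user's session.

    session is a list of (timestamp_in_seconds, document) tuples in
    retrieval order.  Returns a dict mapping an unordered document pair
    (a frozenset) to its accumulated weight.
    """
    weights = defaultdict(int)
    for (t1, doc1), (t2, doc2) in zip(session, session[1:]):
        if doc1 != doc2 and t2 - t1 <= window:
            weights[frozenset((doc1, doc2))] += 1
    return dict(weights)


# Scenario 1: wrong hit, back button (served from cache, so not logged),
# then the right hit.  Timestamps are invented.
mistaken = [
    (0, "altavista.com"),
    (5, "slowpoisoners.com"),
    (15, "bootlegbooks.com/ch11"),
]

# Scenario 2: two genuinely related pages, but the user reads the first
# one for twenty minutes before moving on.  Host names are placeholders.
careful = [
    (0, "altavista.com"),
    (5, "relevant-a.example"),
    (1205, "relevant-b.example"),
]

print(co_retrieval_weights(mistaken))
print(co_retrieval_weights(careful))
```

Under these assumptions the band's site and Mackay's chapter end up
linked, while the two pages the user actually cared about do not.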

Basically, the technique assumes every document retrieval is significant.
However, in many situations, if retrieving documents is simple (which it
hopefully is), most document retrievals will not be significant.  The user
might quickly skim through a half dozen or more documents that are not
what they are looking for before finding the one that is.  In this
situation, the relationship weight ascribed by the techniques in the paper
will tell you much more about your document search engine and your users'
savvy with that search engine than about the documents themselves.  Or
it might not.  It's difficult (impossible?) to know.  It also might (or
might not) tell you more about the hyperlink structure at your site than
about the relationships of document content.  Again, it is difficult
(perhaps impossible) to know.

If you have a set of documents that are retrieved solely by a method that
allows the user to request, "the paper entitled 'FooBar' from the _Journal
of FooBarOlogy_ volume 3 number 4, by Smith, Trott, and Wesson," then the
system described in the paper may work exceptionally well.  However,
broader application is questionable in my view.

Or am I being naive and missing something crucial?

Rich



