Re: Analysis of Usage Patterns (was Re: In The News)

Subject: Re: Analysis of Usage Patterns (was Re: In The News)
From: Johan Bollen <jbollen@xxxxxxxxxx>
Date: Fri, 28 Jun 2002 11:31:24 -0400
Richard,

thanks for the detailed comments on our article "Evaluation of Digital 
Library Impact and User Communities by Analysis of Usage Patterns" that 
appeared in D-Lib Magazine (June 2002).

Rick and I would like to comment briefly on your e-mail.

There is no absolute proof for the assumption that if a user retrieves two 
articles in close temporal proximity, the articles are thereby related to 
some degree. However, it is not so much an assumption about user retrieval 
behavior as it is a rule which we use in our system to generate journal link 
weights, and one that has greatly helped us construct highly meaningful 
document and journal networks. Our data show that this approach works very 
well and produces document networks that validly represent the preferences 
of users.

The methodology we outlined is strongly influenced by machine learning and 
Hebbian learning in particular. Hebbian learning operates on an 
*accumulation* of evidence. One single co-activation of neuron A and B only 
slightly changes their link weight. Many co-activations of neuron A and B 
however exert a larger influence on their link weight.  The same applies to 
our methodology. Although every co-retrieval of two documents is taken into 
account, many are needed to significantly change their particular link weight.
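
As a rough illustration of this accumulation of evidence (a sketch only, not 
the exact procedure from the article), consider the Python fragment below. 
The 300-second window, the fixed increment of 1.0, the symmetric treatment 
of document pairs, and all function and variable names are assumptions made 
for this example.

```python
from collections import defaultdict

def link_weights(log, window=300, increment=1.0):
    """Accumulate document link weights from a retrieval log.

    log: iterable of (user, timestamp_in_seconds, document) tuples.
    window and increment are illustrative values, not taken from the article.
    """
    weights = defaultdict(float)
    last = {}  # user -> (timestamp, document) of that user's previous retrieval
    # Process each user's retrievals in chronological order.
    for user, ts, doc in sorted(log, key=lambda r: (r[0], r[1])):
        prev = last.get(user)
        if prev is not None:
            prev_ts, prev_doc = prev
            # Two different documents retrieved within the window count as
            # one co-retrieval and add a small, fixed amount to their link.
            if prev_doc != doc and ts - prev_ts <= window:
                pair = tuple(sorted((prev_doc, doc)))  # assumed symmetric
                weights[pair] += increment
        last[user] = (ts, doc)
    return dict(weights)

# Fifty users retrieve A and then B; one stray user retrieves A and then X.
log = [(f"u{i}", 0, "A") for i in range(50)]
log += [(f"u{i}", 60, "B") for i in range(50)]
log += [("stray", 0, "A"), ("stray", 30, "X")]

weights = link_weights(log)
print(weights[("A", "B")], weights[("A", "X")])
```

In this toy log the persistent A-B pattern accumulates a weight fifty times 
that of the single A-X co-retrieval.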

Mistakes, errors, etc. can and will occur. However, when they are relatively 
isolated events, they will have no significant impact on network structure. 
The system operates on persistent, large-scale patterns of co-retrieval.

It is definitely true that you cannot use this technique for every existing 
digital information resource, and there are many pitfalls to its generalized 
use on the WWW, some of which you identify. Misleading anchors and labels on 
the WWW may indeed cause problems. However, we have achieved excellent 
results for DL environments. One possible explanation could be that the LANL 
Research Library offers users an abstract and other metadata before they 
actually download the article. In this manner users can determine the 
relevance of a given article to some degree before downloading it. However, 
even without such features, we and others have had good results on 
large web sites.

Regards,

Rick Luce and Johan Bollen.

On Wednesday 26 June 2002 05:34 pm, you wrote:
> On 26 Jun 2002 digital-copyright-digest-help@xxxxxxxxxxxxxx wrote:
> > Evaluation of Digital Library Impact and User Communities by Analysis of
> > Usage Patterns
> > By Johan Bollen and Rick Luce, D-Lib Magazine, June 2002 Volume 8 Number
> > 6
> > ISSN 1082-9873
> > http://www.dlib.org/dlib/june02/bollen/06bollen.html
> >
> > "At present, digital library (DL) policy is largely informed by
> > management intuition and coarse measures of user satisfaction. Most DLs,
> > however, maintain extensive server logs of user retrieval requests that
> > contain a wealth of information on user preferences and the structure of
> > user retrieval patterns. We propose a quantitative approach to DL
> > evaluation that analyzes the retrieval habits of users to assess the
> > impact of a collection of documents and to determine the structure of a
> > given DL user community. We discuss a system that we have developed to
> > automatically generate extensive journal and document networks from an
> > efficient and simple analysis of user retrieval sequences registered in
> > a particular DL's server logs."
> > ------------
>
> Did anyone else read this and find the central assumption problematic?
> Specifically, it is declared that "when a user retrieves two documents
> within a short period of time, it adds support to the claim that some
> level of similarity exists between these documents."  No evidence is given
> for this statement.  It is offered simply as common sense, and the rest of
> the paper appears dependent upon this premise.  Personally, I don't
> believe the premise to be true.
>
> Let's say one is looking for information on Charles Mackay's 1859 writings
> about those whom he termed "the slow poisoners".  One might (for example)
> fire up www.altavista.com and enter the words "slow poisoners" (without
> the quotation marks).  The top site returned is www.slowpoisoners.com
> which sounds very promising indeed.  The user follows that link.  There,
> they discover that the Web site is for a San Francisco band called the
> Slow Poisoners.  The Web page they retrieve contains no information
> whatsoever about Charles Mackay nor his writings on the subject.  The user
> hits the back button, getting a cached version of the altavista search
> results.  They go to the next site on the list, which is
> www.bootlegbooks.com and the link is to the text of chapter 11 of Mackay's
> 1859 book.  Chapter 11 is entitled "The Slow Poisoners" and is all about
> exactly what the user is looking for.  Bingo.
>
> If the user is using a proxy server, then the logs will show a visit to
> altavista, followed quickly by a visit to slowpoisoners.com, followed
> quickly by a visit to the specific information at bootlegbooks.com.
> Analysis using the techniques described in the paper will result in a
> similarity between slowpoisoners.com and the page at bootlegbooks.com
> being falsely ascribed.
>
> On the other hand, if the top pages returned were two links that were in
> fact similar and of interest to the reader, then the user might very well
> spend a lot of time at the first link before moving on to the second link.
> The techniques described in the paper would result in a false conclusion
> that the two links are not as similar as slowpoisoners.com and the
> bootlegbooks.com page.
>
> Basically, the technique assumes every document retrieval is significant.
> However, in many situations, if retrieving documents is simple (which it
> hopefully is), most document retrievals will not be significant.  The user
> might quickly skim through a half dozen or more documents that are not
> what they are looking for before finding the one that is.  In this
> situation, the relationship weight ascribed by the techniques in the paper
> will tell you much more about your document search engine and your users'
> savvy with using that search engine than about the documents themselves.  Or
> it might not.  It's difficult (impossible?) to know.  It also might (or
> might not) tell you more about the hyperlink structure at your site than
> about the relationships of document content.  Again, it is difficult
> (perhaps impossible) to know.
>
> If you have a set of documents that are retrieved solely by a method that
> allows the user to request, "the paper entitled 'FooBar' from the _Journal
> of FooBarOlogy_ volume 3 number 4, by Smith, Trott, and Wesson," then the
> system described in the paper may work exceptionally well.  However,
> broader application is questionable in my view.
>
> Or am I being naive and missing something crucial?
>
> Rich

-- 
***********************************
* Johan Bollen
* Computer Science Department
* Old Dominion University
* Norfolk VA 23538
* tel: 757 683 6392
* URL: http://www.cs.odu.edu/~jbollen
*******************************************
