Human Performance on Clustering Web Pages
Technical Report (19 pages)
Sofus A. Macskassy, Arunava Banerjee, Brian D. Davison, and Haym Hirsh.
August 1998
Abstract
With the increase in information on the World Wide Web, it has become
difficult to quickly find desired information without using multiple
queries or a topic-specific search engine. One way to aid the search
is to group together HTML pages that appear to be related in some
way. To better understand this task, we performed
an initial study of human clustering of web pages, in the hope that it
would provide some insight into the difficulty of automating this
task. Our results show that subjects did not cluster identically; in
fact, on average, any two subjects had little similarity in their
web-page clusters. We also found that subjects generally created
rather small clusters, and those with access only to URLs created fewer
clusters than those with access to the full text of each web page.
Generally, the overlap of documents between a subject's clusters
increased with access to the full text, as did the percentage of
documents clustered. When analyzing individual subjects, we found
that each behaved differently across queries in terms of
overlap, size of clusters, and number of clusters. These results
provide a sobering note on any quest for a single clearly correct
clustering method for web pages.
Technical Report DCS-TR-355, Department of Computer Science, Rutgers
University.
A shorter version of this paper is available as a
conference paper.
Last modified: 29 August 2000
Brian D. Davison