Full Paper (6 pages)
Official ACM published version: http://doi.acm.org/10.1145/1141277.1141535
Author's copy: PDF (108KB)
Link farm spam and replicated pages can greatly degrade the results of link-based ranking algorithms such as HITS. To identify and neutralize link farm spam and replicated pages, we look for pairs of pages in which a sufficient amount of material has been copied from one to the other. In particular, we focus on the use of "complete hyperlinks" to distinguish link targets by the anchor text used. We build and analyze the bipartite graph of documents and their complete hyperlinks to find pages that share anchor text and link targets. Link farms and replicated pages are identified in this process, permitting the influence of problematic links to be reduced in a weighted adjacency matrix. Experiments and user evaluations show significant improvement in the quality of results produced using HITS-like methods.
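The idea in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the `min_shared` threshold, and the `penalty` weight are assumptions chosen for illustration, and a complete hyperlink is modeled simply as an (anchor text, target URL) pair.

```python
from collections import defaultdict
from itertools import combinations


def shared_complete_links(pages):
    """Map each pair of pages to the complete hyperlinks they have in common.

    `pages` maps a page URL to an iterable of (anchor_text, target_url) pairs.
    """
    # Bipartite graph: pages on one side, complete hyperlinks on the other.
    link_to_pages = defaultdict(set)
    for page, links in pages.items():
        for complete_link in set(links):
            link_to_pages[complete_link].add(page)

    # Collect the complete hyperlinks shared by each pair of pages.
    shared = defaultdict(set)
    for complete_link, sources in link_to_pages.items():
        for a, b in combinations(sorted(sources), 2):
            shared[(a, b)].add(complete_link)
    return shared


def link_weights(pages, min_shared=2, penalty=0.0):
    """Return {(page, target): weight} for a weighted adjacency matrix.

    Links belonging to a pair of pages that share at least `min_shared`
    complete hyperlinks get `penalty`; all other links get 1.0.
    Both parameters are hypothetical, not values from the paper.
    """
    suspicious = set()
    for (a, b), links in shared_complete_links(pages).items():
        if len(links) >= min_shared:
            for _anchor, target in links:
                suspicious.add((a, target))
                suspicious.add((b, target))

    weights = {}
    for page, links in pages.items():
        for _anchor, target in links:
            edge = (page, target)
            weights[edge] = penalty if edge in suspicious else 1.0
    return weights


# Hypothetical example: pages "a" and "b" duplicate two complete hyperlinks.
example_pages = {
    "a": [("cheap pills", "x"), ("casino", "y")],
    "b": [("cheap pills", "x"), ("casino", "y")],
    "c": [("news", "x")],
}
example_weights = link_weights(example_pages)
```

In this toy input, the links from "a" and "b" are down-weighted because the two pages share two complete hyperlinks, while the link from "c" to "x" keeps full weight because its anchor text differs. The resulting weights could then be fed to a HITS-like computation in place of a 0/1 adjacency matrix.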
In Proceedings of the 21st ACM Symposium on Applied Computing (SAC), pp. 1099-1104, Dijon, France, April 2006.
© ACM, 2006. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.
A longer version of this paper has been released as Technical Report LU-CSE-06-007, Dept. of Computer Science and Engineering, Lehigh University, April 2006. The technical report additionally notes that our approach to finding bipartite components is an approximation, and it compares performance to a simplified method.
Back to Brian Davison's publications