Full Paper (25 pages)
Baoning Wu and Brian D. Davison
Link farm spam and replicated pages can greatly degrade link-based ranking algorithms like HITS. To identify and neutralize link farm spam and replicated pages, we look for sufficient material copied from one page to another. In particular, we focus on the use of "complete hyperlinks", which distinguish link targets by the anchor text used. We build and analyze the bipartite graph of documents and their complete hyperlinks to find pages that share anchor text and link targets. Link farms and replicated pages are identified in this process, permitting the influence of problematic links to be reduced in a weighted adjacency matrix. Experiments and user evaluation show significant improvement in the quality of results produced using HITS-like methods.
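The idea of pages sharing complete hyperlinks can be illustrated with a minimal Python sketch. It is not the algorithm from the report: the function names, the lowercase normalization of anchor text, the `min_shared` threshold, and the all-or-nothing `penalty` weight are assumptions made for illustration only. A complete hyperlink is modeled as an (anchor text, target URL) pair, and the bipartite graph is stored as a map from each complete hyperlink to the pages that contain it.

```python
from collections import defaultdict
from itertools import combinations


def shared_complete_links(pages, min_shared=2):
    """Return page pairs that share at least `min_shared` complete hyperlinks.

    pages: dict mapping a page URL to an iterable of (anchor_text, target_url).
    min_shared is a hypothetical threshold, not a value from the report.
    """
    # Bipartite graph, indexed from the hyperlink side:
    # complete hyperlink -> set of pages that contain it.
    link_to_pages = defaultdict(set)
    for page, links in pages.items():
        for anchor, target in links:
            # Assumption: anchor text is compared case-insensitively.
            link_to_pages[(anchor.strip().lower(), target)].add(page)

    # Count how many complete hyperlinks each pair of pages has in common.
    shared = defaultdict(int)
    for docs in link_to_pages.values():
        for a, b in combinations(sorted(docs), 2):
            shared[(a, b)] += 1

    return {pair: n for pair, n in shared.items() if n >= min_shared}


def link_weights(pages, flagged_pairs, penalty=0.0):
    """Build sparse adjacency weights, down-weighting links from flagged pages.

    Assumption: every link on a flagged page gets weight `penalty`;
    all other links keep weight 1.0.
    """
    suspicious = {page for pair in flagged_pairs for page in pair}
    weights = {}
    for page, links in pages.items():
        for _anchor, target in links:
            weights[(page, target)] = penalty if page in suspicious else 1.0
    return weights
```

The resulting weights could then stand in for the 0/1 adjacency matrix in a HITS-style iteration, so that links from suspected link farms or replicas contribute less to authority scores.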
Technical Report LU-CSE-06-007, Dept. of Computer Science and Engineering, Lehigh University, April 2006.
This report replaces LU-CSE-05-014. It notes that our approach to finding bipartite components is an approximation, and it also compares performance to a simplified method. An abridged version of this report was published in the Proceedings of the 21st ACM Symposium on Applied Computing. Please cite that version instead.