Vetting the Links of the Web

Na Dai and Brian D. Davison

Short Paper (4 pages)
Official ACM published version: http://doi.acm.org/10.1145/1645953.1646220
Author's version: PDF (140KB)

Abstract

Many web links mislead human surfers and automated crawlers because they point to changed content, out-of-date information, or invalid URLs. This is a particular problem for large, well-known directories such as the dmoz Open Directory Project (ODP), which maintains links to representative and authoritative external web pages within its various topics. Such sites therefore rely on many editors to manually revisit and revise links that have become out-of-date. To reduce this burden, we propose the novel web mining task of identifying outdated links on the web. We build a general classification model, primarily using local and global temporal features extracted from historical content, topic, link, and time-focused changes. We evaluate our system via five-fold cross-validation on more than fifteen thousand ODP external links selected from thirteen top-level categories. Our system can predict the actions of ODP editors more than 75% of the time. Our models and predictions could be useful for various applications that depend on analysis of web links, including ranking and crawling.
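
For readers unfamiliar with the evaluation protocol named in the abstract, the following is a minimal sketch of five-fold cross-validation for a binary "outdated link" classifier. It is not the authors' system: the single "change score" feature, the toy data, and the threshold model (train_threshold) are hypothetical stand-ins chosen only to keep the example self-contained, whereas the paper's model uses many temporal features.

```python
def k_fold_indices(n, k=5):
    """Yield (train_idx, test_idx) pairs for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size


def train_threshold(xs, ys):
    """Hypothetical one-feature model: choose the threshold t that
    maximizes training accuracy for the rule 'outdated if x >= t'."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(xs)):
        acc = sum((x >= t) == y for x, y in zip(xs, ys)) / len(xs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t


def cross_validate(xs, ys, k=5):
    """Return the mean held-out accuracy over k folds."""
    accuracies = []
    for train, test in k_fold_indices(len(xs), k):
        t = train_threshold([xs[i] for i in train], [ys[i] for i in train])
        correct = sum((xs[i] >= t) == ys[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k


# Toy data (assumed, not from the paper): a per-link change score and
# whether an editor later removed or revised the link.
xs = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.15, 0.85, 0.25, 0.75]
ys = [False, True, False, True, False, True, False, True, False, True]

mean_acc = cross_validate(xs, ys, k=5)
print(f"mean held-out accuracy: {mean_acc:.2f}")
```

Each link is held out exactly once, so the reported figure reflects predictions on links the model did not see during training, which is the sense in which the abstract's "more than 75% of the time" should be read.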

In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 1745-1748, Hong Kong, November 2009. ACM Press.


Last modified: 9 November 2009
Brian D. Davison