Looking into the Past to Better Classify Web Spam

Na Dai, Brian D. Davison, and Xiaoguang Qi

Full Paper (8 pages)
Official ACM published version: http://doi.acm.org/10.1145/1531914.1531916
Author's version: PDF (133KB)

Web spamming techniques aim to achieve undeserved rankings in search results. Much research has addressed identifying such spam and neutralizing its influence. However, existing spam detection work considers only current information. We argue that historical web page information may also be important in spam classification. In this paper, we use content features from historical versions of web pages to improve spam classification. We use supervised learning techniques to combine classifiers based on current page content with classifiers based on temporal features. Experiments on the WEBSPAM-UK2007 dataset show that our approach improves spam classification F-measure performance by 30% compared to a baseline classifier that considers only current page content.
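
As a rough illustration of the idea in the abstract, the sketch below combines two base-classifier scores per page (one from current content, one from temporal features) with a small logistic-regression meta-classifier and compares F-measure against a current-content-only baseline. Everything here is an assumption made for illustration: the scores and labels are invented toy values, the 0.5 baseline threshold is arbitrary, and logistic regression stands in for whatever supervised learners the paper actually uses. It is not the authors' implementation and does not reproduce their results.

```python
import math

def f_measure(y_true, y_pred):
    """F-measure for the spam class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def train_combiner(rows, labels, epochs=2000, lr=0.5):
    """Fit a logistic-regression meta-classifier over base-classifier scores
    using plain stochastic gradient descent (hypothetical stand-in learner)."""
    w = [0.0] * len(rows[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Label a page as spam (1) when the combined score is positive."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else 0

# Toy data (invented): (current_content_score, temporal_score) per page.
rows = [(0.9, 0.8), (0.4, 0.9), (0.3, 0.8), (0.6, 0.1), (0.2, 0.2), (0.1, 0.3)]
labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = non-spam

# Baseline: threshold the current-content score alone.
baseline_pred = [1 if current > 0.5 else 0 for current, _ in rows]
baseline_f = f_measure(labels, baseline_pred)

# Combined: meta-classifier over both scores. It is scored on its own
# training data here only to keep the sketch short; a real evaluation
# would use held-out pages or cross-validation.
w, b = train_combiner(rows, labels)
combined_pred = [predict(w, b, x) for x in rows]
combined_f = f_measure(labels, combined_pred)

print(f"baseline F-measure: {baseline_f:.2f}")
print(f"combined F-measure: {combined_f:.2f}")
```

On this toy data the temporal score separates pages that the current-content score alone mislabels, which is the kind of complementary signal the abstract argues for.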

In Proceedings of the 5th International Workshop on Adversarial Information Retrieval on the Web (AIRWeb), pages 1-8, Madrid, Spain, April 2009. ACM Press.

© ACM, 2009. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.

Last modified: 5 May 2009
Brian D. Davison