Knowing a Web Page by the Company It Keeps

Xiaoguang Qi and Brian D. Davison

Full Paper (10 pages)
Official ACM published version: http://dx.doi.org/10.1145/1183614.1183650
Author's version: PDF (447KB)

Abstract

Web page classification is important to many tasks in information retrieval and web mining. However, applying traditional textual classifiers to web data often produces unsatisfactory results. Fortunately, hyperlink information provides important clues to the categorization of a web page. In this paper, an improved method is proposed to enhance web page classification by utilizing the class information from neighboring pages in the link graph. The categories represented by four kinds of neighbors (parents, children, siblings and spouses) are combined to help classify the page in question. In experiments to study the effect of these factors on our algorithm, we find that the proposed method boosts the classification accuracy of common textual classifiers from around 70% to more than 90% on a large dataset of pages from the Open Directory Project, and outperforms existing algorithms. Unlike prior techniques, our approach utilizes same-host links and can improve classification accuracy even when neighboring pages are unlabeled. Finally, while all neighbor types can contribute, sibling pages are found to be the most important.
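The idea of combining neighbor categories can be illustrated with a minimal Python sketch. It is not the algorithm from the paper: the abstract does not specify how the four neighbor types are combined, so the weighted average below, the function names (neighbors, combine), the weights, and the toy pages are all assumptions made for illustration. Siblings are taken here to be pages that share a parent with the target page, and spouses to be pages that share a child with it. Each page's class distribution is assumed to come from a textual classifier, which is one way to handle neighbors that have no labels.

```python
from collections import defaultdict

def neighbors(links, page):
    """Return the four neighbor sets of `page`; `links` maps a page to its outlinks."""
    inlinks = defaultdict(set)
    for src, dsts in links.items():
        for dst in dsts:
            inlinks[dst].add(src)
    children = set(links.get(page, ()))
    parents = set(inlinks.get(page, ()))
    siblings = {c for p in parents for c in links.get(p, ())}    # share a parent
    spouses = {p for c in children for p in inlinks.get(c, ())}  # share a child
    groups = {"parents": parents, "children": children,
              "siblings": siblings, "spouses": spouses}
    for group in groups.values():
        group.discard(page)  # a page is not its own neighbor
    return groups

def average(dists):
    """Average a list of {label: probability} dicts; empty input gives {}."""
    total = defaultdict(float)
    for dist in dists:
        for label, prob in dist.items():
            total[label] += prob
    return {label: s / len(dists) for label, s in total.items()} if dists else {}

def combine(own, groups, dist_of, type_weights, self_weight):
    """Mix the page's own distribution with per-type neighbor averages (hypothetical scheme)."""
    scores = defaultdict(float)
    for label, prob in own.items():
        scores[label] += self_weight * prob
    for kind, pages in groups.items():
        avg = average([dist_of[q] for q in pages if q in dist_of])
        for label, prob in avg.items():
            scores[label] += (1 - self_weight) * type_weights[kind] * prob
    return max(scores, key=scores.get)

# Toy data (invented): a hub page links to three pages a, b and c.
links = {"hub": ["a", "b", "c"], "a": [], "b": [], "c": []}
dist_of = {  # assumed textual-classifier outputs per page
    "a": {"sports": 0.45, "news": 0.55},
    "b": {"sports": 0.90, "news": 0.10},
    "c": {"sports": 0.90, "news": 0.10},
}
# Assumed weights; siblings get the largest share, echoing the abstract's finding.
type_weights = {"parents": 0.2, "children": 0.2, "siblings": 0.4, "spouses": 0.2}

groups = neighbors(links, "a")
predicted = combine(dist_of["a"], groups, dist_of, type_weights, self_weight=0.4)
```

In this toy setup the text-only distribution for page a slightly favors "news", but its two siblings lean strongly toward "sports", so the combined score selects "sports". In practice the weights would need to be tuned on held-out data rather than chosen by hand.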

In Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM), pages 228-237, Arlington, VA, November 6-11, 2006.


Last modified: 7 July 2011
Brian D. Davison