Full Paper (22 pages)
Xiaoguang Qi and Brian D. Davison
Web page classification is important to many tasks in information retrieval and web mining. However, applying traditional textual classifiers to web data often produces unsatisfactory results. Fortunately, hyperlink information provides important clues to the categorization of a web page. In this paper, we propose an improved method that enhances web page classification by utilizing the class information of neighboring pages in the link graph. The categories represented by four kinds of neighbors (parents, children, siblings and spouses) are combined to help classify the page in question. In experiments studying the effect of these factors on our algorithm, we find that the proposed method boosts the classification accuracy of common textual classifiers from around 70% to more than 90% on a large dataset of pages from the Open Directory Project, and outperforms existing algorithms. Unlike prior techniques, our approach utilizes same-host links and can improve classification accuracy even when neighboring pages are unlabeled. Finally, while sibling pages are found to be the most important type of neighbor to use, the other types can also contribute.
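As a rough illustration of the idea in the abstract, the sketch below derives the four neighbor sets from a list of hyperlinks and blends their category distributions with the page's own textual prediction. It is not the algorithm from the report. The neighbor definitions follow the usual link-graph reading (parents link to the page, children are linked from it, siblings share a parent, spouses share a child), which the abstract does not spell out. The function names, the per-type weights, and `self_weight` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def neighbor_sets(links, page):
    """Return the four neighbor sets of `page`.

    `links` is an iterable of (source, target) hyperlink pairs.
    The definitions used here are assumptions, not taken from the report.
    """
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in links:
        out_links[src].add(dst)
        in_links[dst].add(src)

    parents = in_links[page] - {page}    # pages that link to `page`
    children = out_links[page] - {page}  # pages that `page` links to

    siblings = set()                     # other children of the parents
    for parent in parents:
        siblings |= out_links[parent]
    siblings.discard(page)

    spouses = set()                      # other parents of the children
    for child in children:
        spouses |= in_links[child]
    spouses.discard(page)

    return {"parents": parents, "children": children,
            "siblings": siblings, "spouses": spouses}


def combine(own_dist, neighbor_dists, type_weights, self_weight=0.5):
    """Blend a page's own category distribution with its neighbors'.

    `own_dist` maps category -> probability from a textual classifier.
    `neighbor_dists` maps neighbor type -> list of such distributions.
    `type_weights` and `self_weight` are hypothetical tuning parameters.
    """
    scores = defaultdict(float)
    for category, prob in own_dist.items():
        scores[category] += self_weight * prob
    for ntype, dists in neighbor_dists.items():
        if not dists:
            continue
        weight = (1.0 - self_weight) * type_weights.get(ntype, 0.0)
        for dist in dists:
            for category, prob in dist.items():
                # Average over the neighbors of this type.
                scores[category] += weight * prob / len(dists)
    total = sum(scores.values())
    if total == 0:
        return {}
    return {category: s / total for category, s in scores.items()}


links = [("a", "x"), ("a", "y"), ("x", "c"), ("z", "c")]
sets_for_x = neighbor_sets(links, "x")

blended = combine(
    own_dist={"Sports": 0.4, "Arts": 0.6},
    neighbor_dists={"siblings": [{"Sports": 1.0}]},
    type_weights={"siblings": 1.0},
    self_weight=0.5,
)
predicted = max(blended, key=blended.get)
```

In this toy example the textual classifier alone slightly favors "Arts", but the single sibling's category shifts the blended prediction to "Sports". Neighbors without labels could be handled by feeding in a classifier's predicted distribution instead of a known category.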
Technical Report LU-CSE-06-011, Dept. of Computer Science and Engineering, Lehigh University, June 2006.
An updated version of this report was published in the Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM). Please cite that version instead.