Classifiers Without Borders: Incorporating Fielded Text From Neighboring Web Pages

Xiaoguang Qi and Brian D. Davison

Full Paper (8 pages)
Official ACM published version: http://doi.acm.org/10.1145/1390334.1390443
Author's copy: PDF (366KB)

Abstract

Accurate web page classification often depends crucially on information gained from neighboring pages in the local web graph. Prior work has exploited the class labels of nearby pages to improve performance. In contrast, this work uses a weighted combination of the contents of neighbors to generate a better virtual document for classification. In addition, we break pages into fields, finding that a weighted combination of text from the target page and from fields of neighboring pages reduces classification error by more than a third. We demonstrate performance on a large dataset of pages from the Open Directory Project and validate the approach using pages from a Stanford WebBase crawl. Interestingly, we find no value in anchor text and unexpected value in page titles (especially titles of parent pages) in the virtual document.
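
As a rough illustration of the virtual-document idea described in the abstract, the following Python sketch merges the target page's text with weighted text from fields of neighboring pages into a single bag of weighted term counts. It is not the paper's implementation: the function names, the whitespace tokenizer, the (relation, field) keys, and the weight values are all assumptions chosen for illustration, and the zero weight on anchor text only mirrors the abstract's qualitative finding.

```python
from collections import Counter


def tokenize(text):
    """Hypothetical tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def virtual_document(target_text, neighbor_fields, weights, target_weight=1.0):
    """Build weighted term counts for a virtual document.

    target_text     -- text of the page being classified
    neighbor_fields -- iterable of (relation, field, text) tuples,
                       e.g. ("parent", "title", "Jazz Music")
    weights         -- dict mapping (relation, field) to a float weight;
                       missing keys default to 0.0 (field ignored)
    """
    combined = Counter()

    # Terms from the target page itself count with target_weight.
    for term in tokenize(target_text):
        combined[term] += target_weight

    # Terms from each neighbor field are scaled by that field's weight.
    for relation, field, text in neighbor_fields:
        weight = weights.get((relation, field), 0.0)
        if weight == 0.0:
            continue  # a zero weight drops the field entirely
        for term in tokenize(text):
            combined[term] += weight

    return dict(combined)


# Illustrative weights only; these are not values reported in the paper.
example_weights = {
    ("parent", "title"): 0.5,
    ("parent", "anchor"): 0.0,
}

example = virtual_document(
    "jazz guitar lessons",
    [("parent", "title", "Jazz Music"),
     ("parent", "anchor", "click here")],
    example_weights,
)
```

The resulting term-weight mapping could then be passed to any text classifier that accepts weighted bag-of-words features. In practice the weights would be tuned on held-out data rather than set by hand.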

In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 643-650, Singapore, July 2008.

Last modified: 29 July 2008 Brian D. Davison