Full Paper (8 pages)
Official ACM published version: http://doi.acm.org/10.1145/1390334.1390443
Author's copy: PDF (366KB)
Accurate web page classification often depends crucially on information gained from neighboring pages in the local web graph. Prior work has exploited the class labels of nearby pages to improve performance. In contrast, in this work we utilize a weighted combination of the contents of neighbors to generate a better virtual document for classification. In addition, we break pages into fields, finding that a weighted combination of text from the target and fields of neighboring pages is able to reduce classification error by more than a third. We demonstrate performance on a large dataset of pages from the Open Directory Project and validate the approach using pages from a crawl from the Stanford WebBase. Interestingly, we find no value in anchor text and unexpected value in page titles (and especially titles of parent pages) in the virtual document.
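The virtual-document idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the whitespace tokenizer, the per-field term-frequency normalization, the averaging over neighbors of the same type, and all weight values are assumptions chosen only for illustration. The sketch forms a weighted sum of term-frequency vectors taken from fields of the target page and fields of its neighbors; the resulting vector would then be passed to any text classifier.

```python
from collections import Counter, defaultdict


def term_freqs(text):
    """Normalized term frequencies using a naive lowercase whitespace tokenizer (assumption)."""
    tokens = text.lower().split()
    if not tokens:
        return {}
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}


def virtual_document(target_fields, neighbor_fields, weights):
    """Build a weighted term vector from the target page and its neighbors.

    target_fields:   dict mapping field name -> text, e.g. {"title": ..., "body": ...}
    neighbor_fields: dict mapping neighbor type (e.g. "parent") -> list of field dicts
    weights:         dict mapping (source, field) -> weight; missing pairs count as 0
    """
    combined = defaultdict(float)

    def add(text, weight):
        if weight == 0.0:
            return
        for term, tf in term_freqs(text).items():
            combined[term] += weight * tf

    # Contribution of the target page's own fields.
    for field, text in target_fields.items():
        add(text, weights.get(("target", field), 0.0))

    # Contribution of each neighbor type, averaged over the pages of that type.
    for neighbor_type, pages in neighbor_fields.items():
        for page in pages:
            for field, text in page.items():
                add(text, weights.get((neighbor_type, field), 0.0) / len(pages))

    return dict(combined)


# Hypothetical weights: parent titles weighted heavily, anchor text given no weight.
example_weights = {
    ("target", "title"): 1.0,
    ("target", "body"): 1.0,
    ("parent", "title"): 2.0,
}
example_vector = virtual_document(
    {"title": "python tutorial", "body": "learn python"},
    {"parent": [{"title": "programming languages", "anchor": "click here"}]},
    example_weights,
)
```

In practice the weights would be tuned on held-out data rather than set by hand; the values above are placeholders.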
In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 643-650, Singapore, July 2008.
© ACM, 2008. This is the author's version of the work. It is posted here by permission of ACM for your personal use. Not for redistribution.
Back to Brian Davison's publications