Program Proposal for Ph.D. in Computer Science Voracious Classifiers Michael A. Moll May 8, 2006 Computer Science & Engineering Dept Lehigh University, Bethlehem, PA The Digital Age has not led to the paperless office. In fact the Internet now offers us vast collections of images of paper documents, only a few of which have had their content symbolically encoded. Existing automatic methods for interpreting document images are frustratingly limited: they don't cope with the full diversity of document and content types, and they are incapable of being trained on the vast collections of images being scanned in (by Google, Amazon, and others). Classifiers built on static k-d trees or CARTs may perform well with a few hundred thousand training samples, but choke on the billions of training samples we expect will be necessary for high accuracy. Thus I propose a research program to achieve high speed classification of document image content utilizing extremely large-scale data sets. Regardless of the algorithms or implementations I use, my emphasis will be on versatility: that is, useful classification across the broadest feasible range of document types. For several reasons, k-Nearest Neighbors (kNN) classifiers are attractive; but they suffer from high computational cost. Promising approaches to fast approximate kNN include hashed k-d trees and locality sensitive hashing, which exploit the skewed distributions that real distributions often exhibit. I will investigate these and other approaches and attempt to achieve further speedups exploiting isogeny, which is the tendency of real-world instances of classification problems to be locally highly consistent.