Fast Approximate Nearest Neighbors

Matthew R. Casey

Submitted May 2006 in partial satisfaction of the requirements for the M.S. degree in Computer Science

Computer Science and Engineering Department
Lehigh University, Bethlehem, Pennsylvania

An investigation into algorithms for fast approximate k-nearest-neighbors (kNN) classification is reported. The principal results are an empirical performance evaluation and analysis of a family of non-adaptive k-d tree data structures sped up by means of hashing techniques. This research is motivated by a compelling need in pattern recognition theory and practice for classification methods that are trainable in reasonable time on large data sets (say, many millions of samples) while allowing extremely fast classification (ideally, at I/O rates). Experiments were carried out on data arising in digital libraries: images of pages of books containing several types of content (machine-print text, handwriting, photographs, etc.). The particular problem was to classify every pixel in these images into one of the content types. Choosing to classify pixels (rather than regions) gave us ready access to ground-truthed data sets of dauntingly large size. The non-adaptive k-d tree we used only approximates true kNN, but it allows, in principle, high-speed search. The growth of the k-d tree data structure on these data was not exponential in the size of the training set, as can happen in the worst case, but a low-order polynomial (approximately cubic) in practically important operating regimes. Inevitably, tradeoffs of accuracy for speed are drastic at the extremes: the coarsest data structures run at brute-force (exhaustive-search) speeds, whereas finer-grained data structures allow speed-ups of several orders of magnitude. Happily, large speed-ups (factors of 100 or more) do not always sacrifice much accuracy. The investigation includes tests of these methods on a variety of data sets, as well as several variations of the classification technique.
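Since the abstract names the technique without detailing it, the following Python sketch is one plausible reading, not the thesis's implementation: a non-adaptive (data-independent) partition of the feature space into grid cells, with each cell addressed by hashing quantized coordinates and classification done by majority vote among the training samples sharing the query's cell. The class name and parameters (HashedGridClassifier, bins_per_dim, and so on) are illustrative, not taken from the thesis.

from collections import Counter, defaultdict

class HashedGridClassifier:
    def __init__(self, lo, hi, bins_per_dim):
        # lo/hi: per-dimension bounds of the feature space, fixed before
        # seeing any data (this is what makes the structure non-adaptive);
        # bins_per_dim: grid resolution, controlling the speed/accuracy
        # tradeoff the abstract describes.
        self.lo, self.hi, self.bins = lo, hi, bins_per_dim
        self.cells = defaultdict(list)  # cell address -> list of labels

    def _address(self, x):
        # Quantize each coordinate into its bin; the tuple of bin indices
        # is the cell's hash key, so locating a cell is pure arithmetic.
        return tuple(
            min(self.bins - 1, max(0, int((xi - l) / (h - l) * self.bins)))
            for xi, l, h in zip(x, self.lo, self.hi)
        )

    def train(self, samples, labels):
        # Training is one hash insertion per sample: no tree balancing,
        # so training time is linear in the number of samples.
        for x, y in zip(samples, labels):
            self.cells[self._address(x)].append(y)

    def classify(self, x):
        # Approximate kNN: vote only among training samples in the query's
        # cell, ignoring possibly closer samples in neighboring cells.
        votes = self.cells.get(self._address(x))
        if not votes:
            return None  # empty cell; a real system might probe neighbors
        return Counter(votes).most_common(1)[0][0]

A toy usage, echoing the pixel-classification task (labels hypothetical):

clf = HashedGridClassifier(lo=[0.0, 0.0], hi=[1.0, 1.0], bins_per_dim=8)
clf.train([(0.1, 0.2), (0.15, 0.22), (0.9, 0.8)], ["text", "text", "photo"])
print(clf.classify((0.12, 0.21)))  # -> "text"

A coarse grid (small bins_per_dim) puts most samples in each visited cell, so voting approaches exhaustive-search cost; a fine grid makes each lookup cheap but risks empty or unrepresentative cells, which is the accuracy-for-speed tradeoff the abstract reports.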