Fast Approximate Nearest Neighbors

Matthew R. Casey

Submitted May 2006 in partial satisfaction of the requirements for the M.S. degree in Computer Science

Computer Science and Engineering Department
Lehigh University, Bethlehem, Pennsylvania

An investigation into algorithms for fast approximate k-nearest-neighbors (kNN) classification is reported. The principal results are an empirical performance evaluation and analysis of a family of non-adaptive k-d tree data structures sped up by means of hashing techniques. This research is motivated by a compelling need in pattern recognition theory and practice for classification methods that are trainable in reasonable time on large data sets (say, many millions of samples) while allowing extremely fast classification (ideally, at I/O rates). Experiments were carried out on data arising in digital libraries: images of pages of books containing several types of content (machine-print text, handwriting, photographs, etc.). The particular problem was to classify every pixel in these images into one of the content types. Choosing to classify pixels (rather than regions) gave us ready access to ground-truthed data sets of dauntingly large size. The non-adaptive k-d tree we used only approximates true kNN, but it allows, in principle, high-speed search. The growth of the k-d tree data structure on these data was not exponential in the size of the training set, as can happen in the worst case, but a low-order polynomial (approximately cubic) in practically important operating regimes. Inevitably, tradeoffs of accuracy for speed are drastic at the extremes: the coarsest data structures run at brute-force (exhaustive-search) speeds, whereas finer-grained data structures allow speed-ups of several orders of magnitude. Happily, large speed-ups (factors of 100 or more) do not always sacrifice much accuracy. The investigation includes tests of these methods on a variety of data sets, as well as several variations of the classification technique.
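Since the abstract names the technique without detailing it, the following Python sketch is one plausible reading, not the thesis's implementation: a non-adaptive (data-independent) partition of the feature space into grid cells, with each cell addressed by hashing quantized coordinates and classification done by majority vote among the training samples sharing the query's cell. The class name and parameters (HashedGridClassifier, bins_per_dim, and so on) are illustrative, not taken from the thesis.

from collections import Counter, defaultdict

class HashedGridClassifier:
    def __init__(self, lo, hi, bins_per_dim):
        # lo/hi: per-dimension bounds of the feature space, fixed before
        # seeing any data (this is what makes the structure non-adaptive);
        # bins_per_dim: grid resolution, controlling the speed/accuracy
        # tradeoff the abstract describes.
        self.lo, self.hi, self.bins = lo, hi, bins_per_dim
        self.cells = defaultdict(list)  # cell address -> list of labels

    def _address(self, x):
        # Quantize each coordinate into its bin; the tuple of bin indices
        # is the cell's hash key, so locating a cell is pure arithmetic.
        return tuple(
            min(self.bins - 1, max(0, int((xi - l) / (h - l) * self.bins)))
            for xi, l, h in zip(x, self.lo, self.hi)
        )

    def train(self, samples, labels):
        # Training is one hash insertion per sample: no tree balancing,
        # so training time is linear in the number of samples.
        for x, y in zip(samples, labels):
            self.cells[self._address(x)].append(y)

    def classify(self, x):
        # Approximate kNN: vote only among training samples in the query's
        # cell, ignoring possibly closer samples in neighboring cells.
        votes = self.cells.get(self._address(x))
        if not votes:
            return None  # empty cell; a real system might probe neighbors
        return Counter(votes).most_common(1)[0][0]

A toy usage, echoing the pixel-classification task (labels hypothetical):

clf = HashedGridClassifier(lo=[0.0, 0.0], hi=[1.0, 1.0], bins_per_dim=8)
clf.train([(0.1, 0.2), (0.15, 0.22), (0.9, 0.8)], ["text", "text", "photo"])
print(clf.classify((0.12, 0.21)))  # -> "text"

A coarse grid (small bins_per_dim) puts most samples in each visited cell, so voting approaches exhaustive-search cost; a fine grid makes each lookup cheap but risks empty or unrepresentative cells, which is the accuracy-for-speed tradeoff the abstract reports.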