Henry S. Baird
Research on High-Performance
Image Understanding Systems
The most promising strategy for improving the performance of image understanding systems
by the orders of magnitude that are needed is, I believe, to aim for versatility first.
For decades the machine vision R&D community has optimized for high speed,
and for high accuracy on some fraction (often only a small one) of the input images,
but only later --- if at all --- for versatility, by which I mean guaranteed competence over a
broad and precisely specified class of images.
As a result, vision technologies still fall far short of both human abilities and users' needs:
they are overspecialized, brittle, unreliable, and improving only with painful slowness.
A versatility-first vision research program begins when we select a broad, challenging
family of images: an example from the application domain of Digital Libraries
might be all printed documents potentially containing any of many languages, scripts,
page layout styles, and image qualities; an example from Biomedicine
might be photomicrographs of certain tissue types. Then, we investigate ways to:
- capture as much as possible of these images' variety in a formal generative
(often stochastic) model that combines several submodels, e.g. of image quality or of cell-boundary shapes
(this is a critically important step requiring both analytical rigor and sophisticated statistical modeling);
- develop methods for inferring the parameters of such models from
labeled training data (this can be difficult even though there is a large relevant literature;
but once it is done, the system can be retrained, or `retargeted', by clerical labor);
- design provably optimal recognition algorithms,
for each submodel, and for the system as a whole,
for best possible results w.r.t. the models (an intellectual challenge but sometimes doable);
- (only then) reduce runtimes to practical levels, taking care to avoid any loss of generality
(this may require inventions but is almost always possible, in my experience);
- organize the system to adapt its model parameters to unlabeled test data,
on the fly, and so retrain itself with a minimum of manual assistance (recent progress on this at
RPI, Bell Labs, and PARC); and
- construct `anytime' recognition systems which, when allowed to run
indefinitely, are guaranteed to improve accuracy monotonically to the best achievable, i.e.
consistent with the Bayes error of the problem (a daunting, exciting, as yet almost untouched problem domain).
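As a minimal sketch of the first three steps above, consider a toy bilevel-character problem: a generative degradation submodel (independent per-pixel flips, a crude stand-in for a realistic image-quality model), its single parameter inferred from labeled training data, and a decision rule that is optimal with respect to that model. The templates, the noise model, and all names here are illustrative assumptions, not any fielded system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ideal templates for two bilevel 4x4 "characters".
TEMPLATES = {
    "I": np.array([[0, 1, 1, 0]] * 4),
    "O": np.array([[1, 1, 1, 1],
                   [1, 0, 0, 1],
                   [1, 0, 0, 1],
                   [1, 1, 1, 1]]),
}

def degrade(template, flip_prob, rng):
    """Generative degradation submodel: flip each pixel independently."""
    flips = rng.random(template.shape) < flip_prob
    return np.where(flips, 1 - template, template)

def fit_flip_prob(labeled_images):
    """Infer the model's noise parameter from labeled training data
    by counting pixel disagreements with the known ideal templates."""
    mismatches = total = 0
    for label, img in labeled_images:
        mismatches += np.sum(img != TEMPLATES[label])
        total += img.size
    return mismatches / total

def classify(img, flip_prob):
    """Maximum-likelihood decision (Bayes-optimal under uniform priors)
    w.r.t. the per-pixel Bernoulli flip model."""
    def log_lik(t):
        d = np.sum(img != t)
        return d * np.log(flip_prob) + (img.size - d) * np.log(1 - flip_prob)
    return max(TEMPLATES, key=lambda c: log_lik(TEMPLATES[c]))

# Synthesize labeled data from the generative model itself, then
# retrain ("retarget") and evaluate.
train = [(c, degrade(t, 0.1, rng)) for c, t in TEMPLATES.items() for _ in range(200)]
p_hat = fit_flip_prob(train)
test = [(c, degrade(t, 0.1, rng)) for c, t in TEMPLATES.items() for _ in range(100)]
acc = np.mean([classify(img, p_hat) == c for c, img in test])
```

Note that the decision rule is optimal only with respect to the model; its real-world accuracy depends entirely on how faithfully the generative model captures the images' variety.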
My experience inventing, building, testing, patenting, and applying systems of
this type has convinced me of their promise --- successes so far include:
- a world record in accuracy (99.995% characters correct)
achieved by exploiting semantic as well as syntactic models of
image content (w/ Ken Thompson);
- a page reader that is quickly and easily `retargetable' to new languages
including Japanese, Bulgarian, and Tibetan (w/ David Ittner, Tin Ho, & others);
- an automatically self-correcting classifier that cuts its own error rate by large factors
without retraining, given merely a single hint (w/ George Nagy);
- a high-accuracy tabular-data reader that, with only fifteen minutes' clerical effort,
can be trained on a new table type; it has been applied to over 400 different forms (w/ Tom Wood & John Shamilian);
- a printed-text recognition technology, trainable with low manual effort, that maintains
uniformly high accuracy over an unprecedentedly broad range of image qualities
(w/ Gary Kopec & Prateek Sarkar); and
- world-class web security technology (CAPTCHAs)
able to block programs (bots, spiders, etc.) from abusing web services,
by means of automated Turing tests that exploit the gap in ability between humans
and machines in reading degraded images of text (w/ Allison Coates, Richard Fateman, Monica Chew, et al.).
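Stripped of the imaging step, a reading-based CAPTCHA reduces to simple server-side bookkeeping around a hard-to-OCR image. The following is a minimal sketch of that protocol; the word list, the placeholder rendering, and all function names are illustrative assumptions (the actual image degradation, the heart of the technique, is elided):

```python
import hashlib
import secrets
import time

WORDS = ["cabbage", "lantern", "whisper"]   # illustrative challenge words

_pending = {}   # challenge id -> (answer hash, expiry time)

def make_challenge(ttl=120):
    """Issue a challenge: pick a word, record its hashed answer with an
    expiry, and (in a real system) render the word as a degraded image
    that humans can read but OCR programs cannot."""
    word = secrets.choice(WORDS)
    cid = secrets.token_hex(8)
    _pending[cid] = (hashlib.sha256(word.encode()).hexdigest(),
                     time.time() + ttl)
    degraded_image = f"<degraded rendering of {word!r}>"  # placeholder
    return cid, degraded_image, word  # word returned here only for the demo

def check_response(cid, response):
    """Verify a user's reading; each challenge id is single-use."""
    entry = _pending.pop(cid, None)
    if entry is None or time.time() > entry[1]:
        return False
    answer_hash, _ = entry
    return hashlib.sha256(response.strip().lower().encode()).hexdigest() == answer_hash
```

The single-use challenge id and the expiry are what keep a program from replaying a human's answer or farming challenges offline.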
I'm acutely aware of the many outstanding technical obstacles: further progress will require a
multidisciplinary research program grounded in theory, algorithms, and experimental systems research.
Relevant disciplines include computer vision, pattern recognition, machine learning, mathematical statistics,
optimization algorithms, image and signal processing, computational geometry,
cognitive science, psychology, and software systems engineering.
Document understanding systems have provided technically fertile ground for versatility-first research;
they have also enjoyed commercial impact in the companies where I worked;
funding for them flows from NSF, ARDA, DARPA, DOD agencies, DOE, FBI, etc.; and
they are relevant to the present intelligence-analysis crisis.
However, it should be clear that the methodologies I am committed to also have important
implications for many other problem domains.
For example, they enable data mining, networking, and distribution
of both legacy and transient multimedia content.
From time to time, my research has taken an astonishing serendipitous turn.
For example, my theoretical investigation of stochastic models of
image degradation, and of their use in mapping the domain of competency of classifiers,
recently allowed the design of some of the world's best Turing-test-like
security protocols able to distinguish machine from human users over the Web.
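That investigation can be caricatured in a few lines: sweep a degradation model's severity parameter, record where a classifier's accuracy remains acceptable, and call that region its domain of competency. The glyphs, the bit-flip degradation, and the 0.95 threshold below are all illustrative assumptions:

```python
import random

random.seed(1)

# Two ideal 16-bit "glyphs" and a nearest-template classifier.
GLYPHS = {"A": tuple(int(b) for b in "0110100111111001"),
          "B": tuple(int(b) for b in "1110100111101110")}

def degrade(bits, p):
    """Stochastic degradation: flip each bit with probability p."""
    return tuple(b ^ (random.random() < p) for b in bits)

def classify(bits):
    """Assign the nearest ideal glyph by Hamming distance."""
    return min(GLYPHS, key=lambda c: sum(x != y for x, y in zip(bits, GLYPHS[c])))

def accuracy_at(p, trials=300):
    hits = sum(classify(degrade(g, p)) == c
               for _ in range(trials) for c, g in GLYPHS.items())
    return hits / (trials * len(GLYPHS))

# Sweep severity; the domain of competency is where accuracy stays high.
curve = {p: accuracy_at(p) for p in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5)}
competent = [p for p, acc in curve.items() if acc >= 0.95]
```

A CAPTCHA designer wants challenge images degraded just beyond the machines' domain of competency while still inside the humans'.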