My research has spanned a number
of different areas over the years, but generally has been driven by
fundamental issues arising in pattern recognition and its application
to real-world problems.
Electronic Voting Systems
For a summary of some of my recent work on electronic voting, please
visit the PERFECT project website
here.
Noisy Text Analytics
Errors are unavoidable in advanced
computer vision applications such as optical character recognition
(OCR), and the noise induced by these errors presents a serious
challenge to downstream processes that attempt to make use of such
data. Some of my work involves developing techniques to measure the
impact of recognition errors on the NLP stages of a standard text
analysis pipeline: sentence boundary detection, tokenization, and
part-of-speech tagging. I have developed a methodology that formulates
error classification as an optimization problem solvable using a
hierarchical dynamic programming approach, and used this technique to
analylze OCR errors and their cascading effects as they travel through
the pipeline. Listed below are papers that describe this work:
“Measuring the Impact of Character Recognition Errors on Downstream
Text Analysis,” D. Lopresti,
Proceedings
of Document Recognition and Retrieval XV (IS&T/SPIE International
Symposium on Electronic Imaging), January 2008.
“Performance Evaluation for Text Processing of Noisy Inputs,” D.
Lopresti,
Proceedings of the 20th
Annual ACM Symposium on Applied Computing (Document Engineering Track),
March 2005, Santa Fe, NM, pp. 759-763. (
Abstract)
(
PDF
83 kbytes)
“Summarizing Noisy Documents” (with H. Jing and C. Shih),
Proceedings
of the Symposium on Document Image Understanding Technology, April
2003, Greenbelt, MD, pp. 111-119. (
PDF
97 kbytes)
I am also co-chair of the Workshops on Analytics for Noisy Unstructured
Text Data. The first workshop,
AND 2007, was held in
Hyderabad India in January 2007 in conjunction with the Twentieth
International Joint Conference on Artificial Intelligence (IJCAI). The
second workshop,
AND
2008, will be held in Singapore in July 2008 in conjunction with
the Thirty-first Annual International ACM SIG-IR Conference.
A noisy text dataset I have made available to the international
research community can be found
here.
Bioinformatics
My thesis work at Princeton under Dick Lipton (now at Georgia Tech) was
some of the earliest in the area of parallel algorithms for the rapid
comparison of genetic
sequences. I built the first systolic array for solving a
combinatorial problem, P-NAC (the "Princeton Nucleic Acid
Comparator"). At Brown I continued this research with my Ph.D.
student Richard Hughey (now at UC Santa Cruz), and also co-taught a
seminar on "Computational
Aspects of Molecular and Population Biology" with Lisa Brooks of the
Biology Department. Consulting at the Supercomputing Research
Center, I implemented sequence comparison on SPLASH, a massively
parallel reconfigurable logic array (FPGA) architecture, work that was
awarded Honorable Mention in 1989 Gordon Bell Prize competition.
Microphotograph of P-NAC
Three of my more recent theoretical results also have potential
applications in bioinformatics. Working with Andrew Tomkins (now
at IBM Almaden), I introduced the notion of
block edit distance, which imposes an added layer of structure on top
of traditional string matching; two strings are compared by optimally
extracting sets of substrings and placing them into
correspondence. These could, for example, represent biological
"motifs" that are shared between two genetic sequences, but that appear
in an unknown order. Depending on whether the strings must be
matched
in their entireties and/or whether overlap is forbidden between the
substrings,
we showed that the problem is either NP-complete or solvable in
polynomial time using our algorithms. (A paper describing this
work can be found
here.)
Cross-domain approximate string matching, work I did with Gordon
Wilfong at Bell Labs, extends traditional string matching in another
way, by
allowing strings to be specified compared under fundamentally
different edit models that possess their own distinct alphabets and
cost functions, which we call domains.
Transcription rules are used to map the strings from one domain into
another. The genetic code is one such transcription process, mapping
from triples of nucleotides (codons) to the amino acids that make up
proteins. Given two RNA sequences, one might naturally wonder how
similar the proteins they code for are. Taking this one step
further, it is easy to imagine each of the RNA sequences first
undergoing pre-processing to correct for possible "noise" effects that
may have occurred for any of a number of reasons, biological or
otherwise (e.g., errors in reading the sequencing gels). Our
work formulates this computation as a unified optimization problem and
solves it using a new dynamic programming algorithm. (A paper
describing this work can be found
here.)
Gordon and I have also recently started studying an easy-to-compute
measure for quantifying the similarity between two arbitrary graphs,
which we call graph probing distance. We have proven that this
distance yields a lower bound on the true edit distance between any two
graphs, and have conducted experiments that suggest it provides a
reliable approximation to graph edit distance. Moreover, our
technique
is very fast; graphs with tens or hundreds of thousands of vertices can
be compared in mere seconds. Since certain problems in
bioinformatics involve comparing large graphs, it seems quite possible
that our results could be applied there.
Security Applications Involving Speech
With the current broad-based interest in security, speech is becoming a
popular biometric. Within the past year I have initiated two
collaborations on unconventional uses for speech.
The first of these involves algorithmic techniques for distinguishing
between humans and machines, a problem sometimes called the "Reverse
Turing Test" (RTT) or "Human Interactive Proof" (HIP). The growth
of the Internet as a medium for distributing valuable content has made
it an attractive target for hackers who write malicious programs
("bots") that attempt to exploit online services. As a result,
service providers have begun instituting automatic methods for telling
whether the entities attempting to access their websites are humans or
machines. Current RTT's are predominately graphical.
Realizing that speech-based services are proliferating and that
sensitive applications such as bank-by-phone could someday come under
attack by bots built to navigate spoken language interfaces, I have
adapted the notion of an RTT to the speech domain working with Chilin
Shih (now at University of Illinois, Urbana) and Greg Kochanski (now at
Oxford). (A paper describing this work can be found
here.)
In another security-related application, I have begun working with
Fabian Monrose (now at Johns Hopkins), Mike Reiter (now at Carnegie
Mellon), Peter Li (now at Li Creative Technologies), and Susanne Wetzel
(now at Stevens Institute of Technology) on a concept for generating
cryptographically secure keys from speech. There
are two basic methodologies for attacking such a system. The
standard approach is to attempt to reconstruct the key, perhaps with
the help of cryptanalysis techniques. A more novel line of attack
is to try to synthesize the passphrase based on surreptitious audio
recordings of the user speaking other material. Recognizing this
fact, I have been attempting to characterize the potential
effectiveness of such attacks. Doing so will help us achieve a
better understanding of the security of the proposed method and develop
ways of making it stronger. (A paper describing this work can be
found
here.)
Document Analysis & Digital
Libraries
My research in document analysis
relates to making intelligent use of
the information that is contained in paper and electronic
documents. I am interested in developing techniques that are
robust in the presence of errors that invariably arise during
recognition processes (both low- and high-level), with a variety of
potential end-user applications including retrieval, summarization,
data mining, transcoding into speech, etc.
Tables, for example, are an
important means for communicating structured data. Well-behaved
tables can be viewed as relations in the database sense of the term,
but most often the real world is not this accommodating. With
colleagues Jianying Hu (now at IBM Yorktown Heights), Ram Kashi (now at
Avaya Research), and Gordon Wilfong, I have studied
ways of locating and extracting tabular information so that it can be
reformulated for other uses (e.g., for querying via a spoken language
interface).
There is also an intriguing
connection between this area and my interest in bioinformatics, as
researchers have begun to consider ways of integrating the information
contained in many disparate written sources (journal articles, database
annotations, MEDLINE entries, etc.) that all make mention of, say, the
same gene.
Performance evaluation has been
another focus of my work in recent years, as has the fundamental nature
of ground-truth, the reference by which recognition results are
judged. In a recent paper, George Nagy (at Rensselaer Polytechnic
Institute) and I proposed that user
bias, context, and data-entry tools can play significant roles in
determining truth, and that allowances should be made for more than one
admissible ground-truth. (A paper describing this work can be
found here.)
The World Wide Web
The World Wide Web is, of course,
the largest distributed collection
of documents in the history of civilization, and has given rise to a
diverse set of new problems. I have looked at three in
particular. The first involves the extraction of text embedded in
color GIF and JPEG images, information which is overlooked by current
commercial search engines. Indeed, Jiangying Zhou (now at Summus
Ltd.) and I authored
the first papers on this subject beginning in 1996.
Another topic of interest to me is
the comparison of the graph structures induced by hyperlinked
documents, an application of the graph probing paradigm I developed
with Gordon Wilfong. Possible applications include
query-by-structure, information extraction, "wrapper" maintenance, etc.
Very recently, I have begun
exploring a framework for the real-time adaptive delivery of web images
to resource-constrained devices (e.g., PDA's on a wireless network)
working with Yunnan Wu, a current graduate student at Princeton.
Our approach integrates techniques from image analysis, data
compression, rate-distortion optimization,
and user interaction, yielding a scaleable representation that provides
the best possible image for a given screen size and network bandwidth
("bit budget"). (A paper describing this work can be found here.)