Lehigh University



Daniel P. Lopresti:  Noisy Text

Errors are unavoidable in advanced computer vision applications such as optical character recognition (OCR), and the noise induced by these errors presents a serious challenge to downstream processes that attempt to make use of such data. Some of my work involves developing techniques to measure the impact of recognition errors on the NLP stages of a standard text analysis pipeline: sentence boundary detection, tokenization, and part-of-speech tagging. I have developed a methodology that formulates error classification as an optimization problem solvable using a hierarchical dynamic programming approach, and used this technique to analyze OCR errors and their cascading effects as they travel through the pipeline. Listed below are papers that describe this work:
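At the core of this error-classification machinery is dynamic-programming alignment of OCR output against the ground truth. The papers below develop a hierarchical formulation; purely as an illustration, here is a minimal single-level sketch that computes an edit-distance alignment and classifies each error as a substitution, insertion, or deletion:

```python
def align(ref, ocr):
    """Edit-distance DP alignment of ground truth (ref) against OCR output.

    Returns (distance, ops), where ops lists each error as a
    ("sub", ref_char, ocr_char), ("ins", None, ocr_char), or
    ("del", ref_char, None) tuple. A single-level sketch only; the
    papers listed below use a hierarchical version of this idea.
    """
    m, n = len(ref), len(ocr)
    # d[i][j] = minimum edits to turn ref[:i] into ocr[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == ocr[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # match / substitution
    # Backtrace to classify the errors along an optimal path.
    ops, i, j = [], m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                d[i][j] == d[i - 1][j - 1] + (0 if ref[i - 1] == ocr[j - 1] else 1)):
            if ref[i - 1] != ocr[j - 1]:
                ops.append(("sub", ref[i - 1], ocr[j - 1]))
            i, j = i - 1, j - 1
        elif j > 0 and d[i][j] == d[i][j - 1] + 1:
            ops.append(("ins", None, ocr[j - 1]))
            j -= 1
        else:
            ops.append(("del", ref[i - 1], None))
            i -= 1
    return d[m][n], list(reversed(ops))
```

For example, aligning the truth "cleanup" against the OCR output "c1eanup" yields distance 1 with a single substitution of "l" by "1", a classic OCR confusion.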

“Optical Character Recognition Errors and Their Effects on Natural Language Processing,” D. Lopresti, Proceedings of the ACM SIGIR Workshop on Analytics for Noisy Unstructured Text Data, July 2008, Singapore, pp. 9-16.

“Measuring the Impact of Character Recognition Errors on Downstream Text Analysis,” D. Lopresti, Proceedings of Document Recognition and Retrieval XV (IS&T/SPIE International Symposium on Electronic Imaging), January 2008.

“Performance Evaluation for Text Processing of Noisy Inputs,” D. Lopresti, Proceedings of the 20th Annual ACM Symposium on Applied Computing (Document Engineering Track), March 2005, Santa Fe, NM, pp. 759-763.

“Summarizing Noisy Documents” (with H. Jing and C. Shih), Proceedings of the Symposium on Document Image Understanding Technology, April 2003, Greenbelt, MD, pp. 111-119.

I am also co-chair of the Workshops on Analytics for Noisy Unstructured Text Data. The first workshop, AND 2007, was held in Hyderabad, India, in January 2007 in conjunction with the Twentieth International Joint Conference on Artificial Intelligence (IJCAI). The second workshop, AND 2008, was held in Singapore in July 2008 in conjunction with the Thirty-first Annual International ACM SIGIR Conference. The third workshop, AND 2009, was held in Barcelona in July 2009 in conjunction with the Tenth International Conference on Document Analysis and Recognition (ICDAR).

Noisy Text Dataset

A recent paper of mine, presented at the 2008 AND Workshop, examines the impact of OCR errors on a large collection of scanned pages that I constructed specifically for this task and am making available to the international research community to help foster similar studies. This data set is derived from the well-known Reuters-21578 news corpus.

“Split-000” of the Reuters corpus was first filtered to remove articles consisting primarily of tabular data, which would be inappropriate to parse using NLP techniques (see “Medium-Independent Table Detection”). I then formatted each of the remaining articles as a single page typeset in 12-point Times Roman. In doing so, I discarded articles that were either too long to fit on a single page or too short to provide a good test case (fewer than 50 words). Of the 925 articles in the original set, 661 remained after these criteria were applied.
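The selection criteria above can be summarized as a simple filter. In this sketch, `is_tabular` and `fits_on_one_page` are hypothetical stand-ins for the external table-detection and typesetting checks described in the text:

```python
MIN_WORDS = 50  # articles with fewer words were discarded as too short

def keep_article(text, fits_on_one_page, is_tabular):
    """Illustrative filter mirroring the selection criteria above.

    `fits_on_one_page` and `is_tabular` are assumed to come from the
    external typesetting and table-detection steps (hypothetical names).
    """
    if is_tabular:               # primarily tabular data: drop
        return False
    if not fits_on_one_page:     # too long for a single page: drop
        return False
    return len(text.split()) >= MIN_WORDS  # too short: drop
```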

These pages were then printed on a Ricoh Aficio digital photocopier and scanned back in on the same machine at a resolution of 300 dpi, bitonal, using the copier's automatic sheet feeder. One set of pages was scanned as-is ("orig"); two more sets were first photocopied through one and two generations with the contrast at the darkest possible setting ("dark1" and "dark2"); and two further sets were similarly photocopied through one and two generations at the lightest possible setting ("light1" and "light2") before scanning. This yielded a test set totaling 3,305 pages. I then ran the resulting bitmap images through the Tesseract open-source OCR package.
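The resulting corpus layout is easy to summarize in code; the condition names come directly from the text above, while any particular file-naming scheme is left to the downloaded archives themselves:

```python
# The five scanning conditions described above.
CONDITIONS = ["orig", "dark1", "dark2", "light1", "light2"]

# Articles surviving the filtering step.
N_ARTICLES = 661

# One scanned page per surviving article per condition.
total_pages = N_ARTICLES * len(CONDITIONS)  # 661 * 5 = 3,305
```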

The NLP stages I employed consist of:
  1. Sentence boundary detection using the MXTERMINATOR package by Reynar and Ratnaparkhi.
  2. Tokenization using the Penn Treebank tokenizer.
  3. Part-of-speech tagging using Ratnaparkhi's MXPOST.
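These specific tools can be hard to obtain today, so purely as an illustration of the pipeline's shape, here is a toy stand-in for the first two stages. The regexes are naive simplifications (real boundary detection must handle abbreviations, and the real Penn Treebank tokenizer also splits contractions), and stage 3 is omitted because POS tagging requires a trained model such as MXPOST:

```python
import re

def split_sentences(text):
    """Naive stand-in for MXTERMINATOR: break after ., !, or ?
    followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Crude stand-in for the Penn Treebank tokenizer: detach
    punctuation from adjacent words (contraction handling omitted)."""
    return re.sub(r"([.,!?;:()\"])", r" \1 ", sentence).split()
```

Even this toy version shows why OCR noise cascades: a period misrecognized or dropped by the OCR stage directly changes where sentence boundaries are placed, which in turn shifts every downstream token and tag.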
[Figure: an example page from the collection, derived from Page 1 of Split-000 of Reuters-21578.]
The complete set of 3,305 pages is broken into "batches" of 100 pages each, further subdivided by the five types of source document (original page, first- and second-generation dark photocopies, and first- and second-generation light photocopies). These are then collected into TAR archives and compressed using gzip. Click on the links below to download the indicated archives; each is between 3 and 5 megabytes in size. (Note: I have found that certain browsers have the annoying habit of replacing the ".tgz" suffix with a ".tar" suffix, which may cause problems when you try to decompress the set. If this happens, either rename the downloaded archive to have the proper suffix or switch to another browser.)
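If you extract the archives programmatically, the suffix problem noted above can be sidestepped entirely: Python's `tarfile` auto-detects gzip compression when opened with mode `"r:*"`, so a renamed archive still extracts correctly. A small sketch (`extract_batch` is an illustrative helper, not part of the distribution):

```python
import tarfile

def extract_batch(archive_path, dest="."):
    """Extract a downloaded batch archive.

    Mode "r:*" lets tarfile auto-detect the gzip compression, so this
    works even if the browser renamed the ".tgz" suffix to ".tar".
    """
    with tarfile.open(archive_path, mode="r:*") as tar:
        tar.extractall(path=dest)
```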


© 2004 P.C. Rossin College of Engineering & Applied Science
Computer Science & Engineering, Packard Laboratory, Lehigh University, Bethlehem PA 18015