Lehigh University

Daniel P. Lopresti:  Noisy Text Data



A recent paper I wrote, now under submission, examines the impact of OCR errors on downstream text analysis. For this study I constructed a large collection of scanned pages, which I am making available to the international research community to help foster similar studies. The data set is derived from the well-known Reuters-21578 news corpus.

"Split-000" of the Reuters corpus was first filtered to remove articles that consist primarily of tabular data that would be inappropriate to parse using NLP techniques (see “Medium-Independent Table Detection” (PDF 301 kbytes)). Then I formatted each of the remaining articles as a single page typeset in Times-Roman 12-point font. In doing so, I discarded articles that were either too long to fit on a single page or too short to provide a good test case (fewer than 50 words). Of the 925 articles in the original set, 661 remained after these various criteria were applied.

These pages were then printed on a Ricoh Aficio digital photocopier and scanned back in on the same machine at 300 dpi bitonal, using the copier's automatic sheet feeder. One set of pages was scanned as-is ("orig"). Two further sets were first photocopied through one and two generations with the contrast at the darkest possible setting ("dark1" and "dark2"), and two more were similarly photocopied through one and two generations at the lightest possible setting ("light1" and "light2") before scanning. This resulted in a test set totaling 3,305 pages (661 articles in each of five conditions). I then ran the resulting bitmap images through the Tesseract open source OCR package.
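
As a quick sanity check on the page count, the scanning conditions can be enumerated as in the sketch below. The condition labels and the article count come from the description above; the function and variable names are illustrative only.

```python
# Scanning conditions named in the description above.
CONDITIONS = ["orig", "dark1", "dark2", "light1", "light2"]

# Articles retained after filtering (from the description above).
NUM_ARTICLES = 661


def total_pages(num_articles, conditions):
    """Each retained article yields one scanned page per condition."""
    return num_articles * len(conditions)


print(total_pages(NUM_ARTICLES, CONDITIONS))  # prints 3305
```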

The NLP stages I employed consist of:
  1. Sentence boundary detection using the MXTERMINATOR package by Reynar and Ratnaparkhi.
  2. Tokenization using the Penn Treebank tokenizer.
  3. Part-of-speech tagging using Ratnaparkhi's MXPOST.

Here's an example of a page from the collection (derived from Page 1 of Split-000 of Reuters-21578):

The complete set of 3,305 pages is broken into "batches" of 100 pages each (except for the last, which contains the final 61 pages), and further subdivided by the five types of source document (original page, first- and second-generation dark photocopies, and first- and second-generation light photocopies). These are then collected into TAR archives and compressed using gzip. Click on the links below to download the indicated archives. Note that each of these sets is between 3 and 5 megabytes in size.

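
The batch layout can be sketched as follows. This assumes one reading of the description above: the 661 pages of each source-document type are split into consecutive batches of 100, so the final batch holds 61 pages. The function name `batch_sizes` is hypothetical and not part of the data set.

```python
def batch_sizes(num_pages, batch_size=100):
    """Split num_pages into consecutive batches; the last one may be short."""
    full, remainder = divmod(num_pages, batch_size)
    sizes = [batch_size] * full
    if remainder:
        sizes.append(remainder)
    return sizes


# Assumption: batching is applied to the 661 pages of each document type.
sizes = batch_sizes(661)
print(len(sizes), sizes[-1])  # prints: 7 61
```

Under that assumption, each of the five document types would contribute seven archives.
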
The corresponding ground-truth files for each batch are listed below. Note that the approach I use in my work is relative. That is, there is no universal ground-truth; rather, I compare the performance of the various text analysis stages on clean and noisy inputs. An "error" is considered to have occurred when the two sets of results differ. There may in fact already be errors present in the results, even for clean inputs.
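
To make the relative notion of "error" concrete, here is a minimal sketch of one possible way to count disagreements between the results for a clean page and its noisy counterpart. It is not necessarily the alignment procedure used in the paper. The sketch assumes each result is a list of (token, tag) pairs and uses Python's standard `difflib` module to align the two sequences; the function name and the sample data are made up for illustration.

```python
from difflib import SequenceMatcher


def count_differences(clean_result, noisy_result):
    """Count aligned positions where the clean and noisy results disagree.

    Both arguments are lists of hashable items, e.g. (token, tag) pairs.
    A non-matching region counts as the longer of its two sides.
    """
    matcher = SequenceMatcher(a=clean_result, b=noisy_result, autojunk=False)
    errors = 0
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            errors += max(i2 - i1, j2 - j1)
    return errors


# Made-up example: one OCR substitution ("bank" read as "hank").
clean = [("The", "DT"), ("bank", "NN"), ("rose", "VBD")]
noisy = [("The", "DT"), ("hank", "NN"), ("rose", "VBD")]
print(count_differences(clean, noisy))  # prints 1
```

Because the clean result serves as the reference, a count of zero means only that the two runs agree, not that the analysis itself is correct.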



© 2004 P.C. Rossin College of Engineering & Applied Science
Computer Science & Engineering, Packard Laboratory, Lehigh University, Bethlehem PA 18015