Daniel P. Lopresti: Noisy Text Data

A recent paper I wrote, now under submission, examines the impact of
OCR errors on a large collection of scanned pages that I constructed
specifically for this task and am making available to the
international research community to help foster similar studies. This
data set is derived from the well-known Reuters-21578 news corpus.
"Split-000" of the Reuters corpus was first filtered to remove articles that consist primarily of tabular data that would be inappropriate to parse using NLP techniques (see “Medium-Independent Table Detection” (PDF 301 kbytes)). Then I formatted each of the remaining articles as a single page typeset in Times-Roman 12-point font. In doing so, I discarded articles that were either too long to fit on a single page or too short to provide a good test case (fewer than 50 words). Of the 925 articles in the original set, 661 remained after these various criteria were applied. These pages were then printed on a Ricoh Aficio digital photocopier and scanned back in using the same machine at a resolution of 300 dpi bitonal using the copier's automatic sheet feeder. One set of pages was scanned as-is ("orig"), another two sets were first photocopied through one and two generations with the contrast set to the darkest possible setting ("dark1" and "dark2"), and two more sets were similarly photocopied through one and two generations at the lightest possible setting ("light1" and "light2") before scanning. This resulted in a test set totaling 3,305 pages. We then ran the resulting bitmap images through the Tesseract open source OCR package. The NLP stages I employed consist of: