Lehigh University



Henry S. Baird

Research on Document Images in Digital Libraries

Digital libraries (DLs) promise to offer more people access to larger document collections, and at far greater speed, than physical libraries can. Increasingly, many types of DLs have grown into hybrid collections in which images of paper documents are mixed with encoded data (ASCII, XML, etc.; either "born digital" or accurately keyed-in). Ideally, users of DLs should be able to search, access, examine, and navigate among document images as easily, and in the same way (using the same interfaces and tools to the maximum degree possible), as they routinely can today with encoded data.

Unfortunately, DLs tend to serve poorly many types of non-digital human-legible media, including originally printed and handwritten documents. These media, in their original physical (undigitized) form, are readily (if not always quickly) legible, searchable, and browsable, whereas as document images accessed through DLs they often lose many of their original advantages while of course lacking many advantages of symbolically encoded information. Difficult open technical problems in document image analysis (DIA) arise in the construction and use of DLs, e.g. due to the contrasting advantages of paper and digital displays. They arise at every stage of capture, early processing, recognition, analysis, presentation, and retrieval, and also in personal and interactive applications.

Prof. Lopresti and I are focusing our research on scientific and engineering documents, especially large-format documents containing text, mathematical notation, line-art, and other images, which are particularly awkward to make use of in DLs today. We are fortunate to have access to fascinating Lehigh University archives of exactly this sort, acquired during its long history as one of America's premier engineering education universities. For example, Lehigh's "Digital Bridges" DL is a unique collection of bridge engineering documents and plans going back to the 19th century. Only a small fraction of what is available is currently online, however, due in part to technical challenges we hope to address.

Some highlights of our research agenda:

  1. research into user interactions with displays during reading and browsing;
  2. investigation of goal-directed metrics of document image quality, tied quantitatively to the reliability of downstream processing (both machine and human) of the images;
  3. development of better methods for display and navigation within large document images, using the present generation of electronic displays as well as very large (e.g. wall-sized projection) displays, plus investigation of versatile document-image tiling, alignment, and superposition algorithms;
  4. investigation of the effectiveness of "first OCR, then IR" methods on short passages such as, in an extreme but practically important case, fields containing key metadata (title, author, etc.);
  5. study of the implications of this phenomenon for information extraction from scanned documents in digital libraries;
  6. investigation of methods for searching within and indexing into large documents containing text, line-art, and other images.
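
To illustrate why short metadata fields are an extreme case for "first OCR, then IR" (item 4), here is a minimal Python sketch. It is not a description of our methods: the OCR output string, the function names, and the 0.8 similarity threshold are all hypothetical, and the tolerant matcher simply uses the standard library's `difflib` as a stand-in for whatever error-tolerant retrieval one might actually choose.

```python
from difflib import SequenceMatcher


def exact_hit(query: str, ocr_text: str) -> bool:
    """Exact-match retrieval: every query term must appear verbatim."""
    words = set(ocr_text.lower().split())
    return all(term in words for term in query.lower().split())


def fuzzy_hit(query: str, ocr_text: str, threshold: float = 0.8) -> bool:
    """Tolerant retrieval: every query term must be similar to some OCR word.

    The threshold is an illustrative assumption, not a tuned value.
    """
    words = ocr_text.lower().split()
    return all(
        any(SequenceMatcher(None, term, w).ratio() >= threshold for w in words)
        for term in query.lower().split()
    )


# Hypothetical OCR output for a title field, with two character errors
# ("o" read as "0", "g" read as "q").
ocr_title = "Design of Suspensi0n Bridqes"

print(exact_hit("suspension bridges", ocr_title))  # exact matching misses
print(fuzzy_hit("suspension bridges", ocr_title))  # tolerant matching recovers
```

In a long passage, other correctly recognized occurrences of a word can compensate for an error like these; in a field of only a few words there is no such redundancy, so a single misread character can hide the document from an exact-match query.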

Although document image analysis (DIA), in the form of OCR machines able to convert clean images of machine-printed text in Western languages to text, is to some extent a mature technology, the problems arising in educational DLs defeat the best present-day DIA techniques: such DLs must embrace many nations, cultures, and historical periods, and contain many types of non-textual content. Solving these challenging open problems is, we believe, an urgent priority to avoid the neglect, by default, of the world's vast irreplaceable cultural legacy collections of paper documents, in an unconsidered rush to a hegemony of "born digital" data collections.


© 2003 P.C. Rossin College of Engineering & Applied Science
Computer Science & Engineering, Packard Laboratory, Lehigh University, Bethlehem PA 18015