Henry S. Baird Research on Document Images in Digital Libraries
Digital libraries (DLs) promise to offer more people access to larger document
collections, and at far greater speed, than physical libraries can.
Increasingly, many types of DLs have grown into hybrid collections in which images of paper
documents are mixed with encoded data (ASCII, XML, etc.; either `born digital' or
accurately keyed in).
Ideally, users of DLs should be
able to search, access, examine, and navigate among
document images as easily, and in the same way (using the same interfaces
and tools to the maximum degree possible), as they can routinely today with
encoded data.
Unfortunately, DLs tend to serve poorly many types of non-digital human-legible media, including originally printed and handwritten documents. These media, in their original physical (undigitized) form, are readily (if not always quickly) legible, searchable, and browseable, whereas in the form of document images accessed through DLs they often lose many of their original advantages while of course lacking many advantages of symbolically encoded information. Difficult open technical problems in document image analysis (DIA) arise in the construction and use of DLs, e.g. due to the contrasting advantages of paper and digital displays, and at every stage of capture, early processing, recognition, analysis, presentation, and retrieval, as well as in personal and interactive applications.
Prof. Lopresti and I are focusing our research on scientific and engineering documents, especially large-format documents containing text, mathematical notation, line art, and other images, which are particularly awkward to make use of in DLs today. We are fortunate to have access to fascinating Lehigh University archives of exactly this sort, acquired during its long history as one of America's premier engineering education universities. For example, Lehigh's `Digital Bridges' DL is a unique collection of bridge engineering documents and plans going back to the 19th century. Only a small fraction of what is available is currently online, however, due in some degree to technical challenges we hope to address. Some highlights of our research agenda:
Although document image analysis (DIA), in the form of OCR machines able to convert clean images of machine-printed text in Western languages to text, is to some extent a mature technology, the problems arising in educational DLs (which must embrace many nations, cultures, and historical periods, and contain many types of non-textual content) defeat the best present-day DIA techniques. Solving these challenging open problems is, we believe, an urgent priority to avoid the neglect, by default, of the world's vast, irreplaceable cultural legacy collections of paper documents in an unconsidered rush to a hegemony of `born digital' data collections.