Henry S. Baird

Research on DICE: Document Image Content Extraction

Supported by the Defense Advanced Research Projects Agency (DARPA), Information Processing Technology Office (IPTO).

Given an image of a document, find regions ("zones") containing handwriting, machine-print text, graphics, line-art, logos, photographs, noise, etc. Solutions to this problem have applications in digital libraries, web search, intelligence analysis, and office automation.

DICE project members (Summer 2005): Michael Moll; continuing into the Fall '06 with Jean Nonnemaker, Matt Casey, and Don Delorenzo.

These papers report early results:

"Towards Versatile Document Analysis Systems" (w/ M. R. Casey), Proc., 7th IAPR Document Analysis Workshop (DAS'06), Nelson, New Zealand, February 12-15, 2006. [PDF, PPT]

"Versatile Document Content Extraction" (w/ M. A. Moll, J. Nonnemaker, M. R. Casey, D. L. Delorenzo), Proc., SPIE/IS&T Document Recognition & Retrieval XIII Conf., San Jose, CA, January 18-19, 2006. [PDF, PPT]

"Distinguishing Mathematics Notation from English Text using Computational Geometry" (w/ D. Drake), Proc., IAPR 8th Int'l Conf. on Document Analysis and Recognition (ICDAR2005), Seoul, Korea, August 31 - September 1, 2005. [PDF, PPT]

More details below.
Team members Michael Moll, Jean Nonnemaker, Matt Casey, Don Delorenzo, and Henry Baird gather for the first time.

Software engineering decisions:
- we'll code in C++/C; compile and run under Linux (Fedora Core IV);
- man pages / documentation: Doxygen (special comments added to source code can generate Unix man pages, LaTeX, HTML, ...);
- version control: Subversion;
- generally we'll try to reuse existing open-source code (under, e.g., GPL-like licenses); also we'll try to write code that might be palatable to the open-source community.

Top-down refinement of tasks.

User view: an end-user sees this functionality:

DICE-classifier:
  INPUT: any image of any document (page)
  OUTPUT: regions classified by their content

Kinds of document images accepted (ideally):
- color, grey-level, or B&W (bilevel);
- any size (height x width in pixels);
- any "resolution" (pixels/inch, the spatial sampling rate);
- any of a wide range of image file formats, e.g. TIFF, JPEG, PNG, CCITT G3/G4, ...

We choose to convert all input image file formats to a PNG file format with pixels encoded with three components in the HSL color space: Hue, Saturation, Luminance. The information in bilevel and grey images maps into the luminance component. (There are color spaces in which Euclidean distance correlates well with human perception of color difference; we didn't pursue this.) We decided to use one byte per channel, on the grounds that more levels aren't needed in this application (but George Nagy commented that we would need more to handle X-rays, standard color test targets, etc.).

We will use another 8-bit field to encode "content class" (CTCL). Some files will possess this channel in addition to the HSL channels: such a file will be called an HSLC file.

How shall "regions" be described? The option that dominates the literature is:

A. Rectangular orthogonal boxes (with X-axis- and Y-axis-parallel sides), indicating "zones", labeled with class. Semantics: every pixel within the box is of that class.
Advantages:
- most preexisting ground-truthed image data is described this way; and
- most methods for scoring classifier success use them.
Problems: what if true zones have slanted or curved sides?

B. An interesting and attractive alternative option: label each *pixel* with its class.
Advantages:
- easy to visualize;
- trivial to score; and
- not dependent on an arbitrary and restrictive family of zone shapes.

Decision: we'll use per-pixel ground truth, and furthermore the classifier will label pixels with classes. Caveat: if the rest of the world cares a lot, we can develop a separate process to merge these into rectangular zones.

Shall we allow zones to overlap? That is, can a region/pixel possess more than one class? Decision (05may26): yes, but we'll allow at most two classes per pixel.

Classes: kinds of regions to be distinguished (ideally):
  MP - machine-printed text
  HW - handwriting
  TA - tabular data
  MA - mathematics notation
  HD - header
  FT - footer
  LA - line-art graphics
  PH - continuous-tone photographs
  JK - junk: e.g. margin and gutter noise
  BL - blank: none of the above
  etc.

Decision: we will focus first on MP, HW, PH, BL. For this summer, we'll allow a maximum of 15 classes total, and each region can have zero, one, or two classes. This will be encoded in an 8-bit class channel:
  4-bit class 1 (0 encodes "unclassified")
  4-bit class 2
Zero classes is implied by the absence of the class component. Zero classes for a particular pixel can mean that the pixel is not yet classified, or that the classifier *cannot* classify it for any reason (e.g. it is too close to the edge of the page).

Engineer's view of DICE:

DICE-training:
  INPUT: a set of ground-truthed doc images
  OUTPUT: a DICE classifier
A DICE classifier may take the form of 'tableware' that a program interprets, or it may be a large set of files, possibly a huge directory full of large files ...
we don't know yet.

Both the DICE-classifier and DICE-training use this important stage for extracting "features" of pixels.

DICE-features:
  INPUT: a set of doc-image files
  OUTPUT: a set of pixel locations, each with a list of numerical features
How many features? What's the dimension 'd' of the data? We don't know yet: maybe hundreds. Dimension 'd' is fixed for all stages of the DICE system; programs will be written to handle *any* value of 'd'; 'd' will be determined by the feature-extractor code.

DICE-testing:
- run the DICE-classifier on doc-images w/ gt;
- compare classifier output with the gt;
- compute classifier accuracy scores;
- summarize, visualize, & analyze.

Collection of training and test data:
- find sets of doc-images with labeled ground truth;
- download them;
- convert images to our internal format;
- convert gt to our internal format;
- maybe, if we can't borrow enough, create our own by "amplification": generate synthetic data that differs in a controlled way from "real" training and test data, using models of common variations.

Programming guidelines (not binding):
- try to keep the executables simple, i.e. stdin/stdout where possible (no complex file handling);
- handle directory trees in shell scripts (e.g. Perl, Python, bash).

Features file: probably, for each pixel, a list of 'd' numerical values (d ~= 15-20).
File format: one ASCII line per pixel:
  class f1 f2 f3 ... f<d>
e.g.
  MP 2 10 554 30.2 100.1 ....
(Note: maybe we should implement continuation lines.)

Possible features:
- H, S, & L values for the pixel itself;
- each of the neighboring 8 pixels (or, 4 pixels);
- summed 3x3 neighborhoods (down-res'ed x9);
- maybe drop the H & S components, keep only L;
- maybe use Haar transforms;
- maybe use wavelets ....
Maybe normalize digitizing resolution (pixels/inch) to a fixed conventional value, e.g.
200 ppi.

HSLC file format: a legal variant on PNG (see http://www.libpng.org/pub/png/book/).

Further records of decisions, algorithms, and experiments are continued on our project WIKI site: http://snake-eyes.cse.lehigh.edu/wiki. Feel free to take a look!
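As an illustration of the HSL conversion decided above (not project code; the struct and function names are ours), a standard RGB-to-HSL conversion with one byte per channel might look like this in C++:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// One pixel of the internal format: H, S, L each scaled to 0..255.
struct Hsl { std::uint8_t h, s, l; };

// Convert one 8-bit RGB pixel to 8-bit-per-channel HSL.
// For grey pixels (r == g == b) only the luminance carries information,
// matching the note that bilevel/grey images map into the L component.
inline Hsl rgbToHsl(std::uint8_t r8, std::uint8_t g8, std::uint8_t b8) {
    double r = r8 / 255.0, g = g8 / 255.0, b = b8 / 255.0;
    double mx = std::max({r, g, b});
    double mn = std::min({r, g, b});
    double l = (mx + mn) / 2.0;
    double h = 0.0, s = 0.0;
    if (mx != mn) {
        double d = mx - mn;
        s = (l > 0.5) ? d / (2.0 - mx - mn) : d / (mx + mn);
        if (mx == r)      h = (g - b) / d + (g < b ? 6.0 : 0.0);
        else if (mx == g) h = (b - r) / d + 2.0;
        else              h = (r - g) / d + 4.0;
        h /= 6.0;  // normalize hue to [0, 1)
    }
    return Hsl{ static_cast<std::uint8_t>(std::lround(h * 255.0)),
                static_cast<std::uint8_t>(std::lround(s * 255.0)),
                static_cast<std::uint8_t>(std::lround(l * 255.0)) };
}
```

E.g. pure red (255, 0, 0) maps to hue 0, full saturation, and mid luminance, while any grey maps to zero saturation.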
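The 8-bit class channel described above (two 4-bit class codes, 0 meaning "unclassified") can be sketched in C++ as follows; the nibble layout (class 1 in the low nibble) and the particular numeric codes are our assumptions, not recorded project decisions:

```cpp
#include <cstdint>

// Illustrative 4-bit class codes; 0 is reserved for "unclassified".
enum ClassCode : std::uint8_t {
    UNCLASSIFIED = 0, MP = 1, HW = 2, PH = 3, BL = 4
};

// Pack two 4-bit class codes into one 8-bit class-channel byte.
inline std::uint8_t packClasses(std::uint8_t c1, std::uint8_t c2) {
    return static_cast<std::uint8_t>((c1 & 0x0F) | ((c2 & 0x0F) << 4));
}

// Unpack the two codes again.
inline std::uint8_t class1(std::uint8_t packed) { return packed & 0x0F; }
inline std::uint8_t class2(std::uint8_t packed) { return packed >> 4; }
```

With 4 bits per slot and 0 reserved, this yields exactly the 15 usable classes the notes budget for.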
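A minimal writer for the per-pixel features file sketched above (one ASCII line per pixel: class label, then 'd' numeric values, e.g. "MP 2 10 554 30.2 100.1") could look like this; the function name is ours:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Format one features-file line: the class label followed by the
// pixel's 'd' feature values, space-separated, matching the
// "class f1 f2 f3 ... f<d>" layout.
inline std::string featureLine(const std::string& classLabel,
                               const std::vector<double>& features) {
    std::ostringstream out;
    out << classLabel;
    for (double f : features) out << ' ' << f;
    return out.str();
}
```

Keeping this on stdout fits the guideline of simple stdin/stdout executables, with directory handling left to shell scripts.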
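One candidate feature from the list above, the summed 3x3 neighborhood (a 9x down-res of one channel), might be computed as below; the row-major layout and the treat-out-of-bounds-as-zero border policy are our assumptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sum the 3x3 neighborhood of pixel (x, y) in a row-major single-channel
// image (e.g. the luminance component). Neighbors falling outside the
// image contribute 0.
inline int sum3x3(const std::vector<std::uint8_t>& img,
                  std::size_t width, std::size_t height,
                  std::size_t x, std::size_t y) {
    int sum = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            long nx = static_cast<long>(x) + dx;
            long ny = static_cast<long>(y) + dy;
            if (nx < 0 || ny < 0 ||
                nx >= static_cast<long>(width) ||
                ny >= static_cast<long>(height))
                continue;  // off-image neighbor counts as 0
            sum += img[static_cast<std::size_t>(ny) * width +
                       static_cast<std::size_t>(nx)];
        }
    }
    return sum;
}
```

Pixels near the page edge get smaller sums under this policy, which is one concrete reason a classifier might mark edge pixels "unclassified" instead.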
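In the spirit of DICE-testing above (compare classifier output with ground truth and compute accuracy scores), a minimal per-pixel scorer might look like this; the skip-unlabeled policy and names are our assumptions:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Fraction of ground-truthed pixels the classifier labeled correctly.
// Pixels whose ground-truth code is 0 ("unclassified") are not scored,
// since zero classes can mean "not yet classified".
inline double pixelAccuracy(const std::vector<std::uint8_t>& truth,
                            const std::vector<std::uint8_t>& predicted) {
    std::size_t scored = 0, correct = 0;
    for (std::size_t i = 0; i < truth.size() && i < predicted.size(); ++i) {
        if (truth[i] == 0) continue;  // no ground truth for this pixel
        ++scored;
        if (truth[i] == predicted[i]) ++correct;
    }
    return scored ? static_cast<double>(correct) / scored : 0.0;
}
```

This is the "trivial to score" advantage of per-pixel labels in practice: accuracy is a single pass over the two label arrays, with no zone-matching step.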