Henry S. Baird
Research on DICE:
Document Image Content Extraction
--- Supported by the Defense Advanced Research Projects Agency (DARPA)
Information Processing Technology Office (IPTO) ---
Given an image of a document,
find regions ('zones') containing handwriting, machine-print text,
graphics, line-art, logos, photographs, noise, etc.
Solutions to this problem have applications in digital libraries, web search, intelligence analysis, and office automation.
DICE Project members (Summer 2005):
Michael Moll; continuing into Fall '06 with:
These papers report early results:
``Versatile Document Content Extraction,'' (w/ M. A. Moll, J. Nonnemaker, M. R. Casey, D. L. Delorenzo), Proc., SPIE/IS&T Document Recognition & Retrieval XIII Conf., San Jose, CA, January 18-19, 2006. [PDF, PPT]
``Distinguishing Mathematics Notation from English Text using Computational Geometry,'' (w/ D. Drake), Proc., IAPR 8th Int'l Conf. on Document Analysis and Recognition (ICDAR2005), Seoul, Korea, August 31 - September 1, 2005. [PDF, PPT]
More details below.....
Michael Moll, Jean Nonnemaker, Matt Casey, Don Delorenzo,
& Henry Baird gather for the first time.
Software engineering decisions:
We'll code in C++/C
compile, run under Linux (Fedora Core 4)
man pages / documentation:
Doxygen: special comments added to source code
can generate Unix man pages, LaTeX, HTML, ...
version control: Subversion
generally we'll try to reuse existing open-source code
(under, e.g., GPL-like licenses)
also we'll try to write code that might be palatable to
the open source community
Top-down refinement of tasks
User view: an end-user sees this functionality:
INPUT: any image of any document (page)
OUTPUT: regions classified by their content
Kinds of document images accepted (ideally)
any size (hgt x wid in pixels)
any "resolution" (pixels/inch), spatial sampling rate
any of a wide range of image file formats, e.g.
We choose to convert all input image file formats
to a PNG file format with pixels encoded with
three components in the HSL color space:
the information in bilevel and grey images maps into
the luminance component
(There are color spaces in which Euclidean distance
correlates well with human perception of color difference;
we didn't pursue this....)
We decided to use one byte per channel, on the grounds
that more levels aren't needed in this application (but
George Nagy commented that we would need more
to handle Xrays, standard color test targets, etc).
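The RGB-to-HSL conversion implied above could look like the sketch below; it uses the standard HSL formula with one byte per channel, but the function name, rounding, and channel order are our assumptions, not the project's code.

```cpp
#include <algorithm>
#include <cmath>

// One HSL pixel, one byte per channel as decided above.
struct HSL { unsigned char h, s, l; };

// Convert an 8-bit RGB triple to 8-bit HSL (standard formula; a sketch).
HSL rgbToHsl(unsigned char r8, unsigned char g8, unsigned char b8) {
    double r = r8 / 255.0, g = g8 / 255.0, b = b8 / 255.0;
    double mx = std::max({r, g, b}), mn = std::min({r, g, b});
    double l = (mx + mn) / 2.0;
    double h = 0.0, s = 0.0;
    if (mx != mn) {
        double d = mx - mn;
        s = (l > 0.5) ? d / (2.0 - mx - mn) : d / (mx + mn);
        if (mx == r)      h = (g - b) / d + (g < b ? 6.0 : 0.0);
        else if (mx == g) h = (b - r) / d + 2.0;
        else              h = (r - g) / d + 4.0;
        h /= 6.0;
    }
    return { (unsigned char)std::lround(h * 255.0),
             (unsigned char)std::lround(s * 255.0),
             (unsigned char)std::lround(l * 255.0) };
}
```

Note that any grey input (r = g = b) lands entirely in the L channel with s = 0, which is what lets bilevel and grey images map cleanly into luminance.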
We will use another 8-bit field to encode "content class"
(CTCL). Some files will possess this channel in addition
to the HSL channels: such a file will be called an HSLC file.
How shall "regions" be described?
the option that dominates the literature is:
A. Rectangular orthogonal (with X-axis & Y-axis
parallel sides) boxes, indicating "zones",
labeled with class.
Semantics: every pixel within the box is of that class.
- most preexisting ground-truthed image data is
described this way; and
- most methods for scoring classifier success use them.
Problems: what if true zones have slanted or non-rectangular boundaries?
B. Interesting and attractive alternative option: label each
*pixel* with the class.
- easy to visualize;
- trivial to score; and
- not dependent on an arbitrary and restrictive
family of zone shapes.
Decision: we'll use per-pixel ground-truth, and furthermore
the classifier will label pixels with classes.
Caveat: if the rest of the world cares a lot, we can develop
a separate process to merge these into rectangular zones.
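As a minimal illustration of such a merging process (our sketch, not the project's): collapse a per-pixel label map into one axis-parallel bounding box per class. A real merge would first split each class into connected components so that separate zones of the same class get separate boxes.

```cpp
#include <algorithm>
#include <map>
#include <vector>

// Axis-parallel bounding box in pixel coordinates.
struct Box { int x0, y0, x1, y1; };

// Collapse a per-pixel label map (one char code per pixel, 0 = unlabeled)
// into one bounding box per class -- the simplest possible "zone" merge.
std::map<char, Box> boxesPerClass(const std::vector<std::vector<char>>& labels) {
    std::map<char, Box> boxes;
    for (int y = 0; y < (int)labels.size(); ++y)
        for (int x = 0; x < (int)labels[y].size(); ++x) {
            char c = labels[y][x];
            if (c == 0) continue;
            auto it = boxes.find(c);
            if (it == boxes.end()) boxes[c] = {x, y, x, y};
            else {
                Box& b = it->second;
                b.x0 = std::min(b.x0, x); b.y0 = std::min(b.y0, y);
                b.x1 = std::max(b.x1, x); b.y1 = std::max(b.y1, y);
            }
        }
    return boxes;
}
```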
Shall we allow zones to overlap?
i.e. can a region/pixel possess more than one class?
Decision (05may26): yes, but we'll allow at most two classes per pixel
Classes: kinds of regions to be distinguished (ideally)
MP - machine printed text
TA - tabular data
MA - mathematics notation
HD - header
FT - footer
HW - handwriting
LA - line-art graphics
PH - continuous-tone photographs
JK - junk: e.g. margin and gutter noise
BL - blank: none of the above
Decision: we will focus first on
MP, HW, PH, BL
For this summer, we'll allow a maximum of 15 classes total
and each region can have zero, one, or two classes
This will be encoded in an 8-bit class channel:
4-bit class 1 (0 encodes "unclassified")
4-bit class 2 (0 encodes "unclassified")
Zero classes is implied by the absence of the class component.
Zero classes for a particular pixel can mean that the
pixel is not yet classified or that the classifier *cannot*
classify it for any reason (e.g. it is too close to the
edge of the page)
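Packing the two 4-bit class codes into the single 8-bit CTCL channel is straightforward; which nibble holds class 1 is our assumption. With 0 reserved for "unclassified", each nibble leaves codes 1-15, matching the 15-class limit above.

```cpp
// Pack two 4-bit class codes into one 8-bit CTCL byte.
// Code 0 means "unclassified"; nibble order is our assumption.
unsigned char packClasses(unsigned c1, unsigned c2) {
    return (unsigned char)(((c1 & 0xF) << 4) | (c2 & 0xF));
}
unsigned class1(unsigned char ctcl) { return ctcl >> 4; }
unsigned class2(unsigned char ctcl) { return ctcl & 0xF; }
```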
Engineer's view of DICE:
INPUT: a set of ground-truthed doc images
OUTPUT: a DICE classifier
A DICE classifier may take the form of 'tableware'
that a program interprets
or, it may be a large set of files
possibly a huge directory full of large files
.. we don't know yet
Both the DICE classifier and the DICE trainer
use this important stage for extracting "features" of the input images:
INPUT: a set of doc-image files
OUTPUT: a set of pixel locations each
with a list of numerical features
How many features? What's the dimension 'd' of the data?
don't know yet: maybe hundreds
Dimension 'd' is fixed for all stages of the DICE system
Programs will be written to handle *any* value of 'd'
'd' will be determined by the feature extractor code
- run the DICE-classifier on doc-images w/ gt
- compare classifier output with the gt
- compute classifier accuracy scores
- summarize, visualize, & analyze
Collection of training and test data
- find sets of doc-images with labeled ground truth
- download them
- convert images to our internal format
- convert gt to our internal format
- maybe, if we can't borrow enough, create our own
by "amplification": generate synthetic data that
differs in a controlled way from "real" training
and test data, using models of common variations
Programming guidelines (not binding)
try to keep the executables simple, i.e.
stdin/stdout where possible (no complex file handling)
handle directory trees in shell scripts (e.g. Perl, Python, bash)
for each pixel, a list of 'd' numerical values (d ~= 15-20)
class f1 f2 f3 ... f<d> // one ASCII line per pixel
MP 2 10 554 30.2 100.1 ....
// Note: maybe should implement continuation, e.g.
H S & L values for
the pixel itself
each of the neighboring 8 pixels (or, 4 pixels)
neighboring summed 3x3 neighborhoods (down-res'ed x9)
maybe drop the H & S components, keep only L
maybe use Haar transforms
maybe use wavelets
maybe normalize digitizing resolution (pixels/inch) to a
fixed conventional value, e.g. 200 ppi
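A minimal sketch of one candidate feature extractor from the list above: the pixel's own L value plus its 8 neighbors (so d = 9 here, versus the d ~= 15-20 expected), formatted as one ASCII line per pixel. Function names are ours; out-of-image neighbors are taken as 0, which is one of several reasonable border policies.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Luminance features for one pixel: itself plus its 8 neighbors,
// in row-major order (d = 9).  Out-of-image neighbors read as 0.
std::vector<int> lumFeatures(const std::vector<std::vector<int>>& lum,
                             int x, int y) {
    std::vector<int> f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int yy = y + dy, xx = x + dx;
            bool in = yy >= 0 && yy < (int)lum.size() &&
                      xx >= 0 && xx < (int)lum[yy].size();
            f.push_back(in ? lum[yy][xx] : 0);
        }
    return f;
}

// One ASCII line per pixel: "class f1 f2 ... f<d>", as specified above.
std::string featureLine(const std::string& cls, const std::vector<int>& f) {
    std::ostringstream os;
    os << cls;
    for (int v : f) os << ' ' << v;
    return os.str();
}
```

This kind of filter fits the stdin/stdout guideline: read pixels, emit one line each, and let shell scripts handle the directory trees.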
HSLC file format
a legal variant of the PNG format
Further records of decisions, algorithms, experiments are continued
on our project WIKI site:
Feel free to take a look!