Henry Baird's Research on Document Image Classification


	•	Home

	•	Research

	•	Courses
	•	Prof'l Activities
	•	Conferences

	•	Publications
	•	Recent Talks

	•	Patents
	•	Awards
	•	Miscellaneous
	•	Vita (PDF)

Henry S. Baird

Research on DICE:

Document Image Content Extraction

--- Supported by the Defense Research Projects Agency (DARPA)
Information Processing Technology Office (ITP) ---

Given an image of a document,
find regions ('zones') containing handwriting, machine-print text,
graphics, line-art, logos, photographs, noise, etc.

Solutions to this problem have applications in digital libraries, web search, intelligence analysis, and office automation.

DICE Project members (Summer 2005):

Michael Moll
Jean Nonnemaker
Matt Casey
Don Delorenzo
Henry Baird

& continuing into the Fall '06 with:
Michael Moll
Matt Casey

These papers report early results:

``Towards Versatile Document Analsyis Systems,'' (w/ M. R. Casey) Proc., 7th IAPR Document Analysis Workshop (DAS'06), Nelson, New Zealand, February 12-15, 2006. [PDF, PPT]

``Versatile Document Content Extraction,'' (w/ M. A. Moll, J. Nonnemaker, M. R. Casey, D. L. Delorenzo) Proc., SPIE/IS&T Document Recognition & Retrieval XIII Conf., San Jose, CA, January 18-19, 2006. [PDF, PPT]

``Distinguishing Mathematics Notation from English Text using Computational Geometry,'' (w/ D. Drake) Proc., IAPR 8th Int'l Conf. on Document Analysis and Recognition (ICDAR2005), Seoul, Korea, August 31 - September 1, 2005. [PDF, PPT]

More details below.....

Challenges include:

vast diversity of document types:

languages & writing systems
typefaces and handwriting styles
sizes of text, e.g. more than one per page
page layouts
overlapping regions, e.g. HW notes on MP
orientation: e.g. rightside up, upside down
skew, shear, & other geometric deformations
image degradations: blur, thresholding, additive noise, etc

variety of image types:

color, grey, black-and-white
compressed vs uncompressed; and if compressed, lossy or lossless
carefully scanned with controlled illumination vs carelessly snapped
scanned flat vs in perspective, curved, etc

lack of consensus in the research community on how to evaluate performance, e.g. how should 'regions' be described: as sets of pixels, or as unions of interiors of sets of rectangles or polygons?
desirability of extremely high speed in classification so that, e.g., they can be applied repeatedly to large collections of document images
need for easily improvable methods, i.e. automatically trainable from large data sets to achieve improved performance on previously unseen test data for which the training data is "representative"
serious unanswered (unanswerable?) methodological questions in specifying what a "representative" data set is, and daunting practical obstacles to collecting one no matter how it's defined
high expense and numbing tedium of preparing correctly labeled ("ground-truthed") samples in the large numbers needed for training and testing

Research strategies that we will emphasize:

Extremely high speed classification: ideally, nearly at I/O rates (as fast as the images can be read)
Voracious classifiers: trainable on billions of samples in reasonable time
Versatility first: concentrate on methods that work across the broadest possible variety of cases (document and image types)
Confidence dominates accuracy: good estimates of the confidence of decisions is important; high accuracies are desirable but even modest accuracy can be useful
Data-driven design: we won't invest much effort in making what are, ultimately, arbitrary engineering decisions such as choice of preprocessors and features; instead, we'll try to allow the training data to determine these automatically as far we can
Amplification: use real ground-truthed training samples as 'seeds' for massive synthetic generation of pseudorandomly perturbed samples for use in supplementary training
Infinite space: i.e. design for a future when main memory will be orders of magnitude larger and faster than today

Meeting 05may16:

    Team members
        Michael Moll, Jean Nonnemaker, Matt Casey, Don Delorenzo,
        & Henry Baird gather for the first time.

    Software engineering decisions:
         We'll code in C++/C
         compile, run under Linux (Fedora Core IV)
         man pages / documentation:
                Doxygen: special comments added to source code
                  can generate Unix man pages, LaTeX, HTML, ...
version control: SubVersion
        generally we'll try to reuse exiusting open-source code
              (under, e.g. GPL-like licenses)
        also we'll try to write code that might be palatable to
              the open source community

Top-down refinement of tasks

User view: an end-user sees this functionality:

    DICE-classifier:
        INPUT: any image of any document (page)
        OUTPUT: regions classified by their content

Kinds of document images accepted (ideally)
        Color
        Grey-level
        B&W (bilevel)
        ---
        any size (hgt x wid in pixels)
        any "resolution" (pixels/inch), spatial sampling rate
        --
        any of a wide range of image file formats, e.g.
           TIFF
           JPEG
           PNG
        CCITT G3/G4
           ...
We choose to convert all input image file formats
to a PNG file format with pixels encoded with
three components in the HSL color space:
Hue
Saturation
Luminance
    the infomation in bilevel and grey images map into
the luminance component
(There are color spaces in which Euclidean distance
          correlates well with human perception of color difference;
we didn't pursue this....)
We decided to use one byte per channel, on the grounds
that more levels aren't needed in this application (but
George Nagy commented that we would need more
to handle Xrays, standard color test targets, etc).
We will use another 8-bit field to encode "content class"
(CTCL). Some files will possess this channel in addition
to the HSL channels: such a file will be called an HSLC
file.

    How shall "regions" be described?
       the option that dominates the literature is:
            A. Rectangular orthogonal (with X-axis & Y-axis
parallel sides) boxes, indicating "zones",
                 labeled with class.
             Semantics:    every pixel within the box is of that class.
             Advantages:
                 - most preexisting ground-truthed image data is
                     described this way; and
                 - most methods for scoring classifier success use them.
             Problems:  what if true zones have slanted or
curved sides?
       B. Interesting and attractive alternative option: label each
   *pixel* with the class.
Advantages:
             - easy to visualize;
             - trivial to score; and
             - not dependent on an arbitrary and restrictive class
                 familty of zone shapes.
        Decision: we'll use per-pixel ground-truth, and furthermore
the classifier will label pixels with classes.
Caveat: if the rest of the world cares a lot, we can develop
a separate process to merge these into rectangular zones.

   Shall we allow zones to overlap?
        i.e. can a region/pixel possess more than one class?
   Decision (05may26): yes, but we'll allow

Classes: kinds of regions to be distinguished (ideally)
            MP - machine printed text
                  TA - tabular data
                  MA - mathematics notation
                  HD - header
                  FT - footer
            HW - handwriting
                 TA - tabular data
                 MA - mathematics notation
                 etc
            LA - line-art graphics
            PH - continuous-tone photographs
            JK - junk: e.g. margin and gutter noise
            BL - blank: none of the above
Decision: we will focus first on
           MP, HW, PH, BL
   For this summer, we'll allow a maximum of 15 classes total
   and each region can have zero, one, or two classes
   This will be encoded in an 8-bit class channel:
         4-bit class 1    0 encodes "unclassified"
         4-bit class 2
   Zero classes is implied by the absence of the class component.
   Zero classes for a particular pixel can mean that the
     pixel is not yet classified or that the classifier *cannot*
     classify it for any reason (e.g. it is too close to the
     edge of the page)

Engineer's view of DICE:

     DICE-training:
        INPUT:   a set of ground-truthed doc images
        OUPUT: a DICE classifier

A DICE classifier may take the form of 'tableware'
      that a program interprets
      or, it may be a large set of files
      possible a huge directory full of large files
      .. we don't know yet

Both the DICE-classifier and the DICE-training
use this important stage for extracting "features" of
pixels.

    DICE-features:
    INPUT: a set of doc-image files
    OUTPUT: a set of pixel locations each
       with a list of numerical features

How many features?   What's the dimension 'd' of the data?
    don't know yet: maybe hundreds
    Dimension 'd' is fixed for all stages of the DICE system
    Programs will be written to handle *any* value of 'd'
    'd' will be determined by the feature extractor code

DICE-testing
     - run the DICE-classifier on doc-images w/ gt
     - compare classifier output with the gt
     - compute classifier accuracy scores
     - summarize, visualize, & analyze

Collection of training and text data
     - find sets of doc-images with labeled
     - download them
     - convert images to our internal format
     - convert gt to our internal format
- maybe, if we can't borrow enough, create our own
by "amplification": generate synthetic data that
differs in a controlled way from "real" training
   and test data, using models of common variations

Programming guidelines (not binding)
     try to keep the executables simple, i.e.
         stdin/stdout where possible (no complex file handling)
     handle directory trees in shell scripts (e.g. PERL, Python, bash)

Features file
      probably:
         for each pixel, a list of 'd' numerical values (d ~= 15-20)
     file format
         class f1 f2 f3 ... f<d>      // one ASCII line per pixel
         e.g.
                 MP 2 10 554 30.2 100.1 ....
         // Note: maybe should implement continuation, e.g.
     possible features:
           H S & L values for
            the pixel itself
                each of the neighboring 8 pixels (or, 4 pixels)
                neighboring summed 3x3 neighborhoods (down-res'ed x9)
                maybe drop the H & S components, keep only L
          maybe use Haar transforms
          maybe use wavelets
          ....
      maybe normalize digitizing resolution (pixels/inch) to a
         fixed conventional value, e.g. 200 ppi

HSLC file format
      a legal variant on
             PNG   http://www.libpng.org/pub/png/book/

Further records of decisions, algorithms, experiments are continued
on our project WIKI site:
http://snake-eyes.cse.lehigh.edu/wiki
Feel free to take a look!