Lehigh University
COLLEGE HOME | LEHIGH HOME | SEARCH



•  Home
•  Research
•  Courses
•  Prof'l Activities
•  Conferences
•  Publications
•  Recent Talks
•  Patents
•  Awards
•  Miscellaneous
•  Vita (PDF)


   


Henry S. Baird

Research on DICE:

Document Image Content Extraction



Given an image of a document,
find
regions ('zones') containing handwriting, machine-print text,
       graphics, line-art, logos, photographs, noise, etc.

Solutions to this problem have applications in digital libraries, web search, intelligence analysis, and office automation.

DICE Project members (Summer 2005):
Michael Moll
Jean Nonnemaker
Matt Casey
Don Delorenzo
Henry Baird
& continuing into the Fall '06 with:
           Michael Moll
           Matt Casey

These papers report early results:

``Towards Versatile Document Analsyis Systems,'' (w/ M. R. Casey) Proc., 7th IAPR Document Analysis Workshop (DAS'06), Nelson, New Zealand, February 12-15, 2006. [PDF, PPT]

``Versatile Document Content Extraction,'' (w/ M. A. Moll, J. Nonnemaker, M. R. Casey, D. L. Delorenzo) Proc., SPIE/IS&T Document Recognition & Retrieval XIII Conf., San Jose, CA, January 18-19, 2006. [PDF, PPT]

``Distinguishing Mathematics Notation from English Text using Computational Geometry,'' (w/ D. Drake) Proc., IAPR 8th Int'l Conf. on Document Analysis and Recognition (ICDAR2005), Seoul, Korea, August 31 - September 1, 2005. [PDF, PPT]

More details below.....

Challenges include:
  1. vast diversity of document types:
    1. languages & writing systems
    2. typefaces and handwriting styles
    3. sizes of text, e.g. more than one per page
    4. page layouts
    5. overlapping regions, e.g. HW notes on MP
    6. orientation:  e.g.  rightside up, upside down
    7. skew, shear, & other geometric deformations
    8. image degradations:  blur, thresholding, additive noise, etc
  2. variety of image types:
    1. color, grey, black-and-white
    2. compressed vs uncompressed; and if compressed, lossy or lossless
    3. carefully scanned with controlled illumination vs carelessly snapped
    4. scanned flat vs in perspective, curved, etc
  3. lack of consensus in the research community on how to evaluate performance, e.g. how should 'regions' be described:  as sets of pixels, or as unions of interiors of sets of rectangles or polygons?
  4. desirability of extremely high speed in classification so that, e.g., they can be applied repeatedly to large collections of document images
  5. need for easily improvable methods, i.e. automatically trainable from large data sets to achieve improved performance on previously unseen test data for which the training data is "representative"
  6. serious unanswered (unanswerable?) methodological questions in specifying what a "representative" data set is, and daunting practical obstacles to collecting one no matter how it's defined
  7. high expense and numbing tedium of preparing correctly labeled ("ground-truthed") samples in the large numbers needed for training and testing
Research strategies that we will emphasize:
  1. Extremely high speed classification:  ideally, nearly at I/O rates (as fast as the images can be read)
  2. Voracious classifiers:  trainable on billions of samples in reasonable time
  3. Versatility first:  concentrate on methods that work across the broadest possible variety of cases (document and image types)
  4. Confidence dominates accuracy:  good estimates of the confidence of decisions is important; high accuracies are desirable but even modest accuracy can be useful
  5. Data-driven design: we won't invest much effort in making what are, ultimately, arbitrary engineering decisions such as choice of preprocessors and features; instead, we'll try to allow the training data to determine these automatically as far we can
  6. Amplification:  use real ground-truthed training samples as 'seeds' for massive synthetic generation of pseudorandomly perturbed samples for use in supplementary training
  7. Infinite spacei.e. design for a future when main memory will be orders of magnitude larger and faster than today
Meeting 05may16:

    Team members
        Michael Moll, Jean Nonnemaker, Matt Casey, Don Delorenzo,
        & Henry Baird gather for the first time.

    Software engineering decisions:
         We'll code in C++/C
         compile, run under Linux (Fedora Core IV)
         man pages / documentation:
                Doxygen:  special comments added to source code
                      can generate Unix man pages, LaTeX, HTML, ...
        version control:   SubVersion
        generally we'll try to reuse exiusting open-source code
              (under, e.g. GPL-like licenses)
        also we'll try to write code that might be palatable to
              the open source community
       
Top-down refinement of tasks

User view:  an end-user sees this functionality:

    DICE-classifier:
        INPUT:  any image of any document (page)
        OUTPUT:  regions classified by their content

Kinds of document images accepted (ideally)
        Color
        Grey-level
        B&W (bilevel)
        ---
        any size (hgt x wid in pixels)
        any "resolution" (pixels/inch), spatial sampling rate
        --
        any of a wide range of image file formats, e.g.
           TIFF
           JPEG
           PNG
          CCITT G3/G4
           ...
        We choose to convert all input image file formats
         to a PNG file format with pixels encoded with
         three components in the HSL color space:
              Hue
              Saturation
              Luminance
         the infomation in bilevel and grey images map into
         the luminance component
         (There are color spaces in which Euclidean distance
          correlates well with human perception of color difference;
          we didn't pursue this....)
          We decided to use one byte per channel, on the grounds
          that more levels aren't needed in this application (but
          George Nagy commented that we would need more
         to handle Xrays, standard color test targets, etc).
       We will use another 8-bit field to encode "content class"
       (CTCL).  Some files will possess this channel in addition
       to the HSL channels:  such a file will be called an HSLC
       file.

    How shall "regions" be described?
       the option that dominates the literature is:
            A. Rectangular orthogonal (with X-axis & Y-axis
                 parallel sides) boxes, indicating "zones",
                 labeled with class.
                 Semantics:    every pixel within the box is of that class.
                 Advantages:
                     - most preexisting ground-truthed image data is
                       described this way; and
                    - most methods for scoring classifier success use them.
                 Problems:  what if true zones have slanted or
                      curved sides?
           B. Interesting and attractive alternative option:  label each
                *pixel* with the class.
                 Advantages:
                     - easy to visualize;
                     - trivial to score; and
                     - not dependent on an arbitrary and restrictive class
                        familty of zone shapes.
        Decision:  we'll use per-pixel ground-truth, and furthermore
        the classifier will label pixels with classes.
        Caveat:  if the rest of the world cares a lot, we can develop
            a separate process to merge these into rectangular zones.

   Shall we allow zones to overlap?
          i.e.  can a region/pixel possess more than one class?
   Decision (05may26):  yes, but we'll allow

Classes:  kinds of regions to be distinguished (ideally)
            MP  - machine printed text
                  TA - tabular data
                  MA - mathematics notation
                  HD - header
                  FT - footer
            HW - handwriting
                 TA - tabular data
                 MA - mathematics notation
                 etc
            LA - line-art graphics
            PH - continuous-tone photographs
            JK - junk:  e.g. margin and gutter noise
            BL - blank:  none of the above
  Decision:  we will focus first on
           MP, HW, PH, BL
   For this summer, we'll allow a maximum of 15 classes total
   and each region can have zero, one, or two classes
   This will  be encoded in an 8-bit class channel:
         4-bit class 1    0 encodes "unclassified"
         4-bit class 2
   Zero classes is implied by the absence of the class component.
   Zero classes for a particular pixel can mean that the
     pixel is not yet classified or that the classifier *cannot*
     classify it for any reason (e.g. it is too close to the
     edge of the page)

  Engineer's view of DICE:

     DICE-training:
        INPUT:   a set of ground-truthed doc images
        OUPUT:  a DICE classifier

  A DICE classifier may take the form of 'tableware'
      that a program interprets
      or, it may be a large set of files
      possible a huge directory full of large files
      ..  we don't know yet

Both the DICE-classifier and the DICE-training
use this important stage for extracting "features" of
pixels.

    DICE-features:
    INPUT:  a set of doc-image files
    OUTPUT:  a set of pixel locations each
       with a list of numerical features

How many features?   What's the dimension 'd' of the data?
    don't know yet:  maybe hundreds
    Dimension 'd' is fixed for all stages of the DICE system
    Programs will be written to handle *any* value of 'd'
    'd' will be determined by the feature extractor code

DICE-testing
     -  run the DICE-classifier on doc-images w/ gt
     -  compare classifier output with the gt
     -  compute classifier accuracy scores
     -  summarize, visualize, & analyze

Collection of training and text data
     - find sets of doc-images with labeled
     - download them
     - convert images to our internal format
     - convert gt to our internal format
     - maybe, if we can't borrow enough, create our own
         by "amplification":   generate synthetic data that
         differs in a controlled way from "real" training
         and test data, using models of common variations

Programming guidelines (not binding)
     try to keep the executables simple, i.e.
         stdin/stdout where possible   (no complex file handling)
     handle directory trees in shell scripts (e.g. PERL, Python, bash)

 Features file
      probably:
         for each pixel, a list of 'd' numerical values (d ~= 15-20)
     file format
         class f1 f2 f3 ... f<d>      // one ASCII line per pixel
         e.g.
                 MP 2 10 554 30.2 100.1 ....     
         // Note:  maybe should implement continuation, e.g.
     possible features:
           H S & L values for
                the pixel itself
                each of the neighboring 8 pixels (or, 4 pixels)
                neighboring summed 3x3 neighborhoods (down-res'ed x9)
                maybe drop the H & S components, keep only L
          maybe use Haar transforms
          maybe use wavelets
          ....
      maybe normalize digitizing resolution (pixels/inch) to a
         fixed conventional value, e.g. 200 ppi
   
HSLC file format
      a legal variant on
             PNG   http://www.libpng.org/pub/png/book/

 Further records of decisions, algorithms, experiments are continued
 on our project WIKI site:
        http://snake-eyes.cse.lehigh.edu/wiki
 Feel free to take a look!





© 2003 P.C. Rossin College of Engineering & Applied Science
Computer Science & Engineering, Packard Laboratory, Lehigh University, Bethlehem PA 18015