Goal
To implement dimension reduction techniques in a text classifier.
Teams
For this project, we will have five small teams. They are:
- Team 1: Vinay Goel and Andy Powers
- Team 2: Chris Janneck and Chad Hogg
- Team 3: Philip Garcia and Lan Nie
- Team 4: Keith Erekson, Matthew Kreibel, and Walter Scheirer
- Team 5: Chris Kramer and Xiaoguang Qi
Projects are intended to be implemented using pair programming techniques.
That is, all members are expected to work on the bulk of the project
together, sharing one workstation; this encourages both peer teaching
and better coding, since more than one person is following what is being done.
Bow (or libbow) is a library of C code useful for writing statistical
text analysis, language modeling, and information retrieval programs. The
current distribution includes the library, as well as front-ends for
document classification (rainbow), document retrieval (arrow) and document
clustering (crossbow).
In this project, we will be using rainbow (and/or libbow) in text
classification experiments. Bow compiles cleanly (warnings only) on Linux
and Solaris, and is installed in /proj/searchengines/bow on the Suns.
Many gigabytes of available file space can be found in
/proj/searchengines/ on the Suns. A Linux system is also available:
wume4.cse.lehigh.edu has dual PIII processors and 2GB of RAM, and uses the
CSE/ECE Sun accounts and filesystems.
While I think it is unlikely that the task will need it, if your system
needs more than the 512MB found on most Sun workstations, use
europa.eecs.lehigh.edu, as it has 4GB of RAM.
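As a starting point for scripting rainbow runs, the sketch below builds (but does not execute) one command line per combination of classification method and feature-selection setting. It is only an illustration under stated assumptions: the flag spellings (`-d`, `--method`, `--prune-vocab-by-infogain`, `--test-set`, `--test`) and the method names are recalled from the rainbow documentation and should be checked against `rainbow --help` on the installed version; the model directory name, the vocabulary size of 1000, and the 0.4 test fraction with 3 trials are arbitrary placeholders. It also assumes a model has already been indexed into the model directory.

```python
from itertools import product

# Assumed method names; check `rainbow --help` for those your build supports.
METHODS = ["naivebayes", "tfidf"]
# None means no feature selection; 1000 is an arbitrary illustrative value
# for the number of words kept by information gain.
INFOGAIN_SETTINGS = [None, 1000]


def build_test_commands(model_dir, methods=METHODS,
                        infogain_settings=INFOGAIN_SETTINGS,
                        test_fraction=0.4, trials=3):
    """Return one rainbow argument list per (method, feature selection) pair.

    Nothing is executed here; each list could be handed to subprocess.run().
    """
    commands = []
    for method, top_n in product(methods, infogain_settings):
        cmd = ["rainbow", "-d", model_dir, f"--method={method}"]
        if top_n is not None:
            # Assumed flag: keep only the top_n words ranked by info gain.
            cmd.append(f"--prune-vocab-by-infogain={top_n}")
        cmd += [f"--test-set={test_fraction}", f"--test={trials}"]
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # "model" is a hypothetical directory name.
    for cmd in build_test_commands("model"):
        print(" ".join(cmd))
```

With the defaults above this yields four command lines, one per set of results; how the output of each run is captured and summarized is left open.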
Your Task
You have two tasks.
- The first is to demonstrate the ability to use
rainbow on the text classification task using the 20newsgroups dataset.
Use at least two different methods supported by rainbow and at least two
settings for feature selection (e.g., with and without). Thus, you will
have at least four sets of results.
- The second, requiring much more depth, is to incorporate some
dimensionality reduction technique (anything we have discussed, including
LSI, random projections, FastMap, etc.) to generate a reduced
representation of the data set. This representation should then be used
by rainbow (or libbow) for the same classification tasks as in task 1.
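To make the idea of a reduced representation concrete, here is a minimal sketch of one of the techniques named above, Gaussian random projection, written in plain Python. It is an illustration under assumptions rather than a prescribed approach: documents are assumed to be sparse dictionaries mapping a feature index to a weight, the function names `make_projection` and `project` are hypothetical, and nothing here addresses how the reduced vectors are passed to rainbow or libbow.

```python
import math
import random


def make_projection(n_features, n_components, seed=0):
    """Build an n_features x n_components matrix of N(0, 1/n_components) entries.

    The 1/sqrt(n_components) scaling keeps squared vector lengths unchanged
    in expectation after projection.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(n_components)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
            for _ in range(n_features)]


def project(doc, matrix):
    """Project a sparse document {feature_index: weight} to a dense vector."""
    n_components = len(matrix[0])
    reduced = [0.0] * n_components
    for feature, weight in doc.items():
        row = matrix[feature]
        for c in range(n_components):
            reduced[c] += weight * row[c]
    return reduced


# Toy example: a 6-word vocabulary reduced to 3 dimensions.
matrix = make_projection(n_features=6, n_components=3, seed=42)
doc = {0: 2.0, 4: 1.0}  # word 0 occurs twice, word 4 once
reduced = project(doc, matrix)
```

The same matrix must be applied to every training and test document. For a realistic vocabulary a dense list-of-lists matrix becomes large, so a real implementation would likely use a numerical library or generate rows on demand.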
What to Hand In
Your final report should look like a conference paper. It should
introduce the problem, describe the background to the problem and the
techniques used, and discuss some related work. It should also describe
the experiments and their results, and discuss the significance of
those results. You need to at least speculate as to the cause of
differences in performance measurements; preferably, provide evidence or
other proof to support your theories.
A discussion section should emphasize the contributions that your group
has made: solutions to problems encountered, improvements to
classification performance, or a better understanding of the effect that
dimension reduction has on classification performance.
You will also give a short (5-minute, 2-3 slide) summary of your findings
in class.
Important Dates
A report detailing the results of task 1 is due by email on Wednesday,
October 6. An email outlining the approach being taken for task 2 is
due on Wednesday, October 13.
The completed assignment and written report are due
October 22nd, at the beginning of
class. Code does not need to be handed in, but it must be present in your
group directory, and the report (which should discuss your code) should
also specify its location. Presentations will be held during class on
October 22nd.
Grading
I suggest you read my advice on what
I think constitutes good writing, programming, and presentations.
Your report and in-class summary will be graded on clarity, correctness,
and presentation, so make it professional. Code will also be examined for
clarity and correctness.
Finally, you will also be asked for an evaluation of
yourselves and each other -- the relative contribution, effort provided,
and quality of work. This will be another component of your project
grade.
Last modified 18 October 2004, by Brian Davison.