WWW Search Engines Project 2

Goal

To implement dimension reduction techniques in a text classifier.

Teams

For this project, we will have five small teams. They are:

Projects are intended to be implemented using pair programming techniques. That is, all members are expected to work on the bulk of the project simultaneously (sharing one workstation); this encourages both peer teaching and better coding, since multiple people review the work as it is being done.

Bow (or libbow) is a library of C code useful for writing statistical text analysis, language modeling and information retrieval programs. The current distribution includes the library, as well as front-ends for document classification (rainbow), document retrieval (arrow) and document clustering (crossbow).

In this project, we will be using rainbow (and/or libbow) in text classification experiments. Bow compiles cleanly (warnings only) on Linux and Solaris, and is installed in /proj/searchengines/bow on the Suns.

Many gigabytes of available file space can be found in /proj/searchengines/ on the Suns. A Linux system is also available: wume4.cse.lehigh.edu has dual PIII processors and 2GB of RAM, and uses the CSE/ECE Sun accounts and filesystems. While I think it is unlikely that the task will need it, if your system needs more than the 512MB of RAM found on most Sun workstations, use europa.eecs.lehigh.edu, as it has 4GB of RAM.

Your Task

You have two tasks.

What to Hand In

Your final report should look like a conference paper. It should introduce the problem, describe the background to the problem and the techniques used, and cover some related work. It should also describe the experiments and their results, and discuss the significance of those results. You need to at least speculate as to the cause of differences in performance measurements; preferably, provide evidence or other proof to support your theories. A discussion section should emphasize the contributions that your group has made -- solutions to problems encountered, improvements to classification performance, or a better understanding of the relationship between dimension reduction and classification performance.

You will also give a short (5-minute, 2-3 slide) summary of your findings in class.

Important Dates

A report detailing the results of task 1 is due by email Wednesday October 6. An email outlining the approach being taken for task 2 is due Wednesday October 13. The completed assignment and written report are due October 22nd, at the beginning of class. Code does not need to be handed in, but it must be present in your group directory, and the report (which should discuss your code) should also specify its location. Presentations will be held during class on October 22nd.

Grading

I suggest you read my advice on what I think constitutes good writing, programming, and presentations. Your report and in-class summary will be graded on clarity, correctness, and presentation, so make it professional. Code will also be examined for clarity and correctness.

Finally, you will also be asked for an evaluation of yourselves and each other -- the relative contribution, effort provided, and quality of work. This will be another component of your project grade.


Last modified 18 October 2004, by Brian Davison.