WWW Search Engines Project 1

Goal

To download, install, and compare two search engines.

Teams

For this project, we will have five small teams. They are:

Projects are intended to be implemented using pair programming techniques. That is, all members are expected to work on the bulk of the project simultaneously (sharing one workstation); this encourages both peer teaching and better coding, since multiple sets of eyes follow what is being done.

Systems

The SMART text analysis and retrieval system was developed by Gerry Salton, primarily at Cornell, but dates back to his time at Harvard in the 1960s.

The other systems are more modern, having been written after SMART development ceased.

  1. Lucene (the basis for the open source Nutch search engine effort) can be found at http://jakarta.apache.org/lucene/docs/index.html
  2. Zettair (once called Lucy) can be found at http://www.seg.rmit.edu.au/zettair/
  3. mg (the system associated with the Managing Gigabytes text) can be found at http://www.cs.mu.oz.au/mg/
  4. DataparkSearch Engine is available from http://www.dataparksearch.org/
  5. Lemur (focusing on language modelling) is found at http://www-2.cs.cmu.edu/~lemur/

Your Task

Your group will retrieve and install both SMART and the retrieval system assigned to you. All systems should be installable on either Solaris or Linux. Many gigabytes of file space are available in /proj/searchengines/ on the Suns. A Linux system is also available: wume4.cse.lehigh.edu uses the CSE/ECE Sun accounts and filesystems. While I think it is unlikely that the task will require it, if your system needs more than the 512 MB of RAM found on most Sun workstations, use europa.eecs.lehigh.edu, which has 4 GB of RAM. Please contact me immediately (within the first few days) if you are unable to retrieve and install your software.

SMART is distributed with a number of standardized data sets. These datasets (CISI, CRAN, MED, CACM, etc.) also include relevance judgements for a given set of queries. We will focus on just the CRAN dataset. You need to index the CRAN dataset, and evaluate SMART and your assigned system using the queries provided. Two versions of each system should be evaluated -- one using as many default settings as possible (preferably however the system was distributed), and one using modified settings from your experiments to get better performance. In most cases, you will need to write some software to convert the SMART-formatted CRAN dataset into something that is useful in your system. Likewise, you may need to convert the results of your assigned system into something that can be matched against the relevance judgements provided.
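
As a starting point for the conversion step, here is a minimal Python sketch of a parser for a SMART-style collection file. It assumes each document begins with a line of the form ".I <number>" and that fields are introduced by single-letter tag lines such as ".T" (title), ".A" (author), and ".W" (text); check these assumptions against the actual CRAN files before relying on it. The function name parse_smart_collection and the dictionary layout are illustrative only and are not part of SMART.

```python
import re
from typing import Dict, List

# Assumed tag syntax: ".I 12" starts a document, ".T"/".A"/".W" etc. start a field.
FIELD_TAG = re.compile(r"^\.([A-Z])(?:\s+(\d+))?\s*$")


def parse_smart_collection(text: str) -> List[Dict[str, str]]:
    """Split a SMART-style collection into one dict per document.

    Each dict holds the document number under "id" plus one entry per
    field tag encountered (for example "T" or "W"), with the field's
    lines joined by single spaces.
    """
    docs: List[Dict[str, str]] = []
    current = None  # document being filled in
    field = None    # field tag currently being read
    for line in text.splitlines():
        match = FIELD_TAG.match(line)
        if match and match.group(1) == "I" and match.group(2):
            # A new document starts here.
            current = {"id": match.group(2)}
            docs.append(current)
            field = None
        elif match and current is not None and match.group(2) is None:
            # A new field starts inside the current document.
            field = match.group(1)
            current.setdefault(field, "")
        elif current is not None and field is not None:
            # Ordinary content line: append it to the current field.
            current[field] = (current[field] + " " + line.strip()).strip()
    return docs
```

From the resulting list you can write whatever input your assigned system expects (one file per document, an XML wrapper, and so on). The same tag-driven loop may also work for the query file, provided it follows the same layout.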

In the end, you should collect rankings for each of the 225 queries in the CRAN dataset for each system and each system configuration. Calculate the 11-point average interpolated precision to determine the best performing system and configuration on this dataset.
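
For reference, one common way to compute this measure is sketched below in Python. It assumes the usual textbook definition: interpolated precision at recall level r is the highest precision observed at any recall greater than or equal to r, evaluated at the 11 levels 0.0, 0.1, ..., 1.0 and then averaged. The function names and the input layout (a ranked list of document ids and a set of relevant ids per query) are illustrative assumptions; if you use an existing evaluation tool instead, verify that its definition matches.

```python
from typing import Dict, List, Set


def eleven_point_interpolated(ranking: List[str], relevant: Set[str]) -> List[float]:
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query.

    Assumes `ranking` contains no duplicate document ids.
    """
    if not relevant:
        raise ValueError("query has no relevant documents")
    # Record (recall, precision) at every rank where a relevant document appears.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolate: best precision at any recall >= the level (0.0 if never reached).
    return [
        max((p for r, p in points if r >= level - 1e-12), default=0.0)
        for level in levels
    ]


def eleven_point_average(ranking: List[str], relevant: Set[str]) -> float:
    """11-point average interpolated precision for one query."""
    values = eleven_point_interpolated(ranking, relevant)
    return sum(values) / len(values)


def mean_over_queries(rankings: Dict[str, List[str]],
                      judgements: Dict[str, Set[str]]) -> float:
    """Average the per-query score over all queries that have judgements."""
    scores = [
        eleven_point_average(rankings.get(qid, []), rel)
        for qid, rel in judgements.items()
        if rel
    ]
    return sum(scores) / len(scores)
```

Calling mean_over_queries once per system and configuration yields a single number for each, which makes the comparison across systems straightforward.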

What to Hand In

You will need to hand in a report, containing at least the following four sections:
  1. Installation and usage experiences --- tell the story of what you expected, what didn't work, and how you faced any challenges in installing and operating the retrieval software.
  2. User installation guide --- a straightforward guide to installing and using the software so that another student in the class would not face the same difficulties that you may have encountered.
  3. A comparative evaluation of the two systems (SMART vs. your assigned retrieval system). Provide not only quantitative performance results (e.g., the 11-point average interpolated precision mentioned above) but qualitative comparisons in terms of ease of use, resources needed, features provided, etc. Help the reader understand which one is better in which aspects, and why.
  4. Appendices covering any (substantial) coding that was performed (conversion scripts, statistics generating, etc.).
You will also give a short (5 minutes, 2-3 slide) summary of your findings in class.

Important Dates

A status report telling me that you have completed installation is due by email on Thursday, September 9th (up until midnight). The completed assignment and written report are provisionally due September 17th, at the beginning of class. Presentations will be held during class on September 20th.

Grading

I suggest you read my advice on what I think constitutes good writing, programming, and presentations. Your report and in-class summary will be graded on clarity, correctness, and presentation, so make it professional. Any code will also be examined for clarity and correctness.

Finally, you will also be asked for an evaluation of yourselves and each other -- the relative contribution, effort provided, and quality of work. This will be another component of your project grade.


Last modified 3 September 2004, by Brian Davison.