Goal
To download, install, and compare two search engines.
Teams
For this project, we will have five small teams. They are:
- Team 1: Vinay Goel and Matthew Kreibel (Lucene)
- Team 2: Chris Janneck and Xiaoguang Qi (Zettair)
- Team 3: Philip Garcia and Chad Hogg (mg)
- Team 4: Keith Erekson and Lan Nie (DataparkSearch)
- Team 5: Chris Kramer, Andy Powers, and Walter Scheirer (Lemur)
Projects are intended to be implemented using pair programming techniques.
That is, all members are expected to work on the bulk of the project
together (sharing one workstation); this encourages both peer teaching
and better coding, since more than one person is checking what is being done.
Systems
The SMART text analysis and retrieval system was developed by Gerry Salton,
primarily at Cornell, but dates back to his time at Harvard in the 1960s.
The other systems are modern, having been written after SMART
development ceased.
- Lucene (the basis for the open source Nutch search engine effort) can be found
at http://jakarta.apache.org/lucene/docs/index.html
- Zettair (once called Lucy) can be found at http://www.seg.rmit.edu.au/zettair/
- mg (the system associated with the Managing Gigabytes text) can be found
at http://www.cs.mu.oz.au/mg/
- DataparkSearch Engine is available from http://www.dataparksearch.org/
- Lemur (focusing on language modelling) is found at http://www-2.cs.cmu.edu/~lemur/
Your Task
Your group will retrieve and install both SMART and the retrieval system
assigned to you. All systems should be installable on either Solaris or
Linux. Many gigabytes of available file space can be found in
/proj/searchengines/ on the Suns. A Linux system is also available:
wume4.cse.lehigh.edu uses the CSE/ECE Sun accounts and filesystems.
While I think it is unlikely that the task will need it, if your system
needs more than the 512MB of RAM found on most Sun workstations, use
europa.eecs.lehigh.edu, as it has 4GB of RAM. Please contact me
immediately (within the first few days) if you are unable to retrieve and
install your software.
SMART is distributed with a number of standardized data sets. These
datasets (CISI, CRAN, MED, CACM, etc.) also include relevance judgements
for a given set of queries. We will focus on just the CRAN dataset.
You need to index the CRAN dataset, and evaluate SMART and your assigned
system using the queries provided. Two versions of each system should be
evaluated -- one using as many default settings as possible (preferably
however the system was distributed), and one using modified settings from
your experiments to get better performance. In most cases, you will need
to write some software to convert the SMART-formatted CRAN dataset into
something that is useful in your system. Likewise, you may need to
convert the results of your assigned system into something that can be
matched against the relevance judgements provided.
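As a starting point for the conversion step, here is a minimal sketch of a
reader for the collection file. It assumes the common SMART layout in which
each document begins with a line of the form ".I <number>" and fields are
introduced by marker lines such as ".T", ".A", ".B", and ".W"; check these
markers against your copy of the CRAN data before relying on it. The function
name and the field names in the output are hypothetical, chosen only for
illustration.

```python
import re
from typing import Dict, Iterable, Iterator

# Assumed field markers and illustrative output names; verify them against
# your copy of the collection.
FIELD_NAMES = {".T": "title", ".A": "author", ".B": "biblio", ".W": "text"}


def parse_smart_collection(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per document from an iterable of text lines."""
    doc = None    # document currently being assembled
    field = None  # field that the following lines belong to
    for raw in lines:
        line = raw.strip()
        match = re.match(r"^\.I\s+(\d+)$", line)
        if match:
            # A new ".I" line closes the previous document, if any.
            if doc is not None:
                yield doc
            doc = {"id": match.group(1)}
            field = None
        elif doc is not None and line in FIELD_NAMES:
            field = FIELD_NAMES[line]
            doc.setdefault(field, "")
        elif doc is not None and field is not None:
            # Continuation lines are joined with single spaces.
            doc[field] = (doc[field] + " " + line).strip()
    if doc is not None:
        yield doc
```

Each yielded dict can then be written out in whatever form your assigned
system indexes (one file per document, a single XML file, and so on). Keep
the original document numbers, since the relevance judgements refer to them.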
In the end, you should collect rankings for each of the 225 queries in
the CRAN dataset for each system and each system configuration.
Calculate the 11-point average interpolated precision to determine the
best performing system and configuration on this dataset.
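For reference, one common way to compute this measure is sketched below. For
each query, the interpolated precision at a recall level is taken to be the
highest precision observed at any point in the ranking where recall is at
least that level; these values are averaged over the 11 levels 0.0, 0.1, ...,
1.0, and the per-query scores are then averaged over all queries. The function
names and the input shapes (a ranked list of document ids per query, a set of
relevant ids per query) are assumptions made for this example, and the sketch
assumes each ranking contains no duplicate ids.

```python
from typing import Dict, List, Sequence, Set

RECALL_LEVELS = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0


def eleven_point_precision(ranking: Sequence[str], relevant: Set[str]) -> List[float]:
    """Interpolated precision at the 11 standard recall levels for one query."""
    if not relevant:
        return [0.0] * len(RECALL_LEVELS)
    hits = 0
    points = []  # (recall, precision) at each rank holding a relevant document
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    interpolated = []
    for level in RECALL_LEVELS:
        # Best precision at any recall >= level; small tolerance for rounding.
        candidates = [p for r, p in points if r >= level - 1e-9]
        interpolated.append(max(candidates, default=0.0))
    return interpolated


def eleven_point_average(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Average of the 11 interpolated precision values for one query."""
    return sum(eleven_point_precision(ranking, relevant)) / len(RECALL_LEVELS)


def mean_over_queries(rankings: Dict[str, Sequence[str]],
                      judgements: Dict[str, Set[str]]) -> float:
    """Mean per-query score over all queries that have relevant documents."""
    scores = [eleven_point_average(rankings.get(q, []), rel)
              for q, rel in judgements.items() if rel]
    return sum(scores) / len(scores) if scores else 0.0
```

Queries with no relevant documents are skipped here rather than counted as
zero; whichever convention you choose, state it in your report and apply it
identically to both systems.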
What to Hand In
You will need to hand in a report, containing at least the following
four sections:
- Installation and usage experiences --- tell the story of what you
expected, what didn't work, and how you faced any challenges in
installing and operating the retrieval software.
- User installation guide --- a straightforward guide to installing and
using the software so that another student in the class would not face
the same difficulties that you may have encountered.
- A comparative evaluation of the two systems (SMART vs. your assigned
retrieval system). Provide not only quantitative performance results
(e.g., the 11-point average interpolated precision mentioned above)
but qualitative comparisons in terms of ease of use, resources needed,
features provided, etc. Help the reader understand which one is better
in which aspects, and why.
- Appendices covering any (substantial) coding that was performed
(conversion scripts, statistics generating, etc.).
You will also give a short (5-minute, 2-3 slide) summary of your findings
in class.
Important Dates
A status report telling me that you have completed installation is due by
email on Thursday, September 9th (any time before midnight).
The completed assignment and written report are provisionally due
September 17th, at the beginning of class. Presentations will be held
during class on September 20th.
Grading
I suggest you read my advice on what
I think constitutes good writing, programming, and presentations.
Your report and in-class summary will be graded on clarity, correctness,
and presentation, so make it professional. Any code will also be examined
for clarity and correctness.
Finally, you will also be asked for an evaluation of
yourselves and each other -- the relative contribution, effort provided,
and quality of work. This will be another component of your project
grade.
Last modified 3 September 2004, by Brian Davison.