WWW Search Engines Course Project

The long-term goal for this project is the implementation of a fully functional search engine that will crawl, index, and query the contents of millions of web pages. To accomplish this goal over the remainder of the semester, we will break the project into stages and implement them in groups.

Note that you will be asked to evaluate both your own work and the work of the other member(s) of your team. Initial team assignments will be made by Prof. Davison.

Each stage will typically require three documents, each of which must be signed by all members of the team. Each document should be written in HTML so that it can be placed online easily for others to read. They are:

In general, these documents should also be written to help convince others that they should use your code.

A search engine can be broken up into a few major parts:

The initial implementation of each system should be basic, providing the required functionality in a modular way so that it can be extended later.

Grading

I suggest you read what I think constitutes good programming. Grading generally will be based on:


Project 1: Implement Back-End Parser and Indexer

We have now covered the topics relevant to implementing the parser and indexer. The parser and indexer can be implemented as separate programs, but for smaller engines (like ours) they are more likely to be combined into a single program. In class we discussed what needs to be included: determining what to index, whether to give emphasis to certain kinds of text, how to index links, and so on.
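As a rough illustration of a combined parser and indexer, the sketch below uses only the Python standard library. The names `PageParser`, `tokenize`, and `build_index` are hypothetical, and the choices made here (lowercased alphanumeric tokens, title tokens placed ahead of body tokens, script and style content skipped, anchor text stored with each link) are assumptions for illustration, not requirements of the assignment.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


class PageParser(HTMLParser):
    """Collects title text, body text, and (href, anchor text) pairs."""

    def __init__(self):
        super().__init__()
        self.title = []
        self.text = []
        self.links = []        # list of (href, anchor_text)
        self._in_title = False
        self._skip = 0         # depth inside <script>/<style>
        self._href = None      # href of the <a> currently open, if any
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "title":
            self._in_title = True
        elif tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._anchor).strip()))
            self._href = None

    def handle_data(self, data):
        if self._skip:
            return             # ignore script and style content
        if self._in_title:
            self.title.append(data)
        else:
            self.text.append(data)
        if self._href is not None:
            self._anchor.append(data)


def build_index(pages):
    """pages: dict of doc_id -> HTML string.

    Returns (index, links) where index maps term -> {doc_id: [positions]}
    and links maps doc_id -> list of (href, anchor_text).
    """
    index = defaultdict(dict)
    links = {}
    for doc_id, html in pages.items():
        parser = PageParser()
        parser.feed(html)
        parser.close()
        # Title tokens come first so a later stage could weight them.
        tokens = tokenize(" ".join(parser.title)) + tokenize(" ".join(parser.text))
        for pos, term in enumerate(tokens):
            index[term].setdefault(doc_id, []).append(pos)
        links[doc_id] = parser.links
    return index, links
```

Storing positions rather than bare counts keeps the door open for phrase queries, and keeping the anchor text next to each link lets a later stage credit that text to the link's target page.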


Project 2: Implement Retriever/Ranker and Interface

The result of this project is a fully functional, text-based retrieval and ranking system with a Web (CGI) interface. Ranking should use a high-quality term-weighting and ranking function; we will compare the results of the various engines. Engines should also incorporate anchor text. I recommend (but do not require) that the CGI contact a persistent server.
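The assignment does not name a particular weighting scheme. As one possible choice, the sketch below scores documents with Okapi BM25, using a common non-negative form of the IDF term. The function name `bm25_rank`, the parameter defaults `k1=1.2` and `b=0.75`, and the in-memory `docs` mapping are all assumptions for illustration; a real engine would read term frequencies from its index instead.

```python
import math
from collections import Counter


def bm25_rank(query_terms, docs, k1=1.2, b=0.75):
    """Rank documents for a query with BM25.

    docs: dict of doc_id -> list of tokens.
    Returns a list of (doc_id, score), best first; documents that
    match no query term are omitted.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs

    # Per-document term frequencies and collection-wide document frequencies.
    tf = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())

    scores = {}
    for doc_id, counts in tf.items():
        doc_len = len(docs[doc_id])
        score = 0.0
        for term in query_terms:
            f = counts.get(term, 0)
            if f == 0:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            norm = f + k1 * (1.0 - b + b * doc_len / avg_len)
            score += idf * f * (k1 + 1.0) / norm
        if score > 0.0:
            scores[doc_id] = score

    # Sort by descending score, breaking ties by doc_id for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
```

One simple way to incorporate anchor text under this scheme is to append the tokens of each incoming link's anchor text to the target document's token list before ranking.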


Project 3: Add Crawling or Link-Based Authority Ranking

The result of this project is a useful search engine, either for Lehigh internal use or for the UK dataset.

This project may be implemented by teams of your own choosing (2-3 members).
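For teams choosing the link-based authority option, one well-known approach is PageRank computed by power iteration. The sketch below is an illustration under stated assumptions, not the required method: the function name `pagerank`, the damping factor of 0.85, the fixed iteration count, and the uniform redistribution of rank from pages without outlinks are all choices made here.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links: dict of page -> iterable of pages it links to.
    Returns a dict of page -> score; scores sum to 1.
    """
    # Include pages that appear only as link targets.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        dangling = 0.0
        for page in pages:
            targets = set(links.get(page, ()))
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No outlinks: spread this page's rank over all pages.
                dangling += rank[page]
        for page in pages:
            new_rank[page] += damping * dangling / n
        rank = new_rank
    return rank
```

The resulting scores could be combined with a text-based score at query time, for example as a weighted sum. How to weight the two is a tuning decision left to each team.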


This page is http://www.cse.lehigh.edu/~brian/course/2007/searchengines/project.html
Last revised: 1 April 2007, Brian D. Davison.