WWW Search Engines Course Project

As discussed in class on Thursday, September 12, the goal for this project is the implementation of a Lehigh University search engine. It will crawl, index, and query the contents of the tens of thousands of pages associated with Lehigh. To accomplish this goal over the remainder of the semester, we will initially break up into pairs (to allow for rapid development using pair programming), and have at least two teams implement each part of the system (so that later stages can choose the best implementation).

Note that you will be asked to evaluate both your own work and the work performed by the other member(s) of each team. Initial team assignments will be made by Prof. Davison.

Each stage will typically require three documents, each of which must be signed by all members of the team. Each document should be written in HTML so that it can be placed online easily for others to read. They are:

In general, these documents should also be written to help convince others why they should use your code.

A search engine can be broken up into a few major parts:

The initial implementation of each system should be basic, providing the required functionality in a modular way so that it can be extended later.

Grading

I suggest you read what I think constitutes good programming. Grading will be based on:


Project 1: Implement Back-End Parser and Indexer

We have covered the topics relevant to implementing the parser and indexer, so we start with these modules. The parser and indexer can be implemented as separate programs, but for smaller engines (like ours) they are more likely to be combined into a single program.

Team assignments will be sent via email. A sample file containing multiple pages has been prepared so that you can get an idea of what the crawler will create.
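
As a rough illustration of a combined parser and indexer, the sketch below splits a crawler file into pages and builds a small in-memory inverted index. It assumes the file layout described under Project 3 (each page inside <doc></doc>, with the URL in <url></url> and the identifier in <docid></docid>). The exact placement of those tags in the sample file, the tokenization rule, and all function names here are assumptions made for illustration, not requirements of the assignment.

```python
import re
from collections import defaultdict

# Assumed record layout: <doc> <docid>..</docid> <url>..</url> ...HTML... </doc>
DOC_RE = re.compile(r"<doc>(.*?)</doc>", re.DOTALL | re.IGNORECASE)
DOCID_RE = re.compile(r"<docid>(.*?)</docid>", re.DOTALL | re.IGNORECASE)
URL_RE = re.compile(r"<url>(.*?)</url>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]*>")       # crude HTML tag stripper
TOKEN_RE = re.compile(r"[a-z0-9]+")   # assumed definition of a term


def parse_pages(text):
    """Yield (docid, url, html) for each <doc> record in the crawler file."""
    for match in DOC_RE.finditer(text):
        body = match.group(1)
        docid = DOCID_RE.search(body)
        url = URL_RE.search(body)
        if docid is None or url is None:
            continue  # skip malformed records rather than guess
        # Remove the metadata tags so they are not indexed as page text.
        html = DOCID_RE.sub("", body, count=1)
        html = URL_RE.sub("", html, count=1)
        yield docid.group(1).strip(), url.group(1).strip(), html


def tokenize(html):
    """Strip tags, lowercase, and split the page into alphanumeric terms."""
    return TOKEN_RE.findall(TAG_RE.sub(" ", html).lower())


def build_index(text):
    """Return (index, urls): index maps term -> {docid: term frequency}."""
    index = defaultdict(dict)
    urls = {}
    for docid, url, html in parse_pages(text):
        urls[docid] = url
        for term in tokenize(html):
            index[term][docid] = index[term].get(docid, 0) + 1
    return index, urls
```

Keeping parse_pages, tokenize, and build_index as separate functions is one way to keep the program modular, so that the parser or the index structure could later be replaced independently. A real indexer would also need to write the index to disk and handle details this sketch ignores, such as script contents and character entities.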

Due dates:


Project 2: Implement Retriever/Ranker and Interface

Due dates:


Project 3: Implement Crawler and/or Extensions

The crawler will create one or more files such as sample1.txt, in which each HTML page within the file is enclosed within <doc></doc> tags, and also contains the page URL within <url></url> and a unique value within <docid></docid>. Crawlers must obey robots.txt directives, and must not revisit any host more than once every ten seconds. You must also add your email address to the User-Agent designation used by the crawler, and put this web page as the referrer. This way any webmaster that has received requests from your crawler will know how to contact you and can learn about this assignment. While in general I encourage the re-use of publicly available software and libraries, you may not just modify an existing crawler (such as wget). On the other hand, you may use libraries that provide (for example) robots.txt parsing. The crawler should crawl all lehigh.edu Web pages it can find, and generate a dataset appropriate for the other parts of the system.
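
The politeness rules above might be sketched as follows. This is a minimal outline, not a complete crawler: it uses Python's standard urllib.robotparser for robots.txt rules, a per-host timer for the ten-second limit, and request headers carrying a contact address and this page as the referrer. The crawler name, the placeholder email address, and the names HostThrottle, load_robots, build_request, and polite_fetch are all hypothetical.

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

PROJECT_PAGE = "http://www.cse.lehigh.edu/~brian/course/searchengines/project.html"
# Hypothetical crawler name and placeholder address; substitute your own email.
USER_AGENT = "LehighCourseCrawler/0.1 (your-email@example.edu)"
MIN_DELAY = 10.0  # seconds between visits to the same host


class HostThrottle:
    """Tracks the last visit time per host to enforce a minimum delay."""

    def __init__(self, min_delay=MIN_DELAY, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.last_visit = {}

    def wait_time(self, url):
        """Seconds to wait before the host of this URL may be visited again."""
        last = self.last_visit.get(urlsplit(url).netloc.lower())
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record_visit(self, url):
        self.last_visit[urlsplit(url).netloc.lower()] = self.clock()


def load_robots(robots_lines):
    """Build a robots.txt rule set from the lines of an already fetched file."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_lines)
    return parser


def build_request(url):
    """Attach the contact address and the project page to every request."""
    headers = {"User-Agent": USER_AGENT, "Referer": PROJECT_PAGE}
    return urllib.request.Request(url, headers=headers)


def polite_fetch(url, robots, throttle):
    """Fetch url if robots.txt allows it, waiting out the per-host delay."""
    if not robots.can_fetch(USER_AGENT, url):
        return None
    time.sleep(throttle.wait_time(url))
    try:
        with urllib.request.urlopen(build_request(url), timeout=30) as response:
            return response.read()
    finally:
        # Record after the attempt so the delay counts from the end of the visit.
        throttle.record_visit(url)
```

A full crawler would keep one robots.txt rule set per host, count the robots.txt request itself as a visit to that host, and restrict its frontier to lehigh.edu URLs; those pieces are left out here.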

Possible extensions include: compression, link analysis, query expansion, LSI, phrase/expression matching, indexing link text. Please see me about your plans for extensions before handing in a proposal.

This project may be implemented individually or in teams of your own choosing. Only members of the team that created a system in Project 2 should propose extensions to it.

Due dates:


This page is http://www.cse.lehigh.edu/~brian/course/searchengines/project.html
Last revised: 24 November 2002.