As discussed in class on Thursday, September 12, the goal for this project is the implementation of a Lehigh University search engine. It will crawl, index, and query the contents of the tens of thousands of pages associated with Lehigh. To accomplish this goal over the remainder of this semester, we will initially break up into pairs (to allow for rapid development using pair programming), and have at least two teams implement each part of the system (so that later stages can choose the best implementation).
Note that you will be asked for an evaluation of the work performed by yourselves and that performed by the other member(s) of each team. Initial team assignments will be made by Prof. Davison.
Each stage will typically require three documents, each of which must be signed by all members of the team. Each document should be written in HTML so that it can be placed online easily for others to read. They are:
A search engine can be broken up into a few major parts:
We have covered the topics relevant to implementing the parser and indexer, so we start with these modules. The parser and indexer can be implemented as separate programs, but in smaller engines (like ours) they are more likely to be combined into a single program. Team assignments will be sent via email. A sample file containing multiple pages has been prepared so that you can get an idea of what the crawler will create.
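As a rough illustration of a combined parser and indexer, the sketch below reads crawler output and builds a small in-memory inverted index. It assumes the record layout described in the crawler section of this page (each page inside <doc></doc>, with <url></url> and <docid></docid> elements); the exact layout of the sample file may differ. All function names, the regular-expression tag stripping, and the simple tokenizer are illustrative choices, not requirements of the assignment.

```python
import re
from collections import defaultdict

# Assumed record layout: <doc> <docid>ID</docid> <url>URL</url> page HTML </doc>
DOC_RE = re.compile(r"<doc>(.*?)</doc>", re.DOTALL | re.IGNORECASE)
DOCID_RE = re.compile(r"<docid>(.*?)</docid>", re.DOTALL | re.IGNORECASE)
URL_RE = re.compile(r"<url>(.*?)</url>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"[a-z0-9]+")


def parse_documents(text):
    """Yield (docid, url, body) for each <doc> record in the crawler output."""
    for match in DOC_RE.finditer(text):
        block = match.group(1)
        docid = DOCID_RE.search(block)
        url = URL_RE.search(block)
        if docid is None or url is None:
            continue  # skip malformed records rather than guessing
        # Drop the metadata elements so only the page HTML remains.
        body = DOCID_RE.sub("", URL_RE.sub("", block))
        yield docid.group(1).strip(), url.group(1).strip(), body


def tokenize(html):
    """Crudely strip tags and return lowercase alphanumeric tokens."""
    return TOKEN_RE.findall(TAG_RE.sub(" ", html).lower())


def build_index(text):
    """Return (index, urls): term -> {docid: term frequency}, docid -> url."""
    index = defaultdict(dict)
    urls = {}
    for docid, url, body in parse_documents(text):
        urls[docid] = url
        for term in tokenize(body):
            index[term][docid] = index[term].get(docid, 0) + 1
    return index, urls
```

Calling build_index on the contents of a crawler file returns the postings and a docid-to-URL table; a real indexer would likely use a proper HTML parser and write the index to disk rather than keep it in memory.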
Due dates:
- A printed copy of the design document is due in class Thursday, September 19.
- A printed copy of the API is due in class Thursday, September 26.
- The completed module is due (by email), along with a printed copy of the implementation document, in class Tuesday, October 8.
Due dates:
- A printed copy of the design document for both parts, including API specifications, is due in class Tuesday, October 22.
- The completed system (by email), along with a printed copy of your implementation document, is due in class Thursday, October 31.
The crawler will create one or more files such as sample1.txt, in which each HTML page within the file is enclosed within <doc></doc> tags, and also contains the page URL within <url></url> and a unique value within <docid></docid>.
Crawlers must obey robots.txt directives, and must not revisit any host more than once every ten seconds. You must also add your email address to the User-Agent string used by the crawler, and set this web page as the referrer. This way any webmaster who has received requests from your crawler will know how to contact you and can learn about this assignment.
While in general I encourage the re-use of publicly available software and libraries, you may not just modify an existing crawler (such as wget). On the other hand, you may use libraries that provide, for example, robots.txt parsing. The crawler should crawl all lehigh.edu Web pages it can find, and generate a dataset appropriate for the other parts of the system.
Possible extensions include: compression, link analysis, query expansion, LSI, phrase/expression matching, indexing link text. Please see me about your plans for extensions before handing in a proposal.
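The crawler's politeness requirements stated above (obeying robots.txt, waiting at least ten seconds between requests to the same host, and identifying yourself through the User-Agent and referrer headers) might be sketched as follows using only the Python standard library. This is one possible arrangement, not a required design. The USER_AGENT and REFERER values are placeholders to be replaced with your own email address and the URL of this page, and HostThrottle, load_robots, build_request, and fetch are hypothetical names.

```python
import time
import urllib.request
import urllib.robotparser
from urllib.parse import urlsplit

# Placeholder values, not taken from the assignment: substitute your own.
USER_AGENT = "lehigh-course-crawler (student@example.edu)"
REFERER = "http://www.example.edu/course/project.html"
MIN_DELAY = 10.0  # seconds between requests to the same host


class HostThrottle:
    """Track the last request time per host and report how long to wait."""

    def __init__(self, min_delay=MIN_DELAY, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.last_request = {}

    def seconds_to_wait(self, url):
        host = urlsplit(url).netloc.lower()
        last = self.last_request.get(host)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        self.last_request[urlsplit(url).netloc.lower()] = self.clock()


def load_robots(robots_txt):
    """Parse the text of a robots.txt file that has already been downloaded."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def build_request(url):
    """Create a request that identifies the crawler and this assignment."""
    return urllib.request.Request(
        url, headers={"User-Agent": USER_AGENT, "Referer": REFERER}
    )


def fetch(url, robots, throttle):
    """Fetch url if robots.txt permits it, sleeping to honor the host delay."""
    if not robots.can_fetch(USER_AGENT, url):
        return None
    time.sleep(throttle.seconds_to_wait(url))
    throttle.record(url)
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return response.read()
```

A full crawler would keep one parsed robots.txt per host and draw URLs from a frontier queue, so that time spent waiting on one host can be used to fetch from another.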
This project may be implemented individually or in teams of your own choosing. Only members of the team that created a system in part 2 should propose extensions to it.
Due dates:
- A printed copy of the proposal/design document for the extension or crawler is due in class Thursday, November 14.
- The completed system (by email), along with a printed copy of an implementation document, is due in class Tuesday, December 3.