WWW Search Engines Course Project
The long-term goal of this project is to implement a fully functional
search engine that can crawl, index, and query the contents of millions
of web pages. To accomplish this goal over the remainder of the semester,
we will break the project into stages and implement them in groups.
Note that you will be asked to evaluate your own work and the work
performed by the other member(s) of your team.
Initial team assignments will be made by Prof. Davison.
Each stage will typically require three documents, each of which must
be signed by all members of the team. Each document should be written
in HTML so that it can be placed online easily for others to read. They are:
- Design Document: 2-4 pages describing what your project
will do, how you expect to implement it, what your design options and your
decisions were, and why you made those decisions.
- API Document: how to use your code (without seeing the code),
snippets of sample usage, error conditions, types, etc.
- Implementation Document: 3-4 pages describing what you did, how and
why it differs from the design document (if it does), and a statement
about performance: does it work, does it do what is needed, and how was
it tested? This must be a standalone document (one that can be used and
understood without the earlier documentation).
In general, these documents should also be written to convince others
to use your code.
A search engine can be broken up into a few major parts:
- Back-End Systems
  - Crawler: retrieve pages, extract links, store pages in files, assign
    URL IDs.
  - Parser: extract terms from files, extract titles.
  - Indexer: create on-disk dictionary, inverted index(es), titles
    lookup, URL lookup.
- Front-End Systems
  - Interface: get query, normalize and tokenize, ask Retriever for
    results, then look up URLs and titles for results.
  - Retriever/TextRanker: use inverted indexes to find matching pages and
    score them textually for relevance.
  - Link Analyzer: perform link-graph based algorithms to determine
    authority/quality of pages.
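As one illustration of what a Link Analyzer might compute, the following is a minimal sketch of PageRank by power iteration. PageRank is only one of several link-graph algorithms and is not prescribed by this page. The function name, the dictionary-based graph format, and the default parameters are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Sketch of PageRank by power iteration.

    links: dict mapping a URL ID to the list of URL IDs it links to
    (a hypothetical in-memory format chosen for this example).
    Returns a dict mapping each URL ID to a score; scores sum to 1.
    """
    # Collect every page that appears as a source or as a target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    if n == 0:
        return {}

    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {page: base for page in pages}
        for page, targets in links.items():
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

For a real crawl the graph would not fit in a Python dictionary of lists, so a production Link Analyzer would more likely read the link structure from disk in passes.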
The initial implementation of each system should be basic, providing
the required functionality in a modular way so that it can be extended
later.
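To make the idea of a basic, modular first version concrete, here is a small sketch of the link extraction (Crawler) and title extraction (Parser) steps using only the Python standard library. The class name `LinkTitleExtractor` and the helper `extract` are hypothetical names invented for this example, and a real crawler would also need to handle fetching, politeness, and malformed pages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTitleExtractor(HTMLParser):
    """Collects absolute link URLs and the page title from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page.
                self.links.append(urljoin(self.base_url, href))
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Title text may arrive in several chunks, so accumulate it.
        if self._in_title:
            self.title += data


def extract(html, base_url):
    """Return (title, links) for one page of HTML text."""
    parser = LinkTitleExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.title.strip(), parser.links
```

Keeping this step behind a single function makes it easier for another team's module to call it without knowing how the HTML is parsed.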
Grading
I suggest you read what
I think constitutes good programming. Grading generally will be based
on:
- Functional, efficient code implementing the required task.
- Clear, concise, and well-written external documentation.
- Extensive comments and modular design of code.
- The assessment of effort made by each group participant.
- The use of your code by other teams/modules.
We have now covered the topics relevant to implementing the parser and
indexer. The parser and indexer can be implemented as separate programs,
but for smaller engines (like ours) they are more likely to be combined
into a single program. We discussed the design decisions involved,
including what to index, whether to give emphasis to certain kinds of
text, how to index links, etc.
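A combined parser and indexer could be sketched as follows. This is an in-memory illustration only, under several assumptions: the tokenizer (lowercased runs of letters and digits), the `title_weight` parameter as one way of emphasizing title text, and the input format are all invented for the example. The real indexer is expected to write its dictionary and inverted index to disk.

```python
import re
from collections import defaultdict

# Assumed tokenization rule: lowercase runs of ASCII letters and digits.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Normalize text to lowercase and split it into terms."""
    return TOKEN_RE.findall(text.lower())


def build_index(docs, title_weight=2):
    """Build an in-memory inverted index.

    docs: dict mapping a document ID to a (title, body) pair.
    Returns a dict mapping each term to {doc_id: weighted term count},
    where each occurrence in a title counts title_weight times.
    """
    index = defaultdict(dict)
    for doc_id, (title, body) in docs.items():
        for weight, text in ((title_weight, title), (1, body)):
            for term in tokenize(text):
                postings = index[term]
                postings[doc_id] = postings.get(doc_id, 0) + weight
    return dict(index)
```

Using the same `tokenize` function in the front-end Interface would keep query terms consistent with indexed terms.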
The result of this project is a fully functional, text-based retrieval and
ranking system, with a Web (CGI) interface. Ranking should make use of
a high-quality term-weighting and ranking function -- we will compare
the results of the various engines. Engines should also incorporate
anchor text. I recommend (but do not require) that the CGI contact a
persistent server.
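As one possible term-weighting and ranking function, the sketch below scores documents with Okapi BM25. BM25 is chosen here only as a well-known example and is not required by this page. The function name, the token-list input format, and the parameter defaults `k1=1.2` and `b=0.75` are assumptions. A real Retriever would read term statistics from the inverted index instead of recounting tokens for each query.

```python
import math
from collections import Counter


def bm25_rank(query_terms, docs, k1=1.2, b=0.75):
    """Rank documents for a query with BM25.

    query_terms: list of already normalized query terms.
    docs: dict mapping a document ID to its list of tokens.
    Returns a list of (doc_id, score) pairs, best first, containing
    only documents that match at least one query term.
    """
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    term_counts = {doc_id: Counter(tokens) for doc_id, tokens in docs.items()}

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    scores = {}
    for doc_id, counts in term_counts.items():
        doc_len = len(docs[doc_id])
        score = 0.0
        for term in set(query_terms):
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # The "1 +" inside the logarithm keeps the IDF positive.
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Longer-than-average documents are penalized through b.
            norm = tf + k1 * (1 - b + b * doc_len / avg_len)
            score += idf * tf * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Anchor text could be incorporated by appending the anchor terms of incoming links to the token list of the target document, possibly with a separate weight.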
The result of this project is a useful search engine, either for internal
Lehigh use or over the UK dataset.
This project may be implemented by teams of your own
choosing (2-3 members).
This page is
http://www.cse.lehigh.edu/~brian/course/2007/searchengines/project.html
Last revised: 1 April 2007, Brian D. Davison.