Project 1
Outline
- Arbitrary groups of mixed backgrounds
- See Project 1 Group Assignments
- Choose team name
- Divide up roles and tasks, including a team leader and coders
- Use pair programming techniques to enable rapid development and
cross-teaching of knowledge and skills
- Data
- About 3M documents from a recent crawl of the UK, but we'll use just
440K of them to start (summary0-400.warc.gz).
- In /proj/searchengines/data (15GB compressed for all docs)
- Resources
- Any language you want
- Any supporting libraries you can find
- Run on the Suns (including the new, fast AMD-based machines)
- Grading
- Correctness (right results), quality (robust, not crashing)
- Documentation (reports, code comments)
- Usability by a future team
- Bonus for best compression achieved
- Presentation
- Tasks
- Parse archives
- Extract terms
- Extract anchor text (index for anchor text is optional)
- Extract links (index for links is optional)
- Build dictionary of terms
- Create postings list for terms
- Create compressed postings list
- Create API for retrieval given term (or term identifier)
- Create simple application using API
- Given a term, output the list of all documents containing the term,
along with the term's positions in each document
- Document everything
- Design document
As described for all projects; also include estimated memory and disk
usage for key data structures
- API documentation
- Implementation document
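As a rough illustration of the indexing tasks above (extracting terms, building
positional postings, compressing them, and retrieving by term), here is a
minimal Python sketch. It is not a required design. The function names
(tokenize, build_index, lookup), the simple lowercase alphanumeric tokenizer,
the in-memory dictionary, and the choice of gap encoding followed by
variable-byte coding are all assumptions made for illustration. It also assumes
the documents have already been parsed out of the archives into (doc_id, text)
pairs with increasing positive doc ids.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")  # assumed tokenizer: runs of letters/digits


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN_RE.findall(text.lower())


def vbyte_encode(numbers):
    """Variable-byte encode non-negative ints, 7 bits per byte.

    The high bit is set only on the last byte of each number.
    """
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk[0] |= 0x80  # flag the least-significant group as the final byte
        out.extend(reversed(chunk))
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode: return the list of encoded integers."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:  # final byte of this number
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers


def compress(plist):
    """Flatten [(doc_id, [positions])] into gaps, then variable-byte encode.

    Layout per document: doc-id gap, position count, position gaps.
    """
    flat, prev_doc = [], 0
    for doc_id, pos_list in plist:
        flat.extend((doc_id - prev_doc, len(pos_list)))
        prev_pos = 0
        for pos in pos_list:
            flat.append(pos - prev_pos)
            prev_pos = pos
        prev_doc = doc_id
    return vbyte_encode(flat)


def build_index(docs):
    """Build {term: compressed postings} from (doc_id, text) pairs."""
    postings = defaultdict(list)  # term -> [(doc_id, [positions])]
    for doc_id, text in docs:
        positions = defaultdict(list)
        for pos, term in enumerate(tokenize(text)):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            postings[term].append((doc_id, pos_list))
    return {term: compress(plist) for term, plist in postings.items()}


def lookup(index, term):
    """Return [(doc_id, [positions])] for a term, or [] if it is not indexed."""
    data = index.get(term.lower())
    if data is None:
        return []
    nums = vbyte_decode(data)
    result, i, doc_id = [], 0, 0
    while i < len(nums):
        doc_id += nums[i]          # undo the doc-id gap
        count = nums[i + 1]
        i += 2
        positions, pos = [], 0
        for gap in nums[i:i + count]:
            pos += gap             # undo the position gap
            positions.append(pos)
        i += count
        result.append((doc_id, positions))
    return result


if __name__ == "__main__":
    docs = [(1, "the quick brown fox"), (4, "The fox jumps over the lazy fox")]
    index = build_index(docs)
    for doc_id, positions in lookup(index, "fox"):
        print(doc_id, positions)
```

Storing gaps rather than absolute doc ids and positions keeps most numbers
small, so most of them fit in a single byte. A real module for 440K documents
would likely need on-disk postings and a term-identifier dictionary rather than
one in-memory dict.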
Deadlines
- A printed (and signed) copy of the design document is due in class
Monday, February 5.
- A printed (and signed) copy of the API is due in class
Monday, February 12.
- The completed module and implementation document are due in class
Friday, February 23. Please submit the module, any API revisions, and the
implementation document electronically, along with one printed (and signed)
copy of the implementation document.
Last revised: 29 January 2007, Brian D. Davison.