Project 2
Outline
- Again, arbitrary groups of mixed backgrounds
- See Project 2 Group Assignments
- Choose team name
- Need to divide up tasks, including leader/organizer, coders
- Use pair programming techniques to enable rapid development and
cross-teaching of knowledge and skills
- Data
- Same docs from a recent crawl of the UK (bonus points if you are able
to use all of them)
- In /proj/searchengines/data (15GB compressed for all docs)
- Additional space in /proj/searchengines2/
- Resources
- Existing codebase of indexers from project 1
- Any language you want
- Any supporting libraries you can find
- Run on suns (including new fast AMD-based machines)
- Grading
- Correctness (right results), quality (robust, not crashing)
- Documentation (reports, code comments)
- Presentation
- Scale -- handling full dataset (bonus points)
- Query types -- handling additional query types (bonus points)
- Tasks
- Select an existing high-quality ranking function to implement: the SMART
version of TFIDF, BM2500 (based on BM25 [2-4]), or, likely the best,
BM25F [1]
- Add anchor text to algorithm (if using TFIDF or BM2500)
- Extend indexer to build all structures needed (if not already present),
including anchor text indexing and document description data for result
listings
- Extend indexer and API to support indexing of links (both directions)
and retrieval of inlinks or outlinks given a URL
- Build web interface (similar to hw#1) that can accept multi-term
queries including phrases and inlink/outlinks of URLs
- Document everything
- Design document: as described for all projects, including
choice of ranking function, choice of underlying indexer to
use, and assignment of tasks
- Implementation document: also include actual memory and disk usage
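As a starting point for the ranking-function task above, here is a minimal sketch of plain BM25 scoring in Python. It is not BM25F and it does not handle anchor text; it only illustrates the term-frequency saturation and length normalization described in [2-4]. The function name `bm25_scores`, the in-memory list-of-tokens representation, the parameter defaults (k1 = 1.2, b = 0.75), and the "+1 inside the log" IDF variant (which keeps weights non-negative) are all assumptions made for illustration; a real system would read these statistics from the index.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (a list of token lists) against the query.

    Hypothetical helper for illustration: k1 and b are commonly used
    defaults, not values prescribed by the project.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    term_counts = [Counter(d) for d in docs]
    # Document frequency: number of documents containing each query term.
    doc_freq = {
        t: sum(1 for counts in term_counts if t in counts)
        for t in set(query_terms)
    }

    scores = []
    for tokens, counts in zip(docs, term_counts):
        # Length normalization: longer-than-average documents are penalized.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query_terms:
            tf = counts.get(term, 0)
            if tf == 0:
                continue
            n_t = doc_freq[term]
            # Assumed IDF variant: the added 1 keeps the weight non-negative.
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "quick quick fox jumps over the lazy dog".split(),
]
scores = bm25_scores(["quick", "fox"], docs)
```

In this toy collection the second document contains neither query term, so its score is zero. Extending the sketch toward BM25F would mean combining weighted per-field term frequencies (for example body, title, and anchor text) before the saturation step, as described in [1].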
Deadlines
- A printed (and signed) copy of the design document is due in class
Friday, March 16.
- The completed module and implementation document are due electronically
Sunday, April 1. Each group will present its system and demonstrate its
capabilities in class Monday, April 2, and hand in a printed and signed
copy of the implementation document.
[1] Stephen Robertson, Hugo Zaragoza, and Michael Taylor. Simple BM25
extension to multiple weighted fields. In CIKM '04: Proceedings of the
thirteenth ACM conference on Information and knowledge management, pages
42--49, New York, NY, USA, 2004. ACM Press.
[2] S. E. Robertson and S. Walker. Some simple effective approximations to
the 2-Poisson model for probabilistic weighted retrieval. In W. B. Croft
and C. J. van Rijsbergen, editors, SIGIR '94: Proceedings of the 17th
Annual International ACM SIGIR Conference on Research and Development in
Information Retrieval, pages 345--354. Springer-Verlag, 1994.
[3] See http://www.xapian.org/docs/bm25.html for some discussion and
implementation hints.
[4] S. Robertson, S. Walker, M. M. Beaulieu, M. Gatford, and A. Payne. Okapi
at TREC-4. In NIST Special Publication 500-236: The Fourth Text REtrieval
Conference (TREC-4), pages 73--96, Gaithersburg, MD, 1995.
Last revised: 28 March 2007, Brian D. Davison.