Project 3
Outline
- Self-selected groups of two or three
- See Project 3 Groups
- Choose team name
- Select task (either crawling or link analysis)
- Need to divide up tasks, including leader/organizer, coders
- Use pair programming techniques to enable rapid development and
cross-teaching of knowledge and skills
- Data
- Same docs from a recent crawl of the UK
- In /proj/searchengines/data (15GB compressed for all docs)
- Additional space in /proj/searchengines2/
- Resources
- Existing codebase of search engines from project 2
- Any language you want
- Any supporting libraries you can find
- Run on the Suns (including fast AMD-based machines)
- Grading
- Difficulty of task
- Correctness (right results), quality (robust, not crashing)
- Documentation (reports, code comments)
- Presentation
- Scale -- handling full dataset (all 8 archives)
- Peer/self-evaluations
- Audience evaluations
- Crawler Tasks
- Build a web crawler to collect web pages at Lehigh.
- Crawlers must obey robots.txt directives, and must not contact any
host more than once per second (this will significantly slow your
crawl). You must also add your email address to the User-Agent string
used by the crawler, and set this web page as the referrer. This way
any webmaster who receives requests from your crawler will know how to
contact you and can learn about this assignment.
- Include lehighsports.com and other Lehigh
pages that are not simply part of lehigh.edu.
- Prepare to handle roughly 500K pages.
- Either create WARC files or extend a project 2 implementation to
read your own file format.
- Demonstrate a working project 2 system operating over Lehigh
content (web interface).
- Provide a performance/quality comparison to the Lehigh intranet
search engine.
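The robots.txt, per-host delay, and identification requirements above might be sketched roughly as follows. This is only an illustration under stated assumptions, not a required design: the names USER_AGENT, REFERER, robots_allows, and HostThrottle are hypothetical, the email address and referrer URL are placeholders to be replaced with your own, and the actual fetching and storing of pages is left out.

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

# Placeholder values (assumptions): use your own address and the
# URL of the assignment page.
USER_AGENT = "P3Crawler/0.1 (student@example.edu)"
REFERER = "http://example.edu/project3.html"
MIN_DELAY = 1.0  # minimum seconds between requests to the same host


def request_headers():
    """Headers that identify the crawler and point back to the assignment."""
    return {"User-Agent": USER_AGENT, "Referer": REFERER}


def robots_allows(robots_txt, url, agent=USER_AGENT):
    """Check a URL against the text of an already-fetched robots.txt."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class HostThrottle:
    """Remembers the last request time per host to enforce the delay."""

    def __init__(self, min_delay=MIN_DELAY, clock=time.monotonic):
        self.min_delay = min_delay
        self.clock = clock
        self.last_hit = {}

    def wait_time(self, url):
        """Seconds to wait before this URL's host may be contacted again."""
        last = self.last_hit.get(urlsplit(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def record(self, url):
        """Note that a request to this URL's host was just issued."""
        self.last_hit[urlsplit(url).hostname] = self.clock()
```

A crawl loop would sleep for wait_time(url) seconds, issue the request with request_headers(), and then call record(url). Keeping one queue per host lets the crawler work on other hosts while one host is still inside its delay window.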
- Link Analysis Tasks
- Extend a project 2 implementation to incorporate link-based
authority ranking
- Could include any variation of HITS or PageRank (such as PHITS and
Topic-sensitive PageRank)
- Must operate on full UK dataset
- Must include tests (perhaps on toy problems) to demonstrate that
your matrix code works
- Will likely need to put all URLs in canonical form: downcase the
hostname, eliminate in-page positions (the '#' fragment), and
regularize encodings (e.g., replace %20 with a space)
- Provide a comparison of result quality with and without link analysis
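As one possible starting point for the canonicalization and toy-problem testing mentioned above, the sketch below normalizes URLs and runs basic PageRank by power iteration. It rests on several assumptions: the function names canonicalize and pagerank are hypothetical, the damping factor of 0.85 and the even redistribution of rank from pages without out-links are common conventions rather than requirements of this assignment, and the in-memory dictionary is suitable only for toy graphs, not the full UK dataset.

```python
from urllib.parse import urlsplit, urlunsplit, unquote


def canonicalize(url):
    """Lowercase scheme and host, drop the '#' fragment, decode %XX in the path."""
    parts = urlsplit(url)
    path = unquote(parts.path) or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))


def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping each page to the pages it links to."""
    nodes = set(links)
    for targets in links.values():
        nodes.update(targets)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread evenly over all pages.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        base = (1.0 - damping) / n + damping * dangling / n
        nxt = {v: base for v in nodes}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    nxt[dst] += share
        rank = nxt
    return rank
```

Two toy checks follow directly from the definition: on a three-page cycle every page should end up with rank 1/3, and on any graph the ranks should sum to 1. Tests of this kind are cheap to write and catch most indexing and normalization mistakes before the code is run at scale.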
- Document everything
- Design document: as described for all projects, including
assignment of tasks
- Implementation document: also include actual memory and disk usage
as well as a full and complete description of what was implemented
Deadlines
- Team members, high-level task, and team name are due by email by
Friday, April 6 (earlier preferred!).
- A printed (and signed) copy of the design document is due in class
Friday, April 20 (note that implementation should be well underway...)
- The completed system and implementation document are due
electronically on April 30, and each group will present its system and
demonstrate its capabilities during our final exam slot on Wednesday,
May 2 (and hand in a printed and signed copy of the implementation
document).
Last revised: 4 April 2007, Brian D. Davison.