Description (Fall 2004)
Syllabus
Instructor: Prof. Brian D. Davison
davison(at)cse.lehigh.edu
http://www.cse.lehigh.edu/~brian/
Introduction: With billions of addressable documents publicly accessible, WWW search engines continue to be fundamental to information seeking on the Web. The scale of these engines, both in content and in access, makes the algorithms, architectures, and implementations of these systems challenging. This course is designed for upper-level undergraduates and graduate students interested in learning how Web search engines function. It focuses on the technologies for storing and retrieving hypertext from large databases, with particular emphasis on the data structures and algorithms needed to build efficient search engines for the World Wide Web (WWW). Topics covered include: information retrieval (IR) models, performance evaluation, query languages and operations, properties of hypertext, crawling, indexing, searching, ranking, link analysis, parallel and distributed IR, and user interfaces. Students will participate in class projects involving both the creation and management of a large document collection on the WWW. These projects will require programming in languages such as Perl/CGI, C/C++, or Java.
Objectives: To provide a practical understanding of the design and implementation of modern WWW search engines. This objective is accomplished through a combination of lectures, discussion and analysis of published papers, and extensive hands-on programming projects.
Prerequisites: CSE 109 Systems Programming or graduate status
Recommended: One or more courses in networking, software engineering, operating systems, databases, numerical analysis, or information retrieval.
Expected Work: Homework, presentations, and group programming projects
Examinations: Two hourly midterms (no final exam)
Course catalog description: Study of algorithms, architectures, and implementations of WWW search engines. Information retrieval (IR) models; performance evaluation; query languages and operations; properties of hypertext; Web crawling, indexing, searching and ranking; link analysis; parallel and distributed IR; user interfaces.
Textbook(s): All students: Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto, Addison Wesley (1999). Additionally required for 497 students: Mining the Web: Discovering Knowledge from Hypertext Data, Chakrabarti (2003).
The course syllabus is available.
Class Notes (OpenOffice source files available upon request.)
Credits: Significant portions of these notes are derived from others, including Richard Belew, Soumen Chakrabarti, Mark Levene, and Ganesh Ramakrishnan. Thanks!
Homework/Projects
Day  Date          Topic(s)
Mon  August 23     Welcome
Wed  August 25     Background
Fri  August 27     Class Cancelled
Mon  August 30     Evaluation Example (XLS)
Wed  September 1   Text Preparation
Fri  September 3   Indexing
Mon  September 6   Indexing, continued; Vector-space model
Wed  September 8   Queries, Feedback, Compression
Fri  September 10  Finish Compression
Mon  September 13  Finish Indexing; Start Clustering
Wed  September 15  More clustering, dimension reduction
Fri  September 17  Dimension reduction
Mon  September 20  Project 1 presentations
Wed  September 22  Finish Clustering
Fri  September 24  PLSI (PPT); Topic Hierarchies
Mon  September 27  Recommender Systems; Start Supervised Learning
Wed  September 29  Supervised Learning
Fri  October 1     Bayesian Learning
Mon  October 4     Review Sample Exam 1 Questions
Wed  October 6     More review
Fri  October 8     NO CLASS - Pacing Break
Mon  October 11    Hourly Exam
Wed  October 13    Discriminative Classifiers
Fri  October 15    Semisupervised Learning
Mon  October 18    Review Exam 1; Start Social Networks
Wed  October 20    Link Analysis; PageRank
Fri  October 22    Project 2 presentations
Mon  October 25    Link Analysis; HITS
Wed  October 27    Discussed "Improved Algorithms for Topic Distillation in a Hyperlinked Environment", by Bharat and Henzinger, 1998, and "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text", by Chakrabarti et al., 1998.
Fri  October 29    Link nepotism
Mon  November 1    Peer Evaluation (project 2); Discuss "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Brin and Page, 1998, and "Topic-Sensitive PageRank", Haveliwala, 2002.
Wed  November 3    Paper presentations: Combining Link and Content Information in Web Search (Hogg) and The Missing Link (Goel, PPT)
Fri  November 5    Link analysis; measuring and modeling the Web
Mon  November 8    Paper: What's new on the Web? (Scheirer, PPT); DiscoWeb; Resource Discovery
Wed  November 10   Review Sample Exam 2 Questions
Fri  November 12   Present proposals to other groups for feedback
Mon  November 15   Hourly Exam
Wed  November 17   No class -- group meetings
Fri  November 19   Paper presentations: The Link Database (Qi, PPT) and The Google File System (Erekson, PPT)
Mon  November 22   Resource Discovery; Scaling to the Web
Wed  November 24   NO CLASS: Thanksgiving Break
Fri  November 26   NO CLASS: Thanksgiving Break
Mon  November 29   Paper presentations: Scaling Personalized Web Search (Nie, PPT) and Crawling the Hidden Web (Garcia, PPT)
Wed  December 1    Scaling to the Web; Web Crawling
Fri  December 3    Web Crawling
Sat  December 11   12-3pm Final Presentations in PL258
Announcements
Homework   Due                            Name
Project 3  November 18/23, December 6/11  CiteSeer Metadata
HW 3       November 8                     Paper Review + Exam Questions
Project 2  October 6/13/22                Text Classification/Dimension Reduction
HW 2       September 29                   Exam Review Questions
Project 1  September 9/20                 Search Engine Comparative Evaluation
HW 1       September 1                    Search Engine Interface