WWW Search Engines:
Algorithms, Architectures and Implementations

Description (Fall 2004)

Instructor: Prof. Brian D. Davison
davison(at)cse.lehigh.edu
http://www.cse.lehigh.edu/~brian/
Introduction: With billions of addressable documents publicly accessible, WWW search engines continue to be fundamental to information seeking on the Web. The scale of these engines, both in content and in access, makes the algorithms, architectures, and implementations of these systems challenging. This course is designed for upper-level undergraduates and graduate students interested in learning how Web search engines function.

This course focuses on the technologies for storing and retrieving hypertext from large databases. Particular emphasis is given to the data structures and algorithms needed to build efficient search engines for the World Wide Web (WWW). Topics covered include: information retrieval (IR) models, performance evaluation, query languages and operations, properties of hypertext, crawling, indexing, searching, ranking, link analysis, parallel and distributed IR, and user interfaces. Students will participate in class projects involving both the creation and management of a large document collection on the WWW. These projects will require programming in languages such as Perl/CGI, C/C++, or Java.

Objectives: To provide a practical understanding of the design and implementation of modern WWW search engines. This objective is accomplished through a combination of lectures, discussion and analysis of published papers, and extensive hands-on programming projects.
Prerequisites: CSE 109 Systems Programming or graduate status
Recommended: One or more courses in networking, software engineering, operating systems, databases, numerical analysis, or information retrieval.
Expected Work: Homework, presentations, and group programming projects
Examinations: Two hourly midterms (no final exam)
Course catalog description: Study of algorithms, architectures, and implementations of WWW search engines. Information retrieval (IR) models; performance evaluation; query languages and operations; properties of hypertext; Web crawling, indexing, searching and ranking; link analysis; parallel and distributed IR; user interfaces.
Textbook(s): All students: Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto, Addison Wesley (1999). Additionally required for 497 students: Mining the Web: Discovering Knowledge from Hypertext Data, Chakrabarti (2003).
Syllabus
The course syllabus is available.

Class Notes (OpenOffice source files available upon request.)
Credits: Significant portions of these notes are derived from others, including Richard Belew, Soumen Chakrabarti, Mark Levene, and Ganesh Ramakrishnan. Thanks!
Day  Date          Topic(s)
Mon  August 23     Welcome
Wed  August 25     Background
Fri  August 27     Class Cancelled
Mon  August 30     Evaluation Example (XLS)
Wed  September 1   Text Preparation
Fri  September 3   Indexing
Mon  September 6   Indexing, continued; Vector-space model
Wed  September 8   Queries, Feedback, Compression
Fri  September 10  Finish Compression
Mon  September 13  Finish Indexing; Start Clustering
Wed  September 15  More clustering, dimension reduction
Fri  September 17  Dimension reduction
Mon  September 20  Project 1 presentations
Wed  September 22  Finish Clustering
Fri  September 24  PLSI (PPT)
                   Topic Hierarchies
Mon  September 27  Recommender Systems
                   Start Supervised Learning
Wed  September 29  Supervised Learning
Fri  October 1     Bayesian Learning
Mon  October 4     Review Sample Exam 1 Questions
Wed  October 6     More review
Fri  October 8     NO CLASS - Pacing Break
Mon  October 11    Hourly Exam
Wed  October 13    Discriminative Classifiers
Fri  October 15    Semisupervised Learning
Mon  October 18    Review Exam 1
                   Start Social Networks
Wed  October 20    Link Analysis; PageRank
Fri  October 22    Project 2 presentations
Mon  October 25    Link Analysis; HITS
Wed  October 27    Discussed "Improved Algorithms for Topic Distillation in a Hyperlinked Environment", by Bharat and Henzinger, 1998, and "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text", by Chakrabarti et al., 1998.
Fri  October 29    Link nepotism
Mon  November 1    Peer Evaluation (project 2); Discuss "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Brin and Page, 1998, and "Topic-Sensitive PageRank", Haveliwala, 2002.
Wed  November 3    Paper presentations: "Combining Link and Content Information in Web Search" (Hogg) and "The Missing Link" (Goel, PPT)
Fri  November 5    Link analysis; measuring and modeling the Web
Mon  November 8    Paper: What's new on the Web? (Scheirer, PPT)
                   DiscoWeb
                   Resource Discovery
Wed  November 10   Review Sample Exam 2 Questions
Fri  November 12   Present proposals to other groups for feedback
Mon  November 15   Hourly Exam
Wed  November 17   No class -- group meetings
Fri  November 19   Paper presentations: The Link Database (Qi, PPT) and The Google File System (Erekson, PPT)
Mon  November 22   Resource Discovery
                   Scaling to the Web
Wed  November 24   NO CLASS: Thanksgiving Break
Fri  November 26   NO CLASS: Thanksgiving Break
Mon  November 29   Paper presentations: Scaling Personalized Web Search (Nie, PPT) and Crawling the Hidden Web (Garcia, PPT)
Wed  December 1    Scaling to the Web
                   Web Crawling
Fri  December 3    Web Crawling
Sat  December 11   12-3pm Final Presentations in PL258
Homework/Projects
Homework   Due                            Name
Project 3  November 18/23, December 6/11  CiteSeer Metadata
HW 3       November 8                     Paper Review + Exam Questions
Project 2  October 6/13/22                Text Classification/Dimension Reduction
HW 2       September 29                   Exam Review Questions
Project 1  September 9/20                 Search Engine Comparative Evaluation
HW 1       September 1                    Search Engine Interface
Announcements
Useful links

This page is http://www.cse.lehigh.edu/~brian/course/searchengines/
Last revised: 30 November 2004.