WWW Search Engines: Algorithms, Architectures and Implementations

Description (Fall 2003)

Instructor: Prof. Brian D. Davison
davison(at)cse.lehigh.edu
http://www.cse.lehigh.edu/~brian/
Introduction: With billions of addressable documents publicly accessible, WWW search engines continue to be fundamental to information seeking on the Web. The scale of these engines, in both content and access, makes the algorithms, architectures, and implementations of these systems challenging. This course is designed for upper-level undergraduates and graduate students interested in learning how Web search engines function.

This course focuses on the technologies for storing and retrieving hypertext from large databases. Particular emphasis is given to the data structures and algorithms needed to build efficient search engines for the World Wide Web (WWW). Topics covered include: information retrieval (IR) models, performance evaluation, query languages and operations, properties of hypertext, crawling, indexing, searching, ranking, link analysis, parallel and distributed IR, and user interfaces. Students will participate in a class project involving both the creation and management of a large document collection on the WWW. This project will require programming in languages such as Perl/CGI, C/C++, or Java.

Objectives: To provide a practical understanding of the design and implementation of modern WWW search engines. This objective is accomplished through a combination of lectures, discussion and analysis of published papers, and extensive hands-on programming projects.
Prerequisites: CSE 109 Systems Programming or graduate status
Recommended: One or more courses in networking, software engineering, operating systems, databases, numerical analysis, or information retrieval.
Expected Work: Homework, presentations, and group programming projects
Examinations: Two hourly midterms (no final exam)
Course catalog description: Study of algorithms, architectures, and implementations of WWW search engines. Information retrieval (IR) models; performance evaluation; query languages and operations; properties of hypertext; Web crawling, indexing, searching and ranking; link analysis; parallel and distributed IR; user interfaces.
Textbook(s): Mining the Web: Discovering Knowledge from Hypertext Data, Chakrabarti (2003); Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto, Addison Wesley (1999).
Syllabus
The course syllabus is available.

Class Notes
Day | Date | Topic(s)
Mon | August 25 | Welcome
Wed | August 27 | Introduction
Fri | August 29 | Web Crawling 1
Mon | September 1 | Paper: Crawling the Hidden Web (.ppt); Web Crawling 2
Wed | September 3 | Papers: Keeping up with the changing Web; The Evolution of the Web and Implications for an Incremental Crawler (.ppt)
Fri | September 5 | Paper: High-Performance Web Crawling; Parallel Crawlers (.ppt)
Mon | September 8 | Web Crawling 3
Wed | September 10 | Parsing/Indexing
Fri | September 12 | Indexing/Evaluation
Mon | September 15 | Evaluation/IR Models
Wed | September 17 | Query Types/Feedback/Index Compression
Fri | September 19 | Finish Indexing
Mon | September 22 | Finish Indexing, Start Clustering
Wed | September 24 | (Class cancelled.)
Fri | September 26 | Review for Exam, Project 2
Mon | September 29 | The Semantic Web (.ppt), presented by Prof. Heflin and recorded by A. Qasem
Wed | October 1 | Exam #1
Fri | October 3 | Exam results, Clustering
Mon | October 6 | Clustering (embeddings)
Wed | October 8 | Clustering, continued
Fri | October 10 | Pacing Break -- no class
Mon | October 13 | Clustering, continued
Wed | October 15 | Supervised learning
Fri | October 17 | Supervised learning, continued
Mon | October 20 | Supervised learning, continued
Wed | October 22 | (Class cancelled.)
Thu | October 23 | CSE Dept. speaker: Dr. Craig Nevill-Manning of Google Research
Fri | October 24 | Semisupervised learning
Mon | October 27 | Social Network Analysis
Wed | October 29 | Intuitions about eigenvectors
Fri | October 31 | Link Analysis, continued
Mon | November 3 | Discussion on "Improved Algorithms for Topic Distillation in a Hyperlinked Environment", Bharat and Henzinger, 1998, and "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text", Chakrabarti et al., 1998.
Wed | November 5 | Link nepotism
Fri | November 7 | DiscoWeb; discussion on "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Brin and Page, 1998.
Mon | November 10 | Discussion on "Topic-Sensitive PageRank", Haveliwala, 2002.
Wed | November 12 | Presentations of The Missing Link (ppt, Yuanbo Guo) and The Intelligent Surfer (ppt, Baoning Wu)
Fri | November 14 | Search engine project 3 presentations
Mon | November 17 | Link analysis, measuring and modeling the Web
Wed | November 19 | (Class cancelled.)
Fri | November 21 | Second hourly exam
Mon | November 24 | Resource Discovery
Wed | November 26 | No Class -- Thanksgiving Break
Fri | November 28 | No Class -- Thanksgiving Break
Mon | December 1 | Review exam 2; Scaling to the Web
Wed | December 3 | Paper presentations: The Link Database (Wei); Locality in search engine queries (Kevin)
Fri | December 5 | Paper presentations: The Google File System (.ppt, Kris); Scaling Personalized Web Search (Murat)
Sat | December 13 | Final project presentations (4-7pm, PL208)
Homework/Projects
Homework | Due | Name
Project 4b | December 5 | Lehigh Search Engine Project
Project 4a | November 24 | Lehigh Search Engine Project
HW 4 | November 17 | Link analysis questions
Project 3b | November 14 | Lehigh Search Engine Project
Project 3a | November 3 | Lehigh Search Engine Project
HW 3 | October 29 | Clustering, supervised classification, and semisupervised classification questions
Project 2c | October 20 | Lehigh Search Engine Project
Project 2b | October 13 | Lehigh Search Engine Project
Project 2a | October 3 | Lehigh Search Engine Project
Project 1 | September 24 | Lehigh Web Crawler
HW 2 | September 12 | Web Crawling Questions
HW 1 | September 3 | Search Engine Interface
Announcements
Useful links

This page is http://www.cse.lehigh.edu/~brian/course/searchengines/
Last revised: 16 November 2003.