WWW Search Engines: Algorithms, Architectures and Implementations

Description (Fall 2002)

Instructor: Prof. Brian D. Davison
davison(at)cse.lehigh.edu
http://www.cse.lehigh.edu/~brian/
Introduction: With billions of addressable documents publicly accessible, WWW search engines continue to be fundamental to information seeking on the Web. The scale of these engines, both in content and in access, makes the algorithms, architectures, and implementations of these systems challenging. This course is designed for upper-level undergraduates and graduate students interested in learning how Web search engines function.

This course focuses on the technologies for storing and retrieving hypertext from large databases. Particular emphasis is given to the data structures and algorithms needed to build efficient search engines for the World Wide Web (WWW). Topics covered include: information retrieval (IR) models, performance evaluation, query languages and operations, properties of hypertext, crawling, indexing, searching, ranking, link analysis, parallel and distributed IR, and user interfaces. Students will participate in a class project involving both the creation and management of a large document collection on the WWW. This project will require programming in languages such as Perl/CGI, C/C++, or Java.

Objectives: To provide a practical understanding of the design and implementation of modern WWW search engines. This objective is accomplished through a combination of lectures, discussion and analysis of published papers, and extensive hands-on programming projects.
Prerequisites: CSE 109 Systems Programming
Recommended: One or more courses in networking, software engineering, operating systems, databases, numerical analysis, or information retrieval.
Expected Work: Homework, presentations, and group programming projects
Examinations: Midterm and final exam
Course catalog description: Study of algorithms, architectures, and implementations of WWW search engines. Information retrieval (IR) models; performance evaluation; query languages and operations; properties of hypertext; Web crawling, indexing, searching and ranking; link analysis; parallel and distributed IR; user interfaces.
Textbook(s): Understanding Search Engines: Mathematical Modeling and Text Retrieval, Berry and Browne, SIAM (1999); Finding Out About, Belew, Cambridge University Press (2000); Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto, Addison Wesley (1999).
Syllabus
The course syllabus is available.

Class Notes
Day  Date          Topic(s)
Tue  August 27     Introduction
Thu  August 29     Introduction/Document Preparation
Tue  September 3   Indexing
Thu  September 5   IR Models
Tue  September 10  IR Models, cont.
Thu  September 12  Queries
Tue  September 17  Evaluation
                   Paper: SavvySearch (PowerPoint)
Thu  September 19  WWW Hypertext
                   Papers: Silk from a Sow's Ear, ParaSite
Tue  September 24  Paper: Topical Locality in the Web (PowerPoint)
                   Search Engine Scaling
Thu  September 26  Search Engine Scaling (cont.)
                   Papers: On Caching Search Engine Query Results
                   Rank-Preserving Two-Level Caching for Scalable Search Engines (PowerPoint)
Tue  October 1     Search Engine Scaling (cont.)
                   Papers: Locality in search engine queries and its implication for caching (PowerPoint)
                   Lessons from Giant-Scale Services (PowerPoint)
Tue  October 8     Midterm review
                   Paper: Server-side design principles for scalable internet systems
Thu  October 10    Midterm Exam
Tue  October 15    Project 2
                   Midterm solutions
                   WUME search engine development: Part I (PowerPoint) Part II
Thu  October 17    Parallel IR, Web Crawling
                   Paper: Parallel Crawlers (PowerPoint)
Tue  October 22    Crawling, continued
                   Papers: High-Performance Web Crawling (PowerPoint)
                   Design and Implementation of a High-Performance Distributed Web Crawler (PowerPoint)
Thu  October 24    Crawling the changing Web
                   Paper: UbiCrawler: A Scalable Fully Distributed Web Crawler
Tue  October 29    Paper: Crawling the Hidden Web
                   Project presentations: David Deschenes, Kalyan Boggavarapu
Thu  October 31    WWW Link Analysis
Tue  November 5    Link Analysis, cont.
                   Project 2 presentations
Thu  November 7    Link Analysis, cont.
                   Papers: Topic Distillation
                   Automatic Resource Compilation
Tue  November 12   Link Analysis, cont.
                   Papers: The Connectivity Server
                   Inferring Web Communities from Link Topology
Thu  November 14   Link Analysis, cont.
                   Papers: Trawling the Web for emerging cyber-communities
                   Finding related pages in the world-wide Web
Tue  November 19   Link Analysis, cont.
                   Papers: Focused Crawling
                   Finding Authorities and Hubs From Link Structures on the World Wide Web
                   Topic-Sensitive PageRank
Thu  November 21   Implementations
                   Papers: Finding What People Want: Experiences with the WebCrawler
                   Lycos: Design Choices in an Internet Search Service
                   AltaVista ranking of query results
Tue  November 26   Implementations, cont.
                   Papers: The Anatomy of a Large-Scale Hypertextual Web Search Engine
                   DiscoWeb: Applying Link Analysis to Web Search
                   Searching the Web
Tue  December 3    Search Engine Manipulation
                   Papers: Recognizing Nepotistic Links on the Web
                   When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics
Thu  December 5    Final class
                   Project 3 Presentations
Homework/Projects
Homework    Due           Name
Project 3b  December 3    Lehigh Search Engine Project
HW 4        November 21   Critique of link analysis papers
Project 3a  November 14   Lehigh Search Engine Project
Project 2b  October 31    Lehigh Search Engine Project
Project 2a  October 22    Lehigh Search Engine Project
HW 3        October 17    Critique of crawling papers
Project 1c  October 8     Lehigh Search Engine Project
HW 2        September 26  Critique of caching papers
Project 1b  September 26  Lehigh Search Engine Project
Project 1a  September 19  Lehigh Search Engine Project
HW 1        September 10  Search Engine Interface
Announcements
Useful links

This page is http://www.cse.lehigh.edu/~brian/course/searchengines/
Last revised: 4 December 2002.