Description (Fall 2002)
Syllabus
Instructor: Prof. Brian D. Davison
davison(at)cse.lehigh.edu
http://www.cse.lehigh.edu/~brian/
Introduction: With billions of addressable documents publicly accessible, WWW search engines continue to be fundamental to information seeking on the Web. The scale of these engines, both in content and in access, makes the algorithms, architectures, and implementations of these systems challenging. This course is designed for upper-level undergraduates and graduate students interested in learning how Web search engines function. This course focuses on the technologies for storing and retrieving hypertext from large databases. Particular emphasis is given to the data structures and algorithms needed to build efficient search engines for the World Wide Web (WWW). Topics covered include: information retrieval (IR) models, performance evaluation, query languages and operations, properties of hypertext, crawling, indexing, searching, ranking, link analysis, parallel and distributed IR, and user interfaces. Students will participate in a class project involving both the creation and management of a large document collection on the WWW. This project will require programming in languages such as Perl/CGI, C/C++, or Java.
Objectives: To provide a practical understanding of the design and implementation of modern WWW search engines. This objective is accomplished through a combination of lectures, discussion and analysis of published papers, and extensive hands-on programming projects.
Prerequisites: CSE 109 Systems Programming
Recommended: One or more courses in networking, software engineering, operating systems, databases, numerical analysis, or information retrieval.
Expected Work: Homework, presentations, and group programming projects
Examinations: Midterm and final exam
Course catalog description: Study of algorithms, architectures, and implementations of WWW search engines. Information retrieval (IR) models; performance evaluation; query languages and operations; properties of hypertext; Web crawling, indexing, searching and ranking; link analysis; parallel and distributed IR; user interfaces.
Textbook(s): Understanding Search Engines: Mathematical Modeling and Text Retrieval, Berry and Browne, SIAM (1999); Finding Out About, Belew, Cambridge University Press (2000); Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto, Addison Wesley (1999).
The course syllabus is available.
Class Notes
Homework/Projects
Day Date Topic(s)
Tue August 27 Introduction
Thu August 29 Introduction/Document Preparation
Tue September 3 Indexing
Thu September 5 IR Models
Tue September 10 IR Models, cont.
Thu September 12 Queries
Tue September 17 Evaluation
Paper: SavvySearch (PowerPoint)
Thu September 19 WWW Hypertext
Papers: Silk from a Sow's Ear, ParaSite
Tue September 24 Paper: Topical Locality in the Web (PowerPoint)
Search Engine Scaling
Thu September 26 Search Engine Scaling (cont.)
Papers: On Caching Search Engine Query Results
Rank-Preserving Two-Level Caching for Scalable Search Engines (PowerPoint)
Tue October 1 Search Engine Scaling (cont.)
Papers: Locality in search engine queries and its implication for caching (PowerPoint)
Lessons from Giant-Scale Services (PowerPoint)
Tue October 8 Midterm review
Paper: Server-side design principles for scalable internet systems
Thu October 10 Midterm Exam
Tue October 15 Project 2
Midterm solutions
WUME search engine development: Part I (PowerPoint) Part II
Thu October 17 Parallel IR, Web Crawling
Paper: Parallel Crawlers (PowerPoint)
Tue October 22 Crawling, continued
Papers: High-Performance Web Crawling (PowerPoint)
Design and Implementation of a High-Performance Distributed Web Crawler (PowerPoint)
Thu October 24 Crawling the changing Web
Paper: UbiCrawler: A Scalable Fully Distributed Web Crawler
Tue October 29 Paper: Crawling the Hidden Web
Project presentations: David Deschenes, Kalyan Boggavarapu
Thu October 31 WWW Link Analysis
Tue November 5 Link Analysis, cont.
Project 2 presentations
Thu November 7 Link Analysis, cont.
Papers: Topic Distillation
Automatic Resource Compilation
Tue November 12 Link Analysis, cont.
Papers: The Connectivity Server
Inferring Web Communities from Link Topology
Thu November 14 Link Analysis, cont.
Papers: Trawling the Web for emerging cyber-communities
Finding related pages in the world-wide Web
Tue November 19 Link Analysis, cont.
Papers: Focused Crawling
Finding Authorities and Hubs From Link Structures on the World Wide Web
Topic-Sensitive PageRank
Thu November 21 Implementations
Papers: Finding What People Want: Experiences with the WebCrawler
Lycos: Design Choices in an Internet Search Service
AltaVista ranking of query results
Tue November 26 Implementations, cont.
Papers: The Anatomy of a Large-Scale Hypertextual Web Search Engine
DiscoWeb: Applying Link Analysis to Web Search
Searching the Web
Tue December 3 Search Engine Manipulation
Papers: Recognizing Nepotistic Links on the Web
When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics
Thu December 5 Final class
Project 3 Presentations
Announcements
Homework Due Name
Project 3b December 3 Lehigh Search Engine Project
HW 4 November 21 Critique of link analysis papers
Project 3a November 14 Lehigh Search Engine Project
Project 2b October 31 Lehigh Search Engine Project
Project 2a October 22 Lehigh Search Engine Project
HW 3 October 17 Critique of crawling papers
Project 1c October 8 Lehigh Search Engine Project
HW 2 September 26 Critique of caching papers
Project 1b September 26 Lehigh Search Engine Project
Project 1a September 19 Lehigh Search Engine Project
HW 1 September 10 Search Engine Interface