Description (Fall 2003)
Syllabus
Instructor: Prof. Brian D. Davison
davison(at)cse.lehigh.edu
http://www.cse.lehigh.edu/~brian/
Introduction: With billions of addressable documents publicly accessible, WWW search engines continue to be fundamental to information seeking on the Web. The scale of these engines, both in content and in access, makes the algorithms, architectures, and implementations of these systems challenging. This course is designed for upper-level undergraduates and graduate students interested in learning how Web search engines function. It focuses on the technologies for storing and retrieving hypertext from large databases. Particular emphasis is given to the data structures and algorithms needed to build efficient search engines for the World Wide Web (WWW). Topics covered include: information retrieval (IR) models, performance evaluation, query languages and operations, properties of hypertext, crawling, indexing, searching, ranking, link analysis, parallel and distributed IR, and user interfaces. Students will participate in a class project involving both the creation and management of a large document collection on the WWW. This project will require programming in languages such as Perl/CGI, C/C++, or Java.
Objectives: To provide a practical understanding of the design and implementation of modern WWW search engines. This objective is accomplished through a combination of lectures, discussion and analysis of published papers, and extensive hands-on programming projects.
Prerequisites: CSE 109 Systems Programming or graduate status
Recommended: One or more courses in networking, software engineering, operating systems, databases, numerical analysis, or information retrieval.
Expected Work: Homework, presentations, and group programming projects
Examinations: Two hourly midterms (no final exam)
Course catalog description: Study of algorithms, architectures, and implementations of WWW search engines. Information retrieval (IR) models; performance evaluation; query languages and operations; properties of hypertext; Web crawling, indexing, searching and ranking; link analysis; parallel and distributed IR; user interfaces.
Textbook(s): Mining the Web: Discovering Knowledge from Hypertext Data, Chakrabarti (2003); Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto, Addison Wesley (1999).
The course syllabus is available.
Class Notes
Homework/Projects
Day  Date          Topic(s)
Mon  August 25     Welcome
Wed  August 27     Introduction
Fri  August 29     Web Crawling 1
Mon  September 1   Paper: Crawling the Hidden Web (.ppt)
                   Web Crawling 2
Wed  September 3   Papers: Keeping up with the changing Web
                   The Evolution of the Web and Implications for an Incremental Crawler (.ppt)
Fri  September 5   Paper: High-Performance Web Crawling
                   Parallel Crawlers (.ppt)
Mon  September 8   Web Crawling 3
Wed  September 10  Parsing/Indexing
Fri  September 12  Indexing/Evaluation
Mon  September 15  Evaluation/IR Models
Wed  September 17  Query Types/Feedback/Index Compression
Fri  September 19  Finish Indexing
Mon  September 22  Finish Indexing, Start Clustering
Wed  September 24  (Class cancelled.)
Fri  September 26  Review for Exam, Project 2
Mon  September 29  The Semantic Web (.ppt), presented by Prof. Heflin, and recorded by A. Qasem.
Wed  October 1     Exam #1
Fri  October 3     Exam results, Clustering
Mon  October 6     Clustering (embeddings)
Wed  October 8     Clustering, continued
Fri  October 10    Pacing Break -- no class
Mon  October 13    Clustering, continued
Wed  October 15    Supervised learning
Fri  October 17    Supervised learning, continued
Mon  October 20    Supervised learning, continued
Wed  October 22    (Class cancelled.)
Thu  October 23    CSE Dept. speaker: Dr. Craig Nevill-Manning of Google Research
Fri  October 24    Semisupervised learning
Mon  October 27    Social Network Analysis
Wed  October 29    Intuitions about eigenvectors
Fri  October 31    Link Analysis, continued
Mon  November 3    Discussed "Improved Algorithms for Topic Distillation in a Hyperlinked Environment", by Bharat and Henzinger, 1998, and "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text", by Chakrabarti et al., 1998.
Wed  November 5    Link nepotism
Fri  November 7    DiscoWeb
                   Discussion on "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Brin and Page, 1998.
Mon  November 10   Discussion on "Topic-Sensitive PageRank", Haveliwala, 2002.
Wed  November 12   Presentations of The Missing Link (ppt, Yuanbo Guo) and The Intelligent Surfer (ppt, Baoning Wu).
Fri  November 14   Search engine project 3 presentations.
Mon  November 17   Link analysis, measuring and modeling the Web
Wed  November 19   (Class cancelled.)
Fri  November 21   Second hourly exam
Mon  November 24   Resource Discovery
Wed  November 26   No Class -- Thanksgiving Break
Fri  November 28   No Class -- Thanksgiving Break
Mon  December 1    Review exam 2; Scaling to the Web
Wed  December 3    Paper presentations
                   The Link Database (Wei), Locality in search engine queries (Kevin)
Fri  December 5    Paper presentations
                   The Google File System (.ppt, Kris), Scaling Personalized Web Search (Murat)
Sat  December 13   Final project presentations (4-7pm, PL208)
Announcements
Homework    Due           Name
Project 4b  December 5    Lehigh Search Engine Project
Project 4a  November 24   Lehigh Search Engine Project
HW 4        November 17   Link analysis questions
Project 3b  November 14   Lehigh Search Engine Project
Project 3a  November 3    Lehigh Search Engine Project
HW 3        October 29    Clustering, supervised classification, and semisupervised classification questions
Project 2c  October 20    Lehigh Search Engine Project
Project 2b  October 13    Lehigh Search Engine Project
Project 2a  October 3     Lehigh Search Engine Project
Project 1   September 24  Lehigh Web Crawler
HW 2        September 12  Web Crawling Questions
HW 1        September 3   Search Engine Interface