WWW Search Engines Schedule (2007)

Class Notes (OpenOffice source files available upon request.)
Credits: Significant portions of these notes are derived from others, including Richard Belew, Soumen Chakrabarti, Mark Levene, and Ganesh Ramakrishnan. Thanks!
Key: IIR = Introduction to Information Retrieval; Levene = An Introduction to Search Engines and Web Navigation; MIWch4 = Modeling the Internet and the Web, chapter 4; MIR = Modern Information Retrieval; MtW = Mining The Web; USE = Understanding Search Engines; FOA = Finding Out About; MG = Managing Gigabytes; IR = Information Retrieval.
Day Date Topic(s) Required Readings Suggested Readings
Mon Jan 15 Welcome IIR 1
  • Finding What People Want: Experiences with the WebCrawler, Pinkerton, 1994
  • Lycos: Design choices in an internet search service, Mauldin, 1997
  • Levene 1; MIR 1; MtW 1; FOA 1; IR 1
    Wed Jan 17 Overview
  • As we may think, Vannevar Bush, 1945
    Fri Jan 19 Evaluation IIR 8 Levene 2, 5.4; MIR 3; MtW 3.2.1; FOA 4.3; MG 4.5; USE 6.1; IR 7
  • SavvySearch, Howe and Dreilinger, 1997
    Mon Jan 22 Finish Evaluation; Overview of Indexing Process   Levene 4.1-4.6
    Wed Jan 24 Text Preparation IIR 2 Levene 5.1; MIR 6.1-6.2, 7.1-7.2; USE 2.1-2.4; FOA 2.2-2.4; MG 3.7
    Fri Jan 26 Indexing IIR 4 MIR 8.1-8.3; MtW 3; FOA 4; MG 3.1-3.2, 3.5-3.6; USE 2.5
    Mon Jan 29 Inverted Indices
    Wed Jan 31 Vector-space model; Zipf's law; term weightings IIR 7 MIR 2.1-2.5; MG 4.4; USE 3
    Fri Feb 02 Compression IIR 5; MIWch4 4.1-4.3
  • AltaVista ranking of query results, van Eylen, 1998.
  • MIR 7.4-7.6; MtW 3.1.3; MG 3.3-3.4, 2, 5, 9
    Mon Feb 05 Finish compression; Queries; Feedback IIR 9 Levene 6.4.3; MIR 4, 5, 10.5; FOA 4.2; MG 4.2-4.3; USE 5, 6.2
    Wed Feb 07 Finish Indexing
    Fri Feb 09 Clustering IIR 16-18; MIWch4 4.5, 4.8 MIR 5.3.1, 2.7.2; MtW 4; MG 4.6
    Mon Feb 12 Clustering/Dimension Reduction
    Wed Feb 14 Guest speaker: Dr. Marc Najork, Microsoft Research (Maginnes 101)
    Fri Feb 16 LSI
  • Indexing by Latent Semantic Analysis, Deerwester et al., 1990
    Mon Feb 19 PLSI
  • Probabilistic Latent Semantic Indexing, Hofmann, 1999. (Required only for 445 students.)
  • Levene 6.1
  • Informed Projections, Cohn, 2002.
    Wed Feb 21 Collaborative Filtering, Supervised Learning IIR 13-15 Levene 3, 6.1, 9.4; MIR 2.8; MtW 5; MIWch4 4.6
  • Amazon.com recommendations: item-to-item collaborative filtering, Linden, Smith, and York, 2003.
  • Clustering Methods for Collaborative Filtering, Ungar and Foster, 1998.
    Fri Feb 23 Supervised Learning
    Mon Feb 26 Bayesian Classification, Review Sample Exam 1 Questions
    Wed Feb 28 Short project presentations, Bayesian Networks, Discriminative Classifiers
    Fri Mar 2 Finish project presentations
    Mon Mar 05 NO CLASS - Spring Break
    Wed Mar 07 NO CLASS - Spring Break
    Fri Mar 09 NO CLASS - Spring Break
    Mon Mar 12 Discriminative classifiers, Semi-supervised learning
  • Silk from a Sow's Ear: Extracting Usable Structures from the Web, Pirolli, Pitkow, and Rao, 1996 (ACM PDF)
  • ParaSite: Mining Structural Information on the Web, Spertus, 1997 (PDF)
  • MtW 6
  • Knowing a Web Page by the Company It Keeps, Qi and Davison, 2006.
    Wed Mar 14 Hourly Exam
    Fri Mar 16 Start Social Networks
  • Content and Link Structure Analysis for Searching the Web, Efe, Raghavan, and Lakhotia, 2004.
  • Authoritative Sources in a Hyperlinked Environment, Kleinberg, 1997-1999.
  • The Anatomy of a Large-Scale Hypertextual Web Search Engine, Brin and Page, 1998
  • Levene 5.2, 6.4.4, 9.1, 9.2; MtW 7; USE 7
  • The PageRank Citation Ranking: Bringing Order to the Web, Page et al., 1998
  • Finding Related Pages in the World Wide Web, Dean and Henzinger, 1999.
    Mon Mar 19 Review Exam 1
    Continue with social networks - PageRank
     
  • Learning to Probabilistically Identify Authoritative Documents, Cohn and Chang, 2000.
  • What is this Page Known for? Computing Web Page Reputations, Rafiei and Mendelzon, 2000.
  • SALSA: The Stochastic Approach for Link-Structure Analysis, Lempel and Moran, 2001.
  • SimRank: A Measure of Structural-Context Similarity, Jeh and Widom, 2003
  • Searching the Web, Arasu et al., 2001
  • Finding Authorities and Hubs From Link Structures on the World Wide Web, Borodin et al., 2001.
  • Link Analysis, Eigenvectors and Stability, Ng et al., 2001.
  • When Experts Agree: Using Non-Affiliated Experts to Rank Popular Topics, Bharat and Mihaila, 2001
  • Ranking the Web Frontier, Eiron et al., 2004
    Wed Mar 21 Link Analysis: HITS
  • Improved Algorithms for Topic Distillation in a Hyperlinked Environment, Bharat and Henzinger, 1998
  • Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text, Chakrabarti et al., 1998.
  • DiscoWeb: Applying Link Analysis to Web Search, Davison et al., 1999
    Fri Mar 23 Discuss "Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text", by Chakrabarti et al., 1998.
    Mon Mar 26 Discuss "Improved Algorithms for Topic Distillation in a Hyperlinked Environment", by Bharat and Henzinger, 1998
    Link nepotism
  • Recognizing Nepotistic Links on the Web, Davison, 2000
    Wed Mar 28 Discuss "The Anatomy of a Large-Scale Hypertextual Web Search Engine", Brin and Page, 1998
    Paper presentation (Kar and Wang): "Topic-Sensitive PageRank"
  • Topic-Sensitive PageRank, Haveliwala, 2003. (Shorter, original conference version, 2002)
  • The Missing Link - A Probabilistic Model of Document Content, by Cohn and Hofmann, 2001. (Required only for 445 students.)
    Fri Mar 30 Paper presentation (Prabhakar and Smith): "Combining Link and Content Information in Web Search"
    Guest presentation (Lan Nie): "Topical Link Analysis for Web Search"
  • Combining Link and Content Information in Web Search, Richardson and Domingos, 2004 (Original conference version: The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank, 2002)
  • Topical Link Analysis for Web Search, Nie, Davison, and Qi, 2006.  
    Mon Apr 02 Project 2 presentations
    Wed Apr 04 Last project 2 presentation
    Paper presentation (Moukhine and Wojciechowski): "Detecting Spam Web Pages through Content Analysis"
  • Detecting Spam Web Pages through Content Analysis, Ntoulas et al., 2006
  • What's New on the Web? The Evolution of the Web from a Search Engine Perspective, Ntoulas et al., 2004
  • Levene 5.3, 9.6; MtW 7
  • Graph structure in the web, Broder et al., 2000
  • Topical Locality in the Web, Davison, 2000
  • Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay, Bar-Yossef et al., 2004
    Fri Apr 06 Link analysis
    Paper presentation (Deak and Bhandari): "What's new on the Web?"
     MtW 8
  • Inferring Web Communities from Link Topology, Gibson et al., 1998.
  • Focused crawling: a new approach to topic-specific Web resource discovery, Chakrabarti et al., 1999.
  • Trawling the web for emerging cyber-communities, Kumar et al., 1999.
    Mon Apr 09 Modeling the Web
    Paper presentation (Moukhine and Wojciechowski): "The Google File System"
  • The Link Database: Fast Access to Graphs of the Web, Randall et al., 2001.
  • The Google File System, Ghemawat et al., 2003.
  • MapReduce: Simplified Data Processing on Large Clusters, Dean and Ghemawat, 2004.
  • The Connectivity Server: fast access to linkage information on the Web, Bharat et al., 1998.
    Wed Apr 11 Paper presentation (Brendan Melville): "MapReduce: Simplified Data Processing on Large Clusters"
    Review Sample Exam 2 Questions
    DiscoWeb
      
    Fri Apr 13 No class -- inauguration of university president
    Mon Apr 16 Hourly Exam
    Wed Apr 18 Resource Discovery
  • Scaling Personalized Web Search, Jeh and Widom, 2003.
  • Levene 4.7
  • A Web caching primer, Davison, 2001
  • Lessons from Giant-Scale Services, Brewer, 2001. (Draft; IEEE published version)
  • Locality in search engine queries and its implication for caching, Xie and O'Hallaron, 2002
  • Efficient Computation of PageRank, Haveliwala, 1999.
  • On Caching Search Engine Query Results, Markatos, 2000
  • Rank-preserving two level caching for scalable search engines, Saraiva et al., 2001
  • Server-side design principles for scalable internet systems, Roe and Gonik, 2002 (Local copy)
  • Building a Distributed Full-Text Index for the Web, Melnik et al., 2000.
  • Optimized Query Execution in Large Search Engines with Global Page Ordering, Long and Suel, 2003.
  • ODISSEA: A Peer-to-Peer Architecture for Scalable Web Search and Information Retrieval, Suel et al., 2003.
  • Optimizing Result Prefetching in Web Search Engines With Segmented Indices, Lempel and Moran, 2002.
  • Extrapolation Methods for Accelerating PageRank Computations, Kamvar et al., 2003.
  • Exploiting the Block Structure of the Web for Computing PageRank, Kamvar et al., 2003.
  • Adaptive Methods for the Computation of PageRank, Kamvar et al., 2003.
  • An Analytical Comparison of Approaches to Personalizing PageRank, Haveliwala et al., 2003.
  • Mining the Space of Graph Properties, Jeh and Widom, 2003.
  • The Second Eigenvalue of the Google Matrix, Haveliwala and Kamvar, 2003.
  • The WebGraph Framework I: Compression Techniques, Boldi and Vigna, 2004
  • Web search for a planet: The Google cluster architecture, Barroso et al., 2003.
  • Failure Trends in a Large Disk Drive Population, Pinheiro, Weber and Barroso, 2007.
    Fri Apr 20 Scaling to the Web
    Paper presentation (Wang and Kar): "Scaling Personalized Web Search"
  • Three-Level Caching for Efficient Query Processing in Large Web Search Engines, Long and Suel, 2005
    Mon Apr 23 Paper presentation (Smith and Prabhakar): "Three-Level Caching for Efficient Query Processing in Large Web Search Engines"
    Paper presentation (Bhandari and Deak): "Crawling the Hidden Web"
  • Crawling the Hidden Web, Raghavan and Garcia-Molina, 2001
  • Levene 4.6
    Wed Apr 25 Web Crawling   MIR 13.4.5; MtW 2
  • Keeping up with the changing Web, Brewington and Cybenko, 2000
  • The Evolution of the Web and Implications for an Incremental Crawler, Cho and Garcia-Molina, 2000
  • High-Performance Web Crawling, Najork and Heydon, 2001
  • Parallel Crawlers, Cho and Garcia-Molina, 2002
  • Design and Implementation of a High-Performance Distributed Web Crawler, Shkapenyuk and Suel, 2001
  • UbiCrawler: A Scalable Fully Distributed Web Crawler, Boldi et al., 2002
    Fri Apr 27 Finish Crawling
    Wed May 2, 4-7pm Final Presentations in Maginnes 102

    This page is http://www.cse.lehigh.edu/~brian/course/2007/searchengines/schedule.html
    Last revised: B. Davison, 26 April 2007.