Sample student-generated questions for Spring 2007.
===================================================

1 What is the main difference between the Intelligent Surfer and the original PageRank algorithm?
2 What is the difference between the Intelligent Surfer and Topic-Sensitive PageRank?
3 What were the three problems that Bharat and Henzinger encountered with Kleinberg's algorithm?
4 What is a clique attack?
5 Describe at least 5 of the benefits gained from in-depth analysis of a page, instead of simply indexing the words.
6 What is a hub, and what is an authority?
7 What is web spamming? Describe how a webmaster of an online business might try to exploit search engines, and two ways the search engine can respond.
8 What were some of the new ideas Brin and Page introduced in their paper about their new search engine, Google?
9 Explain the main difference between the approach described in "Topic-Sensitive PageRank" and the one described in "Combining Link and Content Information in Web Search."
10 Name some of the features (at least 5) detailed in "Detecting Spam Web Pages through Content Analysis." Provide an explanation if necessary.
11 Name some features used to distinguish spam web pages, and the reasoning behind them, from "Detecting Spam Web Pages through Content Analysis."
12 Describe the general architecture of the Google File System.
13 What is a link-based ranking strategy? Why are link-based ranking strategies used?
14 What is search engine spamming? What are some ways of combating search engine spamming?
15 What is the problem with the traditional web link analysis used in search engines? How, according to Lan Nie et al. (in the paper "Topical Link Analysis for Web Search"), does topical link analysis help overcome this problem?
16 Mention 2 applications of the topical model of link analysis.
17 How is Topic-Sensitive PageRank (TSPR) different from the topical link analysis model used in PageRank?
18 Describe the method used to compute PageRank and one modification that has been presented in class.
19 Describe the user-level paradigm and the run-time library process of MapReduce.
20 Explain the two major system components of a GFS cluster and their roles.
21 On a GFS cluster, explain how data flow is decoupled from control flow, and explain why this is beneficial.
22 What are the differences between PageRank and HITS?
23 List several ways to avoid two-party nepotism.
24 Explain the probabilistic interpretation of Query-Sensitive PageRank in terms of the Random Surfer Model.
25 What are the shortcomings of the ranking algorithms PageRank and HITS?
26 What method does Google employ for saving space while storing forward and inverted indices? ["Anatomy of a Large-Scale Hypertextual Web Search Engine" -- Sergey Brin and Lawrence Page]
27 What are the two metrics used to measure the degree of change for a page? ["What's new on the web?" -- Alexandros Ntoulas, Junghoo Cho and Christopher Olston]
28 What is the advantage of using BM25F (without the usual RSJ relevance weight for terms) over TF-IDF for evaluating the importance of a document?
29 The two most important ranking algorithms used in search engines are the term-frequency-based importance score and linkage-based, PageRank-like algorithms. What is the drawback of each algorithm? What is your recommendation for improvement, if any?
30 Compare and contrast HITS-based link analysis algorithms and PageRank-based link analysis algorithms.
31 Describe one major difference between standard file systems and the Google File System as described in the presented paper.

==========================================================================

Additional sample exam questions.
=================================

1 Mathematically define Page and Brin's PageRank.
2 Mathematically define Kleinberg's HITS method for calculating hubs and authorities.
3 Define nepotistic links. Do nepotistic links represent a problem? Why or why not?
4 How does web page classification differ from traditional text classification?
5 HITS and PageRank have each been modified in further work. For each of the link analysis algorithms below, list the important differences from the algorithm on which it is based.
   a) Haveliwala's Topic-Sensitive PageRank
   b) Chakrabarti et al.'s ARC (Automatic Resource Compilation)
   c) Bharat and Henzinger's Topic Distillation work
   d) Davison et al.'s DiscoWeb
6 What are the basic (and often unstated) assumptions of link analysis?
7 What is the random surfer model for PageRank?