WWW Search Engines
Fall 2004

Sample student-submitted questions for exam #1.
===============================================

[Questions marked with an * are considered good ones by the instructor.]

1) What is an effective way of dealing with a subset of the corpus that is dynamic?

2) [*]Explain Zipf's law.

3) Given the data on sheet #1 of the interpolated-precision Excel spreadsheet example provided in the lecture (k, relevancy), calculate the 11-point average precision for the search engine results.

4) What does tf-idf stand for, and in what context is it referenced?

5) Name the three most general types of queries. Describe two ways a search system can facilitate refinement of any of these queries.

6) [*]Describe the difference between bottom-up clustering and top-down clustering.

7) Given the following data (data similar to the Excel spreadsheet), fill in the 11-point interpolation chart.

8) What is the difference between data retrieval (database) and information retrieval (e.g., web search)?

9) [*]Why are trie structures useful in search engines?

10) [*]What is stemming, and what are the advantages and disadvantages of using it in a search engine?

11) [*]Briefly explain the purpose of Elias's Gamma Code, and use it to encode the following gap numbers: 1, 9, 10, 21.

12) [*]What are stopwords, and how are they treated by IR systems?

13) [*]What are two general approaches to text compression?

14) [*]What is user relevance feedback? What are its advantages?

15) [*]Build a suffix tree for the terms "pearl" and "apple".

16) Given a particular search engine result ranking (with relevancy judgements), determine the three-point average interpolated precision.

17) [*]What is the difference between hard clustering and soft clustering, and when is one more useful than the other?

18) What are the fundamental differences between recall and precision, and which is the more desirable characteristic?

19) [*]What is the difference between polysemy and synonymy, and why are these issues problematic for term-based IR techniques?

20) What are the three classes of classic or traditional IR models, and how do they compare (which is the best, which is the worst, and why)?

21) [Multiple parts]
    a) What is the goal of clustering in the field of search engines?
    b) [*]Consider two clustering approaches, HAC and K-means. What is the difference between them?
    c) Generate a dendrogram for a given clustering of documents.
    d) Use K-means to cluster a given set of documents (points in a space).