EzSearch: Design Document

CSE 445 – WWW Search Engines - Spring 2007 – Project 2

Authors: Fabiana Prabhakar, Shruti Bhandari & Suyun Gu

 

 

1.    Introduction

 

Search for anything using your favorite crawler-based search engine. Nearly instantly, the search engine will sort through the millions of pages it knows about and will present you with ones that match your topic. The matches will even be ranked, so that the most relevant ones come first. Our second project consists of extending the functionality of the indexer (project1) and incorporating text-based multi-term retriever and ranking system with a Web (CGI) interface. The remainder of this document is organized as follows: Section 2 presents the indexed data that will be used for retrieval. Section 3 lists the resources chosen for the implementation. Section 4 describes EzSearch architecture. Section 5 lists the changes to the indexer in order to support ranking. Section 6 illustrates how the team is organized. Section 7 lists the documentation and Section 8 lists the references.

 

 

2.    Data

 

We will use the data parsed and indexed by the project lswww, which in turn used a subset of documents from a recent crawl of the UK. The indexed data can be found at /proj/searchengines2/project1/lswww/words.

 

 

3.    Resources

 

There is a large variety of programming languages that could be taken in consideration for this project. The following sections list the languages we chose, explain why we chose them.

 

 

 

3.1. Java

 

Java will be the core language for our implementation. Our choice was mainly based on the fact that Java was the language used for lswww, the implementation we are going to build upon. We also considered our group’s past developing experience in Java as a factor in our decision.

  

 

3.2. Perl

 

For the web interface implementation, we decided that Perl would be the best language for that task as a lot of online support is available for the PERL-CGI script and also, we have a basic interface implemented already as a part of Homewok1. 

The CGI basically displays an editable field where the user will type the query. It also displays the results received from the server.

 

4.    lswww-2 Architecture

 

4.1. Overall Architecture

 

 

 

4.2. The CGI Interface

 

The CGI is a Perl interface that allows a user to send a query to the search engine and check the results. The query can be composed of multiple terms. The Retriever will be responsible to support multiple terms query (phrases).

 

4.3. Server

 

The server receives a request from the client (CGI) and calls the retriever module. The retriever will retrieve the documents and use the ranker module to rank the results. The results are sent back to the client.

 

 

4.4. Retriever

 

The retriever module will call the ranking module to see which pages have the highest weights, or to see which pages have exact phrase matches. The top 100 of these results will be ordered and sent to the server.

 

 

4.5. Ranker

 

After a considerable discussion, we concluded that we would go in with the BM25F (BM-Best Match) ranking function. The reason for this decision was that we wanted to make use of the term weights and the term frequency to contribute towards the ranking of the result set of documents relevant for the given query.

 

We will also give an exact phrase match a higher priority then any other ranking if specified by the user. The user may specify that they wish to have an exact phrase match simply by putting quotes around their query. This method will be implemented by first checking if the first word in the query exists in a given document, then checking if the second word in the query exists right after an instance of the first word, and the process will be repeated for every word in the query. If quotes are not used, then this check will not be done and ranking will be done solely on the BM25F method.

 

BM25F [1]

 

BMF25F is an algorithm to improve web search ranking by incorporating user behavior. Document structure (or fields), such as the title, headings and the anchor text have been shown to be effective in Web Information Retrieval. The linear combination of scores, which had been the approach, mostly used for the combination of fields, is difficult to interpret due to the non-linear relation between the scores and the term frequencies in each of the fields. In addition, the length normalization that should be applied to each field depends on the nature of the field. Zaragoza et al.[1] introduced a field-based version of BM25, called BM25F, which applies length normalization and weighting of the fields independently. This is an extension of BM25.

 

The indexer will be extended so that the BM25F values for each term will be pre-calculated during the indexing process.

 

 

5.    lswww

 

lswww is the implementation we are going to build upon. We have chosen this indexer because it appears to be the most robust one out of the four available, as well as the fact that it is modularized so it will be easy to make modifications to it. It also already supports many of the capabilities that are needed, such as weighting. lswww is also known to be capable of performing the indexing on at least one of the compressed files without any flaws or crashes.

 

 

The main features that will be added to the lswww indexing are:

 

·        Creation of a Links File: used to provide inlink/outlink search based on the anchor text information.

·        Adjust weights calculation that is stored in the postings list for further use by the Ranker.

·        Adding the BM25F values to the postings list.

 

 

6.    Assignment of Tasks

 

 

All the members will participate in all the modules. Every module has a focal point: the member who is mainly responsible for making sure the tasks are performed and the best results are achieved.

Every member will closely participate in the ranking implementation, what we consider the main module for this project.

 

Fabiana Prabhakar

·        Team Leader;

·        Focal point for: client and server.

 

Shruti Bhandari

·        Focal point for: documentation and ranking

 

Suyun Gu

·        Focal point for: javadoc and retrieval.

·        Special focus on the challenges of phase searching.

 

7.    Documentation

 

7.1. Design document

 

http://www.cse.lehigh.edu/~sug2/ezsearch

 

 

7.2. Implementation document

 

Future documentation.

 

 

8.    References

 

[1]   S. Robertson, H. Zaragoza and M. Taylor. “Simple BM25 Extension to multiple Weighted Fields”. CIKM’04, November 8-13, 2004, Washington, DC, USA.

 

 

 

9.    Authors’ Signatures

 

 

________________                                _________________

 Fabiana Prabhakar                                      Shruti Bhandari

 

 

________________                     

                                             Suyun Gu