EzSearch: Implementation Document

CSE 445 – WWW Search Engines - Spring 2007 – Project 2

Authors: Fabiana Prabhakar, Shruti Bhandari & Suyun Gu

 

 

1.   Introduction

 

Search for anything using your favorite crawler-based search engine. Nearly instantly, the search engine will sort through the millions of pages it knows about and will present you with ones that match your query. The matches will even be ranked, so that the most relevant ones come first. Our second project consists of extending the functionality of the indexer (project1) and incorporating text-based multi-term retriever and ranking system with a Web (CGI) interface. The remainder of this document is organized as follows: Section 2 presents the indexed data that will be used for retrieval. Section 3 lists the resources chosen for the implementation. Section 4 describes EzSearch architecture. Section 5 lists the changes to the indexer in order to support ranking. Section 6 provides the commands required to run the system. Section 7 illustrates how the team is organized. Section 8 lists the documentation and Section 9 lists the references.

 

 

2.   Data

 

We will use the data parsed and indexed by the modified version of the project lswww, which in turn uses a subset of documents from a recent crawl of the UK. The indexed data can be found at /proj/searchengines2/project2/EzSearch/words. We were able to parse and index the terms from 85000 documents.

 

 

3.   Resources

 

There is a large variety of programming languages that could be taken in consideration for this project. The following sections list the languages we chose, explain why we chose them.

 

 

 

3.1. Java

 

Java will be the core language for our implementation. Our choice was mainly based on the fact that Java was the language used for lswww, the implementation we are going to build upon. We also considered our group’s past developing experience in Java as a factor in our decision.

  

3.1.1      Libraries

 

·      Tidy.jar: transform an HTML file into a well formed HTML file.

 

 

3.2. Perl

 

For the web interface implementation, we decided that Perl would be the best language for that task as a lot of online support is available for the PERL-CGI script and also, we have a basic interface implemented already as a part of Homework1. The basic interface was properly enhanced in order to implement the server protocol. 

The CGI basically displays an editable field where the user will type the query. It also displays the results received from the server. See Figure 1for more details.

 


 

4.   EzSearch Architecture

 

4.1. Overall Architecture

 

Figure 1 - EzSearch Overall Architecture

 

4.2. Persistent Storage

 

The index is stored on the disk as follows.

 

4.2.1. Posting List Structure

 

docID

d'

Positions

1

10

pos1,pos2,pos3

2

20

pos1

3

11

pos1,pos2

4

12

pos5

 

This structure will be implemented using a 3-level HashMap, docID as the key for the outermost HashMap, term as the key for the 2nd level HashMap, and the innermost HashMap containing the posting list for the term that will be mapped using the field ID as the key.

This information will be periodically flushed to disk. For each term, a file is created following lswww[2] implementation.

 

4.2.2. Documents List Structure

 

docID

URL

Title

dl

1

http://java.sun.com/j2se/1.3/docs/api/

Title 1

100

2

http://www.lehigh.edu

Title 2

51

 

The documents list will hold the list of URLs processed by the parser.

This information will be periodically flushed into the file doc_index.txt.

 

4.2.3. Inlinks File

 

Source Document

Target Document

1

2

2

3

 

The file inlinks.txt is generated in order to store the page’s inlinks. It is used during inlinks retrieval for a document.

 

4.2.4. Outlinks File

 

Target Document

Source Document

2

1

3

2

 

The file outlinks.txt is generated in order to store the pages outlinks. It is used during outlinks retrieval for a document.

 

4.2.5. Average Document Length File

 

The avdl file is generated by the indexer. After parsing every document, the document length is accumulated and at the end of the entire parsing, this sum is divided by the total number of documents parsed to obtain the average document length.

 

4.2. The CGI Interface

 

The CGI is a Perl interface that allows a user to send a query to the search engine and check the results.

 

Figure 2 - CGI Interface

 

There are four kinds of possible queries:

·      Multi-term: query composed of one or more terms where the order that the terms appear is not relevant. In order to rank these queries, we are using the BM25F algorithm.

·      Phrase: query composed of multi terms surrounded by quotes. For such queries we will check term proximity using the positions within documents in order to calculate relevance.

·      Inlinks: returns the pages that contain links to the given URL/docID. This kind of query is answered using the inlinks index.

·      Outlinks: returns the pages that the given URL/docID points to. This kind of query is answered using the outlinks index.

 

The message sent to the server has the following format:

<query_type> <query>

 

 

4.3. Server

 

The server keeps an offset to the documents list in memory using a HashMap where the key is the docID. It also contains a second HashMap in order to be able to translate a URL to a docID during the inlink/outlink search.

The server receives a request from the client (CGI) and calls the retriever module. If the search is for in/outlinks, the retriever accesses the in/outlinks files and send the results to the client. If the search is a multi-term or a phrase search, the retriever will retrieve the documents and use the ranker module to rank the results. The results are sent back to the client.

 

The response sent to the client has the following format:

<URL1>;<Title1>\n<URL2>;<Title2>\n…<URLN>;<TitleN>

 

 

4.4. Retriever

 

The retriever module will call a specific method for each type of query offered by the interface.

 

4.4.1. Phrase Search

 

The user may specify that he/she wishes to have an exact phrase match simply by putting quotes around the query and selecting query type terms. The Ranker is responsible to calculate the proximity ranking for phrase search. Please refer to the ranker implementation for further information.

A HashMap, with docID as the key, is used to store the document list for each term in the query. After the document lists for all the terms are retrieved, the docIDs which do not contain all the terms are removed. The resultant HashMap has only those docIDs which have all the terms. This set is then checked to see if all the terms are in the same sequence and at the same distance from each other as in the query. The documents with a mismatch are removed from the list. The final list of docIDs is then sorted using the Ranker.

 

4.4.2. Multi-Terms Search

 

If quotes are not used for query type terms, the ranking will be done solely based on the BM25F method. The BM25F algorithm was implemented within the Ranker Module. Please refer to the ranker implementation for further information.

 

4.4.3. Inlinks Search

 

The inlinks information is stored in the file inlinks.txt. The retriever reads from this file and returns the list of inlinks’s URL and title to the CGI.

 

4.4.4. Outlinks Search

 

The outlinks information is stored in the file outlinks.txt. The retriever reads from this file and returns the list of outlinks’s URL and title to the CGI.

 

 

4.5. Ranker

 

The ranker uses the BM25F algorithm to calculate the weights of each of the documents in the resultant set of documents obtained from the retriever.

 

4.5.1. BM25F [1]

 

BMF25F is an algorithm to improve web search ranking by incorporating field types of the text in the documents. Document structure (or fields), such as the title, headings and the anchor text have been shown to be effective in Web Information Retrieval. The linear combination of scores, which had been the approach, mostly used for the combination of fields, is difficult to interpret due to the non-linear relation between the scores and the term frequencies in each of the fields. In addition, the length normalization that should be applied to each field depends on the nature of the field. Zaragoza et al.[1] introduced a field-based version of BM25, called BM25F, which applies length normalization and weighting of the fields independently. This is an extension of BM25.

 

The table below illustrates when and how we calculate the values necessary to compose the BM25F ranking score. The phase column lists the phase when the formula is calculated. The storage points how we store such values during the process.

 

 

Phase

Formula

Storage

Parser/Indexer

d’ = ∑ (f=1 to k) vf . d[f]

Postings List

Parser/Indexer

dl = document length

Documents List

Parser/Indexer

avdl = average doc length

avdl file

Retriever

Wj(d,C) and W(d,q,C)[1]

n/a

Table 1 - BM25F Calculations and Storage Requirements

 

 

The indexer was extended so that most of the BM25F values for each term were pre-calculated during the indexing process.

In addition to using the BM25F algorithm to generate weights for each individual term when multiple terms are involved, the ranker has an additional method for handling it. The individual weights for each term are summed together, while keeping track of the missing terms. At the end of summing the weights, the total weight is divided by (1+n), where n is the number of terms missing. Instead of outright eliminating a document that is missing one or more terms in a multi-term query, it is instead penalized based on how many terms it is missing. This will result in better recall without too much of a negative impact on precision. In addition, if one of the terms in the query is not in any documents, then it will effectively be ignored instead of having the query return nothing. After the final total score for every document has been computed, the documents are sorted according to weight with an efficient merge sort algorithm.

 

 

 

5.   lswww

 

lswww is the implementation that we built upon. We chose this indexer because it stores the field type information of the text being parsed which is a major requirement for implementing  the BM25F ranking function, as well as the fact that it is modularized so it will be easy to make modifications to it. It also already supports many of the capabilities that are needed, such as listing of positions of terms.

 

The main features that will be added to the lswww indexing are:

 

·      Creation of inlinks and outlinks files: used to provide inlink/outlink search based on the anchor text information.

·      Addition of the weighted sum of the term frequencies in a document, d’, to the postings list.

·      Addition of the term positions to the postings list in order to offer phrase search.

·      Modifications to the doc_index file and the term files to accommodate the storage of pre-calculated values of some of the terms required in the BM25F formula.

·      Creation of an offset file for the documents list in order to speed up the search by reducing the disk I/O.

 

 

 

6.   Running the System

 

To run the system, the following steps must be executed:

 

·      Parser

 

o      Parse the archive: java EzSearch/ezsearch -build summary0-400.warc.gz

o      Sort links file: sortFiles.sh

 

·      Start the server (make sure you use xena to run the server)

 

o      java EzSearch/Server start

 

 

·      Send a query though the CGI interface

 

o      http://wwwtest.cse.lehigh.edu/~ffp206/EzSearch.cgi

 

 

7.   Assignment of Tasks

 

 

Every module has a focal point: the member who was mainly responsible for making sure the tasks were performed and the best results achieved.

Every member closely participated in the ranking implementation, what we considered the main module for this project.

 

Fabiana Prabhakar

·      Team Leader;

·      Focal point for: client, server and javadoc.

 

Shruti Bhandari

·      Focal point for: documentation and ranking

 

Suyun Gu

·      Special focus on the challenges of  multi-term searching.

 

 

 

8.   Documentation

 

8.1. Design document

 

http://www.cse.lehigh.edu/~sug2/ezsearch

 

8.2. Javadoc

 

http://wwwtest.cse.lehigh.edu/~ffp206/ezsearch/doc/

 

 8.3. Implementation Document

 

http://www.cse.lehigh.edu/~ffp206/ezsearch/implementation.htm

 

 

9.   References

 

[1]  S. Robertson, H. Zaragoza and M. Taylor. “Simple BM25 Extension to multiple Weighted Fields”. CIKM’04, November 8-13, 2004, Washington, DC, USA.

[2]  lswww Implementation Document: http://wwwtest.cse.lehigh.edu/~ffp206/lswww.htm


 



[1] Please refer to [1] for the exact formulas.