1. Introduction
Search for anything using your favorite crawler-based search engine. Nearly instantly, the search engine will sort through the millions of pages it knows about and will present you with ones that match your topic. The matches will even be ranked, so that the most relevant ones come first. Our second project consists of extending the functionality of the indexer (project1) and incorporating text-based multi-term retriever and ranking system with a Web (CGI) interface. The remainder of this document is organized as follows: Section 2 presents the indexed data that will be used for retrieval. Section 3 lists the resources chosen for the implementation. Section 4 describes EzSearch architecture. Section 5 lists the changes to the indexer in order to support ranking. Section 6 illustrates how the team is organized. Section 7 lists the documentation and Section 8 lists the references.
2. Data
We will use the data parsed and
indexed by the project lswww, which in turn used a
subset of documents from a recent crawl of the
3. Resources
There is a large variety of programming languages that could be taken in consideration for this project. The following sections list the languages we chose, explain why we chose them.
3.1.
Java
Java will be the core language for our implementation. Our choice was mainly based on the fact that Java was the language used for lswww, the implementation we are going to build upon. We also considered our group’s past developing experience in Java as a factor in our decision.
3.2.
Perl
For the web interface implementation, we decided that Perl would be the best language for that task as a lot of online support is available for the PERL-CGI script and also, we have a basic interface implemented already as a part of Homewok1.
The CGI basically displays an
editable field where the user will type the query. It also displays the results
received from the server.
4. lswww-2
Architecture
4.1. Overall
Architecture

4.2. The CGI
Interface
The
CGI is a Perl interface that allows a user to send a query to the search engine
and check the results. The query can be composed of multiple terms. The Retriever
will be responsible to support multiple terms query (phrases).
4.3. Server
The
server receives a request from the client (CGI) and calls the retriever module.
The retriever will retrieve the documents and use the ranker module to rank the
results. The results are sent back to the client.
4.4. Retriever
The retriever module will call the
ranking module to see which pages have the highest weights, or to see which
pages have exact phrase matches. The top 100 of these results will be ordered
and sent to the server.
4.5. Ranker
After a
considerable discussion, we concluded that we would go in with the BM25F
(BM-Best Match) ranking function. The reason for this decision was that we
wanted to make use of the term weights and the term frequency to contribute
towards the ranking of the result set of documents relevant for the given
query.
We will
also give an exact phrase match a higher priority then any other ranking if
specified by the user. The user may specify that they wish to have an exact
phrase match simply by putting quotes around their query. This method will be
implemented by first checking if the first word in the query exists in a given
document, then checking if the second word in the query exists right after an
instance of the first word, and the process will be repeated for every word in
the query. If quotes are not used, then this check will not be done and ranking
will be done solely on the BM25F method.
BM25F
[1]
BMF25F is an algorithm to improve
web search ranking by incorporating user behavior. Document structure (or
fields), such as the title, headings and the anchor text have been shown to be
effective in Web Information Retrieval. The linear combination of scores, which
had been the approach, mostly used for the combination of fields, is difficult
to interpret due to the non-linear relation between the scores and the term
frequencies in each of the fields. In addition, the length normalization that
should be applied to each field depends on the nature of the field. Zaragoza et
al.[1] introduced a field-based version of BM25,
called BM25F, which applies length normalization and weighting of the fields
independently. This is an extension of BM25.
The indexer will be extended so
that the BM25F values for each term will be
pre-calculated during the indexing process.
5. lswww
lswww is the implementation we are going to build upon. We have chosen this indexer because it appears to be the most robust one out of the four available, as well as the fact that it is modularized so it will be easy to make modifications to it. It also already supports many of the capabilities that are needed, such as weighting. lswww is also known to be capable of performing the indexing on at least one of the compressed files without any flaws or crashes.
The main features that will be
added to the lswww indexing are:
·
Creation of a Links File: used to provide inlink/outlink search based on the anchor text information.
· Adjust weights calculation that is stored in the postings list for further use by the Ranker.
·
Adding the BM25F values to the postings list.
6. Assignment
of Tasks
All
the members will participate in all the modules. Every module has a focal
point: the member who is mainly responsible for making sure the tasks are
performed and the best results are achieved.
Every
member will closely participate in the ranking implementation, what we consider
the main module for this project.
Fabiana Prabhakar
·
Team Leader;
·
Focal point for:
client and server.
Shruti Bhandari
·
Focal point for:
documentation and ranking
Suyun Gu
·
Focal point for: javadoc and retrieval.
·
Special focus on the
challenges of phase searching.
7. Documentation
7.1. Design
document
http://www.cse.lehigh.edu/~sug2/ezsearch
7.2. Implementation document
Future
documentation.
8. References
[1]
S. Robertson, H. Zaragoza and M. Taylor. “Simple BM25 Extension to multiple Weighted
Fields”. CIKM’04,
9. Authors’ Signatures