Book Chapter (16 pages)
Postscript (510KB) PDF (133KB)
April Kontostathis, William M. Pottenger, and Brian D. Davison
In this chapter we analyze the values used by Latent Semantic Indexing (LSI) for information retrieval. By manipulating the values in the Singular Value Decomposition (SVD) matrices, we find that a significant fraction of the values have little effect on overall performance and can thus be removed (changed to zero). This allows us to convert the dense term-by-dimension and document-by-dimension matrices into sparse matrices by identifying and removing those entries. We empirically show that these entries are unimportant by presenting retrieval and runtime performance results on seven collections: removal of up to 70% of the values in the term-by-dimension matrix results in retrieval performance similar to or better than LSI. Removal of 90% of the values degrades retrieval performance slightly for smaller collections, but improves retrieval performance by 60% on the large TREC collection we tested. Our approach has the additional computational benefit of reducing memory requirements and query response time.
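The core idea described above can be sketched in a few lines of Python. This is a minimal illustration under assumed details, not the chapter's implementation: the toy matrix, the number of LSI dimensions `k`, and the rank-by-magnitude sparsification rule are all illustrative choices.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy term-by-document count matrix (8 terms x 5 docs);
# sizes and values are illustrative, not drawn from the chapter.
rng = np.random.default_rng(0)
A = rng.poisson(1.0, size=(8, 5)).astype(float)

k = 3  # number of retained LSI dimensions (assumption)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk = U[:, :k]      # dense term-by-dimension matrix
Vk = Vt[:k, :].T   # dense document-by-dimension matrix

def sparsify(M, frac):
    """Zero the smallest-magnitude entries so that ~frac of M is zero,
    then store the result as a sparse matrix."""
    n_zero = int(frac * M.size)
    smallest = np.argsort(np.abs(M).ravel())[:n_zero]
    out = M.copy().ravel()
    out[smallest] = 0.0
    return csr_matrix(out.reshape(M.shape))

# Remove ~70% of the entries, as in the best-performing setting reported.
Uk_sparse = sparsify(Uk, 0.7)
Vk_sparse = sparsify(Vk, 0.7)
```

Because the surviving entries are stored in sparse form, the query-vector projection (`q @ Uk_sparse`) touches only the retained values, which is where the memory and query-time savings come from.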
Book chapter in T. Y. Lin, S. Ohsuga, C.J. Liau, and S. Tsumoto (eds.), Foundations of Data Mining and Knowledge Discovery, Studies in Computational Intelligence, Volume 6, pages 333-346, Springer-Verlag, 2005