By using SIAM Journals Online you agree to abide by the
Terms and Conditions of Use.

©  SIAM

 

SIAM Review

Previous Article
Education
The Education section is intended to provide modules that teachers and students of applied mathematics and scientific computation can use directly in studying these fields, in courses and beyond. Thes...
Next Article
Bubbles in Wet, Gummed Wine Labels
It is shown that bubbling on wine bottle labels is due to absorption of water from the glue, with subsequent hygroscopic expansion. Contrary to popular belief, most of the glue's water must be lost t...

You are logged in to this journal.

Matrices, Vector Spaces, and Information Retrieval

SIAM Rev. Volume 41, Issue 2, pp. 335-362 (1999)

Issue Date: 1999
FULL TEXT OPTIONS   (FREE)
Download PDF (380 kB) Download Compressed PostScript View Cart

The evolution of digital libraries and the Internet has dramatically transformed the processing, storage, and retrieval of information. Efforts to digitize text, images, video, and audio now consume a substantial portion of both academic and industrial activity. Even when there is no shortage of textual materials on a particular topic, procedures for indexing or extracting the knowledge or conceptual information contained in them can be lacking. Recently developed information retrieval technologies are based on the concept of a vector space. Data are modeled as a matrix, and a user's query of the database is represented as a vector. Relevant documents in the database are then identified via simple vector operations. Orthogonal factorizations of the matrix provide mechanisms for handling uncertainty in the database itself. The purpose of this paper is to show how such fundamental mathematical concepts from linear algebra can be used to manage and index large text collections.

©1999 Society for Industrial and Applied Mathematics

KEYWORDS and AMS

Keywords
AMS Subject Classifications
15-01, 15A03, 15A18, 65F50, 68P20

PUBLICATION DATA

ISSN:
0036-1445 (print)   1095-7200 (online)
Publisher:
AIP is a member of CrossRef SIAM

REFERENCES (60)

  1. E. Anderson, Z. Bai, C. Bischof, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, S. Ostrouchov, and D. Sorensen, LAPACK Users' Guide, SIAM, Philadelphia, second ed., 1995 doi:10.1137/0719089. [Inspec] [ISI] [ZentralblattMath] [MathRev]
  2. Harry Andrews, Claude Patterson, Outer product expansions and their uses in digital image processing, Amer. Math. Monthly, 82 (1975), 1–13
  3. Richard Barrett, Michael Berry, Tony Chan, Templates for the solution of linear systems: building blocks for iterative methods, Society for Industrial and Applied Mathematics (SIAM), 1994xiv+112 [ZentralblattMath] [MathRev]
  4. D. Berg, A Guide to the Oxford English Dictionary, Oxford University Press, Oxford, 1993.
  5. M. Berry, Large scale singular value computations, International Journal of Supercomputer Applications, 6 (1992), pp. 13–49. [Inspec]
  6. SVDPACK: A Fortran 77 software library for the sparse singular value decomposition, Tech. Rep. CS–92–159, University of Tennessee, Knoxville, TN, June 1992.
  7. Survey of public-domain Lanczos-based software, in Proceedings of the Cornelius Lanczos Centenary Conference, J. Brown, M. Chu, D. Ellison, and R. Plemmons, eds., 1997, pp. 332–334.
  8. M. Berry, T. Do, G. O'Brien, V. Krishna, and S. Varadhan, SVDPACKC: Version 1.0 user's guide, Tech. Rep. CS–93–194, University of Tennessee, Knoxville, TN, October 1993.
  9. Michael Berry, Susan Dumais, Gavin O'Brien, Using linear algebra for intelligent information retrieval, SIAM Rev., 37 (1995), 573–595
  10. M. Berry and R. Fierro, Low-rank orthogonal decompositions for information retrieval applications, Numerical Linear Algebra With Applications, 3 (1996), pp. 301–328.
  11. M. Berry, B. Hendrickson, and P. Raghavan, Sparse matrix reordering schemes for browsing hypertext, in Lectures in Applied Mathematics Vol. 32: The Mathematics of Numerical Analysis, J. Renegar, M. Shub, and S. Smale, eds., American Mathematical Society, 1996, pp. 99–123.
  12. M. Berry and D. Witter, Intelligent information management using latent semantic indexing, in Proceedings of Interface'97, Interface of North America Foundation, 1997.
  13. K. Bharat and A. Broder, Estimating the relative size and overlap of public web search engines, in 7th International World Wide Web Conference, Elsevier Science, 1998.
  14. Paper FP37.
  15. Åke Björck, Numerical methods for least squares problems, Society for Industrial and Applied Mathematics (SIAM), 1996xviii+408 [MathRev]
  16. D. Bogard, ed., The Bowker Annual Library and Book Trade Almanac, R. R. Bowker Co., New Providence, N. J., 43rd ed., 1998.
  17. Books in Print, R. R. Bowker Co., New York, 1997/8.
  18. R. Bording, A. Gertsztenkorn, L. Lines, J. Scales, and S. Treitel, Applications of seismic travel-time tomography, Geophys. J. R. Astr. Soc., 90 (1987), pp. 285–304. [Inspec] [ISI]
  19. C. Buckley, G. Salton, J. Allan, and A. Singhanl, Automatic query expansion using SMART: TREC 3, in Overview of the Third Text REtrieval Conference, D. Harman, ed., National Institute of Standards and Technology Special Publication 500-226, April 1995.
  20. S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, 41 (1990), pp. 391–407. [ISI]
  21. S. Dumais, Improving the retrieval of information from external sources, Behavior Research Methods, Instruments, & Computers, 23 (1991), pp. 229–236.
  22. LSI meets TREC: A status report., in The First Text REtrieval Conference, D. Harman, ed., National Institute of Standards and Technology Special Publication 500-207, March 1993, pp. 137–152.
  23. Latent Semantic Indexing (LSI) and TREC-2, in The Second Text REtrieval Conference, D. Harman, ed., National Institute of Standards and Technology Special Publication 500-215, March 1994, pp. 105–116.
  24. C. Eckart and G. Young, The approximation of one matrix by another lower rank, Psychometrika, 1 (1936), pp. 211–218.
  25. P. Foltz and S. Dumais, Personalized information delivery: An analysis of information filtering methods, Communications of the ACM, 35 (1992), pp. 51–60. [Inspec] [ISI]
  26. W. Frakes and R. Baeza-Yates, Information Retrieval: Data Structures & Algorithms, Prentice-Hall, Englewood Cliffs, NJ, 1992.
  27. ftp://ftp.cs.cornell.edu/pub/smart/med/. May 27, 1998.
  28. Dario Bini, Using FFT-based techniques in polynomial and matrix computations: recent advances and applications, Proceedings of the International Conference on Fourier Analysis and Applications (Kuwait, 1998), Vol. 21, 2000, 47–66 [MathRev]
  29. G. H. Golub, V. Klema, and G. W. Stewart, Rank degeneracy and least squares problems, Technical Report TR-456, Department of Computer Science, University of Maryland, 1976.
  30. D. Harman, Overview of the third text REtrieval conference (TREC-3), in Overview of the Third Text REtrieval Conference, D. Harman, ed., National Institute of Standards and Technology Special Publication 500-226, April 1995, pp. 1–21.
  31. D. Harman and E. Voorhees, Overview of the fifth text REtrieval conference (TREC-5), in Information Technology: The Fifth Text REtrieval Conference (TREC-5), D. Harman and E. Voorhees, eds., National Institute of Standards and Technology Special Publication 500-238, November 1996, pp. 1–28.
  32. http://lcweb.loc.gov/. April 24, 1998.
  33. http://www.searchenginewatch.com/. March 25, 1998.
  34. http://www.tipster.org. May 28, 1998.
  35. W. Jones and G. Furnas, Pictures of relevance: A geometric analysis of similarity measures, Journal of the American Society for Information Science, 38 (1987), pp. 420–442.
  36. W. Kahan, Conserving confluence curbs ill-conditioning, Technical Report 6, Computer Science Department, University of California, Berkeley, 1972.
  37. T. Kolda and D. O'Leary, A semi-discrete matrix decomposition for latent semantic indexing in information retrieval, ACM Transactions on Information Systems, (1998). To appear.
  38. G. Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer Academic Publishers, Boston, 1997.
  39. Cornelius Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, J. Research Nat. Bur. Standards, 45 (1950), 255–282 [MathRev]
  40. S. Lawrence and C. Giles, Searching the world wide web, Science, 280 (1998), pp. 98–100.
  41. R. Lehoucq, Analysis and Implementation of an Implicitly Restarted Arnoldi Iteration, PhD thesis, Rice University, Houston, TX, 1995.
  42. R. Lehoucq, D. Sorensen, Deflation techniques for an implicitly restarted Arnoldi iteration, SIAM J. Matrix Anal. Appl., 17 (1996), 789–821
  43. T. Letsche and M. Berry, Large-scale information retrieval with latent semantic indexing, Information Sciences, 100 (1997), pp. 105–137. [Inspec]
  44. J. Lovins, Development of a stemming algorithm, Mechanical Translation and Computational Linguistics, 11 (1968), pp. 22–31.
  45. L. Mirsky, Symmetric gauge functions and unitarily invariant norms, Quart. J. Math. Oxford Ser. (2), 11 (1960), 50–59 [MathRev]
  46. Cleve Moler, Donald Morrison, Singular value analysis of cryptograms, Amer. Math. Monthly, 90 (1983), 78–87 [MathRev]
  47. G. O'Brien, Information management tools for updating an SVD-encoded indexing scheme, Master's thesis, University of Tennessee, Knoxville, TN, 1994.
  48. Françoise Tisseur, Jack Dongarra, A parallel divide and conquer algorithm for the symmetric eigenvalue problem on distributed memory architectures, SIAM J. Sci. Comput., 20 (1999), 2223–2236 [MathRev]
  49. L. Rabiner and R. Schafer, Digital Processing of Speech Signals, Prentice Hall, Englewood Cliffs, N. J., first ed., 1978.
  50. H. Rutishauser, Simultaneous iteration method for symmetric matrices, Numerische Mathematik, 16 (1970), pp. 205–223. [ISI] [ZentralblattMath]
  51. G. Salton, Automatic Information Organization and Retrieval, McGraw Hill, New York, 1968.
  52. G. Salton and C. Buckley, Improving retrieval performance by relevance feedback, Journal of the American Society for Information Science, 41 (1990), pp. 288–297.
  53. G. Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw Hill, New York, 1983.
  54. Ahmed Sameh, John Wisniewski, A trace minimization algorithm for the generalized eigenvalue problem, SIAM J. Numer. Anal., 19 (1982), 1243–1259
  55. J. Scales, P. Dochery, and A. Gerszternkorn, Regularization of nonlinear inverse problems: imaging the near-surface weathering layer, Inverse Problems, 6 (1990), pp. 115–131.
  56. Hongyuan Zha, Horst Simon, On updating problems in latent semantic indexing, SIAM J. Sci. Comput., 21 (1999), 782–791
  57. K. Sparck Jones, A statistical interpretation of term specificity and its applications in retrieval, Journal of Documentation, 28 (1972), pp. 11–21.
  58. G. Stewart, Rank degeneracy, SIAM J. Sci. Statist. Comput., 5 (1984), 403–413
  59. Ulrich's international periodicals directory, R. R. Bowker Co., New York, 1998.
  60. Sowmini Varadhan, Michael Berry, Gene Golub, Approximating dominant singular triplets of large sparse matrices via modified moments, Numer. Algorithms, 13 (1996), 123–152
  61. D. Witter, Downdating the latent semantic indexing model for information retrieval, Master's thesis, University of Tennessee, Knoxville, TN, 1997.