Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

Algoritmi per IR: Latent Semantic Indexing (not only Keyword Search…)

Speeding up cosine computation
- What if we could take our vectors and "pack" them into fewer dimensions (say 50,000 → 100) while preserving distances?
- Now the cost is O(nm); then it becomes O(km + kn), where k << n, m.
- Two methods: "Latent Semantic Indexing" and Random Projection.

Two approaches
- LSI is data-dependent: it creates a k-dimensional subspace by eliminating redundant axes and pulling together "related" axes (hopefully, car and automobile).
- Random projection is data-independent: it chooses a k-dimensional subspace that guarantees, with high probability, bounded stretching between pairs of points.

Notions from linear algebra
- Matrix A, vector v
- Matrix transpose (A^t)
- Matrix product
- Rank
- Eigenvalue λ and eigenvector v: Av = λv

Overview of LSI
- Pre-process the docs using a technique from linear algebra called Singular Value Decomposition.
- Create a new (smaller) vector space.
- Handle queries in this new vector space.

Intuition
- More than dimension reduction: derive a set of new, uncorrelated features (roughly, artificial concepts), one per dimension.
- Docs with lots of overlapping terms stay together; terms also get pulled together onto the same dimension.
- Each term or document is then characterized by a vector of weights indicating its strength of association with each of these underlying concepts.
- Ex.: car and automobile get pulled together, since they co-occur in docs with tires, radiator, cylinder, … Here comes the "semantic"!

Singular-Value Decomposition
- Recall the m × n term-by-document matrix A; A has rank r ≤ m, n.
- Define the term-term correlation matrix T = A A^t: a square, symmetric m × m matrix. Let P be the m × r matrix of eigenvectors of T.
- Define the doc-doc correlation matrix D = A^t A: a square, symmetric n × n matrix. Let R be the n × r matrix of eigenvectors of D.

A's decomposition
- There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-products) such that A = P Σ R^t, where Σ is an r × r diagonal matrix carrying the singular values of A (the square roots of the eigenvalues of T = A A^t) in decreasing order.
- In picture form: A (m × n) = P (m × r) · Σ (r × r) · R^t (r × n).

Dimensionality reduction
- For some k << r, zero out all but the k biggest singular values in Σ [the choice of k is crucial]. Denote by Σ_k this new version of Σ, which has rank k.
- Typically k is about 100, while r (A's rank) is > 10,000.
- A_k = P Σ_k R^t (still m × n): all but the first k columns of P and the first k rows of R^t are useless, due to the 0-columns/0-rows of Σ_k, so effectively A_k = P_k Σ_k R_k^t with P_k of size m × k, Σ_k of size k × k, and R_k^t of size k × n.

Guarantee
- A_k is a pretty good approximation to A: relative distances are (approximately) preserved.
- Of all m × n matrices of rank k, A_k is the best approximation to A w.r.t. the following measures:
  min over rank-k B of ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}
  min over rank-k B of ||A − B||_F^2 = ||A − A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2
  where the Frobenius norm satisfies ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2.
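To make the truncated-SVD step concrete, here is a minimal NumPy sketch (not part of the original slides; the tiny term-document matrix and the choice k = 2 are invented for illustration) that computes A = P Σ R^t, builds A_k, and checks the two error identities stated above.

```python
import numpy as np

# Toy term-document matrix A (m = 4 terms, n = 4 docs); the counts are made up.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],   # "car"
    [1.0, 0.0, 2.0, 0.0],   # "automobile"
    [0.0, 3.0, 0.0, 1.0],   # "fruit"
    [0.0, 2.0, 0.0, 2.0],   # "apple"
])

# Thin SVD: A = P @ diag(sigma) @ Rt, singular values returned in decreasing order.
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)

k = 2                                              # keep the k largest singular values
A_k = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]    # rank-k approximation of A

# Spectral-norm error equals sigma_{k+1} (index k, since Python is 0-based) ...
print(np.linalg.norm(A - A_k, ord=2), sigma[k])
# ... and the squared Frobenius-norm error equals sigma_{k+1}^2 + ... + sigma_r^2.
print(np.linalg.norm(A - A_k, ord='fro') ** 2, np.sum(sigma[k:] ** 2))
```

The two printed pairs coincide up to floating-point rounding, which is exactly the optimality guarantee above.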
Reduction
- X_k = Σ_k R^t is the k × n doc-matrix, i.e. the documents reduced to k dimensions (recall that P and R are formed by orthonormal eigenvectors of T and D).
- Take the doc-correlation matrix: D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t).
- Approximating Σ with Σ_k, we get A^t A ≅ X_k^t X_k (both are n × n matrices).
- We use X_k to define A's projection: from X_k = Σ_k R^t, substituting R^t = Σ^{-1} P^t A, we get X_k = P_k^t A. In fact, Σ_k Σ^{-1} P^t boils down to P_k^t, a k × m matrix.
- This means that to reduce a doc/query vector it is enough to multiply it by P_k^t. The cost of sim(q, d), over all docs d, is then O(kn + km) instead of O(mn).

Which are the concepts?
- The c-th concept is the c-th row of P_k^t (which is k × m). Denote it by P_k^t[c]; note that its size is m = #terms.
- P_k^t[c][i] = strength of association between the c-th concept and the i-th term.
- Projected document: d'_j = P_k^t d_j, where d'_j[c] = strength of concept c in d_j.
- Projected query: q' = P_k^t q, where q'[c] = strength of concept c in q.

Algoritmi per IR: Random Projection

An interesting math result!
- [Slide formula: a Johnson-Lindenstrauss-type statement, i.e. a map f into k dimensions that preserves every pairwise distance ||u − v|| up to a factor 1 ± ε.]
- Setting v = 0 we also get a bound on f(u)'s stretching, i.e. on how ||f(u)|| relates to ||u||.

What about the cosine-distance?
- [Slide derivation: the bounds on f(u)'s and f(v)'s stretching carry over to the cosine between f(u) and f(v).]

Defining the projection matrix R
- [Slide figure: the k columns of R.]

Is R a JL-embedding?
- A concentration bound, applied to that many point pairs: Gaussians are good!
- NOTE: every column of R has unit norm and is uniformly distributed over the unit sphere; moreover, the k columns of R are orthonormal on average.

A practical-theoretical idea!
- Choose the entries r_{i,j} of R so that E[r_{i,j}] = 0 and Var[r_{i,j}] = 1.

Question!
- Random projections hide large constants: k ≈ (1/ε)^2 · log d, so it may be large… but the projection is simple and fast to compute.
- LSI is intuitive and may scale to any k; it is optimal under various metrics, but costly to compute.
- What about Random LSI? It should combine the best of both.
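As a companion sketch for the random-projection part (not from the slides), the following NumPy code draws a k × m projection matrix with Gaussian entries, so that E[r_{i,j}] = 0 and Var[r_{i,j}] = 1, and checks on two toy vectors that distances and cosines are roughly preserved. The dimensions and the 1/√k scaling are my own choices, a common convention for JL-style projections.

```python
import numpy as np

rng = np.random.default_rng(0)

m, k = 10_000, 400                 # original and reduced dimensionality (illustrative)
u = rng.random(m)                  # two toy document vectors,
v = rng.random(m)                  # stand-ins for tf-idf vectors

# Projection matrix with E[r_ij] = 0 and Var[r_ij] = 1 (Gaussian entries);
# dividing by sqrt(k) makes squared lengths correct in expectation.
R = rng.standard_normal((k, m))

def project(x):
    return (R @ x) / np.sqrt(k)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

fu, fv = project(u), project(v)

# Distances and cosines before vs. after the projection: close with high probability.
print(np.linalg.norm(u - v), np.linalg.norm(fu - fv))
print(cosine(u, v), cosine(fu, fv))
```

Reducing the query costs O(km), after which each of the n cosines costs O(k), which gives the O(km + kn) bound mentioned at the beginning; larger k means smaller distortion ε, in line with k ≈ (1/ε)^2 · log d.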