Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

Algoritmi per IR: Latent Semantic Indexing (not only Keyword Search…)

Speeding up cosine computation
- What if we could take our vectors and "pack" them into fewer dimensions (say 50,000 → 100) while preserving distances?
- Now the cost is O(nm); then it becomes O(km + kn), where k << n, m.
- Two methods: "Latent Semantic Indexing" and Random Projection.

Two approaches
- LSI is data-dependent: it creates a k-dimensional subspace by eliminating redundant axes and pulling together "related" axes (hopefully, car and automobile).
- Random projection is data-independent: it chooses a k-dimensional subspace that guarantees, with high probability, bounded stretching between pairs of points.

Notions from linear algebra
- Matrix A, vector v
- Matrix transpose (A^t)
- Matrix product
- Rank
- Eigenvalue λ and eigenvector v: Av = λv

Overview of LSI
- Pre-process the docs using a technique from linear algebra called Singular Value Decomposition.
- Create a new (smaller) vector space.
- Handle queries in this new vector space.

Intuition
- More than dimension reduction: derive a set of new, uncorrelated features (roughly, artificial concepts), one per dimension.
- Docs with lots of overlapping terms stay together; terms also get pulled together onto the same dimension.
- Each term or document is then characterized by a vector of weights indicating its strength of association with each of these underlying concepts.
- Ex.: car and automobile get pulled together, since they co-occur in docs with tires, radiator, cylinder, … Here comes the "semantic"!

Singular-Value Decomposition
- Recall the m × n term-by-document matrix A; A has rank r ≤ m, n.
- Define the term-term correlation matrix T = A A^t: a square, symmetric m × m matrix. Let P be the m × r matrix of eigenvectors of T.
- Define the doc-doc correlation matrix D = A^t A: a square, symmetric n × n matrix. Let R be the n × r matrix of eigenvectors of D.

A's decomposition
- There exist matrices P (for T, m × r) and R (for D, n × r) formed by orthonormal columns (unit dot-products) such that A = P Σ R^t, where Σ is an r × r diagonal matrix carrying the singular values of A (the square roots of the eigenvalues of T = A A^t) in decreasing order.
- In picture form: A (m × n) = P (m × r) · Σ (r × r) · R^t (r × n).

Dimensionality reduction
- For some k << r, zero out all but the k biggest singular values in Σ [the choice of k is crucial]. Denote by Σ_k this new version of Σ, which has rank k.
- Typically k is about 100, while r (A's rank) is > 10,000.
- A_k = P Σ_k R^t (still m × n): all but the first k columns of P and the first k rows of R^t are useless, due to the 0-columns/0-rows of Σ_k, so effectively A_k = P_k Σ_k R_k^t with P_k of size m × k, Σ_k of size k × k, and R_k^t of size k × n.

Guarantee
- A_k is a pretty good approximation to A: relative distances are (approximately) preserved.
- Of all m × n matrices of rank k, A_k is the best approximation to A w.r.t. the following measures:
  min over rank-k B of ||A − B||_2 = ||A − A_k||_2 = σ_{k+1}
  min over rank-k B of ||A − B||_F^2 = ||A − A_k||_F^2 = σ_{k+1}^2 + σ_{k+2}^2 + … + σ_r^2
  where the Frobenius norm satisfies ||A||_F^2 = σ_1^2 + σ_2^2 + … + σ_r^2.
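To make the truncated-SVD step concrete, here is a minimal NumPy sketch (not part of the original slides; the tiny term-document matrix and the choice k = 2 are invented for illustration) that computes A = P Σ R^t, builds A_k, and checks the two error identities stated above.

```python
import numpy as np

# Toy term-document matrix A (m = 4 terms, n = 4 docs); the counts are made up.
A = np.array([
    [2.0, 0.0, 1.0, 0.0],   # "car"
    [1.0, 0.0, 2.0, 0.0],   # "automobile"
    [0.0, 3.0, 0.0, 1.0],   # "fruit"
    [0.0, 2.0, 0.0, 2.0],   # "apple"
])

# Thin SVD: A = P @ diag(sigma) @ Rt, singular values returned in decreasing order.
P, sigma, Rt = np.linalg.svd(A, full_matrices=False)

k = 2                                              # keep the k largest singular values
A_k = P[:, :k] @ np.diag(sigma[:k]) @ Rt[:k, :]    # rank-k approximation of A

# Spectral-norm error equals sigma_{k+1} (index k, since Python is 0-based) ...
print(np.linalg.norm(A - A_k, ord=2), sigma[k])
# ... and the squared Frobenius-norm error equals sigma_{k+1}^2 + ... + sigma_r^2.
print(np.linalg.norm(A - A_k, ord='fro') ** 2, np.sum(sigma[k:] ** 2))
```

The two printed pairs coincide up to floating-point rounding, which is exactly the optimality guarantee above.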
Reduction
- X_k = Σ_k R^t is the k × n doc-matrix, i.e. the documents reduced to k dimensions (recall that P and R are formed by orthonormal eigenvectors of T and D).
- Take the doc-correlation matrix: D = A^t A = (P Σ R^t)^t (P Σ R^t) = (Σ R^t)^t (Σ R^t).
- Approximating Σ with Σ_k, we get A^t A ≅ X_k^t X_k (both are n × n matrices).
- We use X_k to define A's projection: from X_k = Σ_k R^t, substituting R^t = Σ^{-1} P^t A, we get X_k = P_k^t A. In fact, Σ_k Σ^{-1} P^t boils down to P_k^t, a k × m matrix.
- This means that to reduce a doc/query vector it is enough to multiply it by P_k^t. The cost of sim(q, d), over all docs d, is then O(kn + km) instead of O(mn).

Which are the concepts?
- The c-th concept is the c-th row of P_k^t (which is k × m). Denote it by P_k^t[c]; note that its size is m = #terms.
- P_k^t[c][i] = strength of association between the c-th concept and the i-th term.
- Projected document: d'_j = P_k^t d_j, where d'_j[c] = strength of concept c in d_j.
- Projected query: q' = P_k^t q, where q'[c] = strength of concept c in q.

Algoritmi per IR: Random Projection

An interesting math result!
- [Slide formula: a Johnson-Lindenstrauss-type statement, i.e. a map f into k dimensions that preserves every pairwise distance ||u − v|| up to a factor 1 ± ε.]
- Setting v = 0 we also get a bound on f(u)'s stretching, i.e. on how ||f(u)|| relates to ||u||.

What about the cosine-distance?
- [Slide derivation: the bounds on f(u)'s and f(v)'s stretching carry over to the cosine between f(u) and f(v).]

Defining the projection matrix R
- [Slide figure: the k columns of R.]

Is R a JL-embedding?
- A concentration bound, applied to that many point pairs: Gaussians are good!
- NOTE: every column of R has unit norm and is uniformly distributed over the unit sphere; moreover, the k columns of R are orthonormal on average.

A practical-theoretical idea!
- Choose the entries r_{i,j} of R so that E[r_{i,j}] = 0 and Var[r_{i,j}] = 1.

Question!
- Random projections hide large constants: k ≈ (1/ε)^2 · log d, so it may be large… but the projection is simple and fast to compute.
- LSI is intuitive and may scale to any k; it is optimal under various metrics, but costly to compute.
- What about Random LSI? It should combine the best of both.
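As a companion sketch for the random-projection part (not from the slides), the following NumPy code draws a k × m projection matrix with Gaussian entries, so that E[r_{i,j}] = 0 and Var[r_{i,j}] = 1, and checks on two toy vectors that distances and cosines are roughly preserved. The dimensions and the 1/√k scaling are my own choices, a common convention for JL-style projections.

```python
import numpy as np

rng = np.random.default_rng(0)

m, k = 10_000, 400                 # original and reduced dimensionality (illustrative)
u = rng.random(m)                  # two toy document vectors,
v = rng.random(m)                  # stand-ins for tf-idf vectors

# Projection matrix with E[r_ij] = 0 and Var[r_ij] = 1 (Gaussian entries);
# dividing by sqrt(k) makes squared lengths correct in expectation.
R = rng.standard_normal((k, m))

def project(x):
    return (R @ x) / np.sqrt(k)

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

fu, fv = project(u), project(v)

# Distances and cosines before vs. after the projection: close with high probability.
print(np.linalg.norm(u - v), np.linalg.norm(fu - fv))
print(cosine(u, v), cosine(fu, fv))
```

Reducing the query costs O(km), after which each of the n cosines costs O(k), which gives the O(km + kn) bound mentioned at the beginning; larger k means smaller distortion ε, in line with k ≈ (1/ε)^2 · log d.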