The Future of Web Search
Barcelona, May 2006
XML
Compression and Indexing
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
[Joint with F. Luccio, G. Manzini, S. Muthukrishnan]
Paolo Ferragina, Università di Pisa
Under patenting by
Pisa-Rutgers Univ.
Compressed
Permuterm Index
Paolo Ferragina, Rossano Venturini
Dipartimento di Informatica, Università di Pisa
Paolo Ferragina, Università di Pisa
Under
Y!-patenting
A basic problem
Given a dictionary D of strings, having variable length, design a
compressed data structure that supports
1) string  id
2) Prefix(a): find all strings in D that are prefixed by a
3) Suffix(b): find all strings in D that are suffixed by b
4) Substring(g): find all strings in D that contain g
5) PrefixSuffix(a,b) = Prefix(a)  Suffix(b)
IR book of Manning-Raghavan-Schutze
 Tolerant Retrieval Problem (wildcards)
Paolo Ferragina, Università di Pisa
Prefix(a) = a*
Suffix(b) = *b
Substring(g) = *g*
PrefixSuffix(a,b) = a*b
A basic problem
Given a dictionary D of strings, having variable length, design a
compressed data structure that supports
1) string  id
2) Prefix(a): find all s in D that are prefixed by a
3) Suffix(b): find all s in D that are suffixed by b
4) Substring(g): find all s in D that contain g
5) PrefixSuffix(a,b) = Prefix(a)  Suffix(b)
Hashing
 Not exact searches
Paolo Ferragina, Università di Pisa
A basic problem
Given a dictionary D of strings, having variable length, design a
compressed data structure that supports
1) string  id
2) Prefix(a): find all s in D that are prefixed by a
3) Suffix(b): find all s in D that are suffixed by b
4) Substring(g): find all s in D that contain g
5) PrefixSuffix(a,b) = Prefix(a)  Suffix(b)
(Compacted) Trie
 Two versions: for D and for DR + Intersect answers
 No substring search (unless using Suffix Trie)
 Need to store D for resolving edge-labels
Paolo Ferragina, Università di Pisa
A basic problem
Given a dictionary D of strings, having variable length, design a
compressed data structure that supports
1) string  id
2) Prefix(a): find all s in D that are prefixed by a
3) Suffix(b): find all s in D that are suffixed by b
4) Substring(g): find all s in D that contain g
5) PrefixSuffix(a,b) = Prefix(a)  Suffix(b)
Front coding...
Paolo Ferragina, Università di Pisa
Front-coding
uk-2002 crawl ≈250Mb
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
3035%
0
33
34
38
38
34
35
35
33
42
25
25
38
33
http://checkmate.com/All_Natural/
Applied.html
roma.html
1.html
tic_Art.html
yate.html
er_Soap.html
urvedic_Soap.html
Bath_Salt_Bulk.html
s.html
Essence_Oils.html
Mineral_Bath_Crystals.html
Salt.html
Cream.html
0
http://checkmate.com/All/Natural/Washcloth.html...
http://checkmate.com/All/Natural/Washcloth.html
...
bzip ≈ 10%
Be back on this, later on!
 Two versions: for D and for DR + Intersect answers
 Need some extra data structures for bucket identification
 No substring search
Paolo Ferragina, Università di Pisa
A basic problem
Given a dictionary D of strings, having variable length, compress them
in a way that we can efficiently support
1) string  id
2) Prefix(a): find all s in D that are prefixed by a
3) Suffix(b): find all s in D that are suffixed by b
4) Substring(g): find all s in D that contain by g
5) PrefixSuffix(a,b) = Prefix(a)  Suffix(b)
Permuterm Index
(Garfield, 76)
 Reduce any query to a “prefix query” over a larger dictionary
Paolo Ferragina, Università di Pisa
Premuterm Index
[Garfield, 1976]
Take a dictionary D={yahoo,google}
1. Append a special char $ to the end of each string
2. Generate all rotations of these strings
Permuterm
Dictionary
yahoo$
ahoo$y
hoo$ya
oo$yah
o$yaho
$yahoo
google$
oogle$g
ogle$go
gle$goo
le$goog
e$googl
$google
Paolo Ferragina, Università di Pisa
Prefix(ya) = Prefix($ya)
Suffix(oo) = Prefix(oo$)
Substring(oo) = Prefix(oo)
PrefixSuffix(y,o)= Prefix(o$y)
Space problems
Any query on D reduces to a prefix-query on P[D]
Compressed Permuterm Index
SIGIR ‘07
It deploys two ingredients:
 Permuterm index
 Compressed full-text index
Theoretically:
 Query ops take optimal time: proportional to pattern length
 Space occupancy is |D| Hk(D) + o(|D| log |S|) bits
Technically:
A simple reduction step: Permuterm  Compressed index
 Re-use known machinery on compressed indexes
 Achieve bzip-compression at Front-coding speed
Paolo Ferragina, Università di Pisa
The Burrows-Wheeler Transform
(1994)
Take the text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Paolo Ferragina, Università di Pisa
Sort the rows
F
#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m
L
i
p
s
s
m
#
p
i
s
s
i
i
T
Compressing L is effective
Key observation:

L is locally homogeneous
L is highly compressible
 Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression !
Paolo Ferragina, Università di Pisa
The FM-index
[Ferragina-Manzini, JACM ‘05]
Survey of Navarro-Makinen
contains many other indexes
The result:


Count(P): O(p) time

Locate(P): O(occ * polylog(|T|)) time

Display( T[i,i+L] ): O( L + polylog(|T|) ) time

Space occupancy: |T|
Hk(T)
+ o(|T| log |S|) bits
 New concept: The FM-index is an opportunistic data structure
The main idea is to reduce substring search to
some basic operations over arrays of symbols
Paolo Ferragina, Università di Pisa
Compressed Permuterm index
builds upon the best two features
of the FM-index
First ingredient: L
F
#
i
i
i
i
m
p
p
s
s
s
s
unknown
mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m
Paolo Ferragina, Università di Pisa
L
i
p
s
s
m
#
p
i
s
s
i
i

F mapping
How do we map L’s onto F’s chars ?
... Need to distinguish equal chars in F...
Take two equal L’s chars
Rotate rightward their rows
Same relative order !!
First ingredient: L
1
2
6
7
9
F
#
i
i
i
i
m
p
p
s
s
s
s
unknown
mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m
L
i
p
s
s
m
#
p
i
s
s
i
i

F mapping
The oracle
Rank( s , 9 ) = 3
FM-index is actually
Rank ds over BWT
O(1) time and Hk-space
Paolo Ferragina, Università di Pisa
Second ingredient: Backward step
F
#
i
i
i
i
m
p
p
s
s
s
s
unknown
mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
LF
ississippi
i#mississi
LF
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m
Paolo Ferragina, Università di Pisa
L
i
p
s
s
m
#
p
i
s
s
i
i
T scanned backward
by using LF-mapping
...s s i...
Backward step(i):
 Return LF[i], in O(1) time
Third ingredient: substring search
P = si
unknown
occ=2
[lr-fr+1]
fr
lr
Paolo Ferragina, Università di Pisa
#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m
L
i
p
s
s
m
#
p
i
s
s
i
i
Count(P[1,p]):
 Finds <fr,lr> in O(p) time
The Comprressed Permuterm
Lexicographically sorted
Z = $hat$hip$hop$hot$#
Build FM-index to support substring searches
Some queries are trivial...
 Prefix(a) = Substring search($a) within Z
 Suffix(b) = Substring search(b$) within Z
 Substr(g) = Substring search(g) within Z
Paolo Ferragina, Università di Pisa
PrefixSuffix search
unknown
i=3
Key property:
Last char of si is at L[i+1]
Cyclic-LF[i]
If (i > #D) return LF[i]
else return LF[i+1]
LF[3]
CLF[3]
Paolo Ferragina, Università di Pisa
PrefixSuffix(ho,p)
unknown
$ho
LF
CLF
PrefixSuffix(P):
Search FM-index of Z using
Cyclic-LF instead of LF
Paolo Ferragina, Università di Pisa
No change in time/space bounds
of compressed indexes
Rank and Select of strings
unknown
Z = $hat$hip$hop$hot$#
Other queries...
 Rank(s) = row of $s$
 Select(i)= backw from L[i+1]
Paolo Ferragina, Università di Pisa
Experiments
Three dictionaries:
 Term dictionary: Trec WT10G
 Host dictionary (reversed): UK-2005
 Url dictionary (host reversed): first 190Mb of UK-2005
Term
Host
Url
size
118 Mb
34 Mb
190 Mb
# strings
10 Mil
2 Mil
3 Mil
FC
40%
45%
30%
bzip
33%
25%
10%
Paolo Ferragina, Università di Pisa
PrefixSuffix search needs *2
Paolo Ferragina, Università di Pisa
A test on URLs
MRS book says: “one
Choose your
trade-off
% dict-size
disadvantage of the PI is that its
dictionary becomes quite large,
including as it does all rotations
of each term”.
Now, they
mention CPI 
msec/char, and space close to bzip
• Time close to Front-Coding (4 msec/char), but <50% of its space
• Time of 2060
Paolo Ferragina, Università di Pisa
We proposed an approach for dictionary storage:
+ Theory: optimal time and entropy-bounds for space
+ Practice: trades time vs space, thus fitting user needs
Paolo Ferragina, Università di Pisa
Scarica

Compressed Permuterm Index