Compressed
Permuterm Index
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
Paolo Ferragina, Università di Pisa
A basic problem
Given a dictionary D of strings, having variable length, design a
compressed data structure that supports
1) string  id
2) Prefix(a): find all s in D that are prefixed by a
3) Suffix(b): find all s in D that are suffixed by b
4) Substring(g): find all s in D that contain g
5) PrefixSuffix(a,b) = Prefix(a)  Suffix(b)
(Compacted) Trie
 Two versions: for D and for DR
+ Intersect answers
 Need to store D for resolving edge-labels
Paolo Ferragina, Università di Pisa
A basic problem
Given a dictionary D of strings, having variable length, compress them
in a way that we can efficiently support
1) string  id
2) Prefix(a): find all s in D that are prefixed by a
3) Suffix(b): find all s in D that are suffixed by b
4) Substring(g): find all s in D that contain by g
5) PrefixSuffix(a,b) = Prefix(a)  Suffix(b)
Permuterm Index
(Garfield, 76)
 Reduce any query to a “prefix query” over a larger dictionary
Paolo Ferragina, Università di Pisa
Permuterm Index
[Garfield, 1976]
Take a dictionary D={yahoo,google}
1. Append a special char $ to the end of each string
2. Generate all rotations of these strings
Permuterm
Dictionary
yahoo$
ahoo$y
hoo$ya
oo$yah
o$yaho
$yahoo
google$
oogle$g
ogle$go
gle$goo
le$goog
e$googl
$google
Paolo Ferragina, Università di Pisa
Prefix(ya) = Prefix($ya)
Suffix(oo) = Prefix(oo$)
Substring(oo) = Prefix(oo)
PrefixSuffix(y,o)= Prefix(o$y)
Space problems
Any query on D reduces to a prefix-query on P[D]
The FM-index
[Ferragina-Manzini, JACM ‘05]
The result:


Count(P): O(p) time

Locate(P): O(occ * polylog(|T|)) time

Display( T[i,i+L] ): O( L + polylog(|T|) ) time

Space occupancy: |T|
Hk(T)
+ o(|T| log |S|) bits
 New concept: The FM-index is an opportunistic data structure
The main idea is to reduce substring search to
some basic operations over arrays of symbols
Paolo Ferragina, Università di Pisa
Compressed Permuterm index
builds upon the best two features
of the FM-index
Third ingredient: FM-index substring search
P = si
unknown
occ=2
[lr-fr+1]
fr
lr
Paolo Ferragina, Università di Pisa
#mississipp
i#mississip
ippi#missis
issippi#mis
ississippi#
mississippi
pi#mississi
ppi#mississ
sippi#missi
sissippi#mi
ssippi#miss
ssissippi#m
L
i
p
s
s
m
#
p
i
s
s
i
i
Count(P[1,p]):
 Finds <fr,lr> in O(p) time
Compressed Permuterm Index
Lexicographically sorted
Z = $hat$hip$hop$hot$#
Build FM-index to support substring searches
Some queries are trivial...
 Prefix(a) = Substring search($a) within Z
 Suffix(b) = Substring search(b$) within Z
 Substr(g) = Substring search(g) within Z
Paolo Ferragina, Università di Pisa
PrefixSuffix search
unknown
i=2
Key property:
Last char of si is at L[i+1]
Cyclic-LF[i]
If (i > #D) return LF[i]
else return LF[i+1]
CLF[2]
LF[2]
Paolo Ferragina, Università di Pisa
PrefixSuffix(ho,p)
unknown
$ho
LF
CLF
PrefixSuffix(P):
Search FM-index of Z using
Cyclic-LF instead of LF
Paolo Ferragina, Università di Pisa
No change in time/space bounds
of compressed indexes
Rank and Select of strings
unknown
Z = $hat$hip$hop$hot$#
Other queries...
 Rank(s) = row of $s$
 Select(i)= backw from L[i+1]
Paolo Ferragina, Università di Pisa
A test on URLs
Choose your
trade-off
% dict-size
msec/char, and space close to bzip
• Time close to Front-Coding (4 msec/char), but <50% of its space
• Time of 2060
Paolo Ferragina, Università di Pisa
Scarica

Compressed Permuterm Index