On Compression and Indexing:
two sides of the same coin
Paolo Ferragina
Dipartimento di Informatica, Università di Pisa
Paolo Ferragina, Università di Pisa
What do we mean by “Indexing” ?

Types of data:
 Linguistic or tokenizable text
 Raw sequences of characters or bytes: DNA sequences, audio-video files, executables

Types of query:
 Word-based query
 Character-based query: arbitrary substring, complex match
Two indexing approaches:
• Word-based indexes, where a notion of “word” must be devised !
» Inverted files, Signature files, Bitmaps.
• Full-text indexes, with no constraint on text and queries !
» Suffix Array, Suffix tree, String B-tree [Ferragina-Grossi, JACM 99].
What do we mean by “Compression” ?
 Compression has two positive effects:
 Space saving (or, double the memory at the same cost)
 Performance improvement:
» Better use of the memory levels close to the processor
» Increased disk and memory bandwidth
» Reduced (mechanical) seek time
» CPU speed nowadays makes (de)compression “costless” !!
Moral:
It is more economical to store data in compressed form than uncompressed.
 Since March 2001 the Memory eXpansion Technology (MXT) has been available on IBM eServers x330
 Same performance as a PC with double the memory, but at half the cost
Compression and Indexing: Two sides of the same coin !
 Do we witness a paradoxical situation ?
 An index injects redundant data, in order to speed up the pattern searches
 Compression removes redundancy, in order to squeeze the space occupancy
 NO: new results prove a mutual-reinforcement behaviour !
 Better indexes can be designed by exploiting compression techniques (in terms of space occupancy)
 Better compressors can be designed by exploiting indexing techniques (also in terms of compression ratio)
Moral:
CPM researchers must have a multidisciplinary background, ranging from data-structure design to data compression, from architectural knowledge to database principles, to algorithmic engineering and more...
Our journey, today...

Index design (Weiner ’73) → Suffix Array (1990)
Compressor design (Shannon ’48) → Burrows-Wheeler Transform (1994)

Combining the two lines we get:
 Compressed Index
» Space close to gzip, bzip
» Query time close to O(|P|)
 Compression Booster
» A tool to transform a poor compressor into a better compression algorithm
 Wavelet Tree
» Improved indexes and compressors
The Suffix Array
[BaezaYates-Gonnet, 87 and Manber-Myers, 90]

T = mississippi#   (SA stores, in lexicographic order, the suffix pointers into T)

Prop 1. All suffixes in SUF(T) having prefix P are contiguous
Prop 2. These suffixes follow P’s lexicographic position

SA   SUF(T)
12   #
11   i#
8    ippi#
5    issippi#
2    ississippi#
1    mississippi#
10   pi#
9    ppi#
7    sippi#
4    sissippi#
6    ssippi#
3    ssissippi#

Example: P = si selects the two contiguous rows pointing to positions 7 and 4.
Storing SUF(T) explicitly takes Θ(N²) space; SA + T occupy Θ(N log₂ N) bits.
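The table above can be reproduced with a few lines of Python. This is a didactic sketch: sorting the suffixes directly costs O(N² log N) in the worst case, unlike the linear-time constructions used in practice.

```python
# A minimal suffix-array construction for the slide's example.
def suffix_array(t):
    # Sort the suffix start positions by the suffixes they point to;
    # report 1-based positions, matching the slide's numbering.
    return [i + 1 for i in sorted(range(len(t)), key=lambda i: t[i:])]

T = "mississippi#"  # '#' is the terminator, smaller than any letter
print(suffix_array(T))  # [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
```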
Searching in Suffix Array
[Manber-Myers, 90]

T = mississippi#,  P = si

Suffix Array search:
• O(log₂ N) binary-search steps
• Each step takes O(|P|) char comparisons
⇒ overall, O(|P| log₂ N) time
The final window consists of the two contiguous rows for sippi# and sissippi#.

Refinements:
 O(|P| + log₂ N) time [Manber-Myers, 90]
 O(|P|/B + log_B N) I/Os [JACM 99]
 Self-adjusting version on disk [FOCS 02]

A remark on space: the suffix permutation cannot be an arbitrary permutation of {1,...,N}, since
# binary texts = 2^N ≪ N! = # permutations on {1, 2, ..., N}
⇒ N log₂ N bits is not a lower bound on the bit-space occupancy
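The O(|P| log₂ N) search can be sketched as follows; `count_occurrences` is an illustrative name, not code from the paper.

```python
# A sketch of the Manber-Myers counting search: two binary searches
# delimit the window [fr, lr) of suffixes prefixed by P. Each probe
# compares at most |P| chars, giving O(|P| log N) time overall.
def count_occurrences(t, sa, p):
    def boundary(upper):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = t[sa[mid] - 1 : sa[mid] - 1 + len(p)]  # SA is 1-based
            if prefix < p or (upper and prefix == p):
                lo = mid + 1
            else:
                hi = mid
        return lo

    fr, lr = boundary(False), boundary(True)
    return lr - fr

T = "mississippi#"
SA = [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
print(count_occurrences(T, SA, "si"))  # 2
```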
The Burrows-Wheeler Transform
(1994)

Consider the text T = mississippi# and all of its cyclic rotations:
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows

F              L
# mississipp   i
i #mississip   p
i ppi#missis   s
i ssippi#mis   s
i ssissippi#   m
m ississippi   #
p i#mississi   p
p pi#mississ   i
s ippi#missi   s
s issippi#mi   s
s sippi#miss   i
s sissippi#m   i

F = first column and L = last column of the sorted rotation matrix of T = mississippi#, so L = bwt(T) = ipssm#pissii.
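The transform can be computed exactly as the slide does it, by sorting the rotations. A didactic Python sketch (real tools derive L from the suffix array instead of materializing the Θ(N²) matrix):

```python
# Burrows-Wheeler Transform via explicit rotation sorting:
# sort all cyclic rotations and take the last column.
def bwt(t):
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(row[-1] for row in rotations)

print(bwt("mississippi#"))  # ipssm#pissii
```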
Why is L so interesting for compression ?

A key observation:
 L is locally homogeneous
 L is highly compressible

Algorithm Bzip :
 Move-to-Front coding of L
 Run-Length coding
 Statistical coder: Arithmetic, Huffman

 Bzip vs. Gzip: compression ratio of 20% vs. 33%, but Bzip is slower in (de)compression !

Building the BWT ⇔ SA construction
Inverting the BWT ⇔ array visit
...overall O(N) time...
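The Move-to-Front stage of the pipeline can be sketched as follows (an illustrative toy, not bzip2’s actual coder, which also run-length-encodes and works on a byte alphabet):

```python
# Move-to-Front coding of L = bwt(T): locally homogeneous input
# yields many small ranks (runs of equal chars become runs of 0),
# which the later run-length and statistical coders exploit.
def move_to_front(s, alphabet):
    table = list(alphabet)             # current symbol-to-rank table
    out = []
    for c in s:
        r = table.index(c)             # rank of c in the table
        out.append(r)
        table.insert(0, table.pop(r))  # move c to the front
    return out

L = "ipssm#pissii"                     # bwt(mississippi#)
print(move_to_front(L, "#imps"))       # [1, 3, 4, 0, 4, 4, 3, 4, 4, 0, 1, 0]
```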
Suffix Array vs. BW-transform

SA   Rotated text   L
12   #mississipp    i
11   i#mississip    p
8    ippi#missis    s
5    issippi#mis    s
2    ississippi#    m
1    mississippi    #
10   pi#mississi    p
9    ppi#mississ    i
7    sippi#missi    s
4    sissippi#mi    s
6    ssippi#miss    i
3    ssissippi#m    i

Since # is the smallest char, sorting the rotations equals sorting the suffixes: row i starts at position SA[i], and L[i] = T[SA[i] − 1] (L[i] = # when SA[i] = 1).

L includes SA and T. Can we search within L ?
A compressed index
[Ferragina-Manzini, IEEE Focs 2000]

Bridging data-structure design and compression techniques:
 Suffix array data structure
 Burrows-Wheeler Transform

The theoretical result:
 Query complexity: O(p + occ log^ε N) time
 Space occupancy: O(N Hk(T)) + o(N) bits, where Hk is the k-th order empirical entropy  o(N) if T is compressible
The index does not depend on k, and the bound holds for all k simultaneously.

The corollary is that:
 The Suffix Array is compressible
 It is a self-index

A plethora of papers followed: O(n)-space indexes, then Hk-space ones (Grossi-Gupta-Vitter 03, Sadakane 02, ...). By now, more than 20 papers by more than 20 authors on related subjects.

In practice, the index is very appealing:
 Space close to the best known compressors, i.e. bzip
 Query time of a few millisecs on hundreds of MBs
A useful tool: the L → F mapping

How do we map L’s chars onto F’s chars ?
... We need to distinguish equal chars: equal chars preserve their relative order, i.e. the c-th occurrence of a char in L corresponds to the c-th occurrence of that char in F.
Substring search in T (Count occurrences)

P = si

Precompute the array C: for each char c, C[c] = number of chars of T smaller than c.
For T = mississippi#:  C[#] = 0, C[i] = 1, C[m] = 5, C[p] = 6, C[s] = 8.

First step: P[2] = “i” ⇒ the window [fr, lr] spans the rows prefixed by char “i”.

Inductive step: Given fr, lr for P[j+1, p]:
 Take P[j]
 Find the first P[j] in L[fr, lr]
 Find the last P[j] in L[fr, lr]
 The L-to-F mapping of these two chars gives the new window [fr, lr], i.e. the rows prefixed by P[j, p]

Final step: the rows prefixed by “si” give occ = lr − fr + 1 = 2.
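The counting procedure can be sketched over the plain string L. Note that a real FM-index answers the rank queries with compressed o(N)-bit data structures rather than by scanning L as this toy does; `bwt_count` is an illustrative name.

```python
# FM-index style backward search (count only) over L = bwt(T).
def bwt_count(L, P):
    # C[c] = number of chars in L (hence in T) smaller than c
    C, tot = {}, 0
    for c in sorted(set(L)):
        C[c] = tot
        tot += L.count(c)

    def rank(c, i):                # occurrences of c in L[0:i]
        return L[:i].count(c)

    fr, lr = 0, len(L)             # window [fr, lr) of matching rows
    for c in reversed(P):          # scan the pattern backward
        if c not in C:
            return 0
        fr = C[c] + rank(c, fr)
        lr = C[c] + rank(c, lr)
        if fr >= lr:
            return 0
    return lr - fr

L = "ipssm#pissii"                 # bwt(mississippi#)
print(bwt_count(L, "si"))          # 2
```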
Many details are missing...
 The column L is actually kept compressed:
• Still guarantees O(p) time to count P’s occurrences
 The Locate operation takes O(log^ε N) time
• Via some additional data structures, in o(N) bits
 Efficient and succinct index construction [Hon et al., Focs 03]
 Bio-application: fit a Human-genome index in a PC [Sadakane et al., 02]

Interesting issues:
 What about arbitrary alphabets ? [Grossi et al., 03; Ferragina et al., 04]
 What about disk-aware, or cache-oblivious, or self-adjusting versions ?
 What about challenging applications: bio, db, data mining, handheld PCs, ...
Where we are ...
We investigated the reinforcement relation:
Compression ideas → Index design
Let’s now turn to the other direction:
Indexing ideas → Compressor design (the Booster)
What do we mean by “boosting” ?
It is a technique that takes a poor compressor A and turns it into a compressor with better performance guarantees.
A memoryless compressor is poor in that it assigns codewords to symbols according only to their frequencies (e.g. Huffman).
It incurs some obvious limitations, e.g.:
T = a^n b^n  vs.  T’ = a random string of length 2n with the same number of ‘a’s and ‘b’s:
a memoryless compressor encodes both in the same number of bits, although T is far more compressible.
What do we mean by “boosting” ? (cont.)

Booster: feeding T to A alone yields c; feeding T to A through the Booster yields c’.

Qualitatively, we would like to achieve:
 The more compressible T is, the shorter c’ is; in particular, c’ is shorter than c if T is compressible
 Time(A^boost) = O(Time(A)), i.e. no slowdown
 A is used as a black-box

Two Key Components: Burrows-Wheeler Transform and Suffix Tree
The empirical entropy H0

H0(T) = − ∑i (ni/n) log₂ (ni/n),  where ni = frequency in T of the i-th symbol and n = |T|

 |T| H0(T) is the best you can hope for from a memoryless compressor
 E.g. Huffman or Arithmetic coders come close to this bound

We get better compression by using a codeword that depends on the k symbols preceding the one to be compressed (its context).
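The formula translates directly into code; a small sketch:

```python
# Empirical entropy H0 of a string, straight from the formula above.
from collections import Counter
from math import log2

def h0(t):
    n = len(t)
    return -sum((c / n) * log2(c / n) for c in Counter(t).values())

# a^4 b^4: each symbol has frequency 1/2, so H0 = 1 bit per symbol
print(h0("aaaabbbb"))  # 1.0
```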
The empirical entropy Hk

Hk(T) = (1/|T|) ∑|w|=k |T[w]| H0(T[w])

 T[w] = string of the symbols that precede the occurrences of w in T
Example: Given T = “mississippi”, we have T[i] = mssp, T[is] = ms

 Compress T up to Hk(T) ⇔ compress all the T[w] up to their H0 (using Huffman or Arithmetic coding, for every k-long context w)

Problems with this approach:
• How do we go from all the T[w] back to the string T ?  → BWT
• How do we choose the best k efficiently ?  → Suffix Tree
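A sketch of Hk following the slide’s definition (T[w] collects the symbols preceding each occurrence of w); `hk` is an illustrative name:

```python
# k-th order empirical entropy: group the symbols of T by the
# k-long context that follows them, then average the H0 of the groups.
from collections import Counter, defaultdict
from math import log2

def h0(s):
    n = len(s)
    return -sum((c / n) * log2(c / n) for c in Counter(s).values())

def hk(t, k):
    groups = defaultdict(list)
    for i in range(1, len(t) - k + 1):
        groups[t[i:i + k]].append(t[i - 1])  # symbol preceding context w
    return sum(len(g) * h0(g) for g in groups.values()) / len(t)

T = "mississippi"
# T["i"] = mssp, T["is"] = ms, as in the slide's example
print(round(hk(T, 1), 3))  # 1.091
```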
Use the BWT to approximate Hk

T = mississippi#

In the sorted rotation matrix, the rows whose suffix starts with a given k-long context w are contiguous, and the corresponding chars of bwt(T) are exactly the symbols preceding the occurrences of w in T. Hence every T[w] shows up as a contiguous piece of bwt(T), up to a permutation of its chars.
Remember that...

Hk(T) = (1/|T|) ∑|w|=k |T[w]| H0(T[w])

 Compress T up to Hk(T)
⇔ compress all the T[w] up to their H0
⇔ compress the pieces of bwt(T) up to their H0

Example: T[is] = ms is a permutation of a piece of bwt(T).

What are the pieces of BWT to compress ?
 Partitioning bwt(T) by the 1-long contexts and compressing each piece up to its H0, we achieve H1(T)
 Partitioning bwt(T) by the 2-long contexts, we achieve H2(T)

We thus have a workable way to approximate Hk.
Finding the “best pieces” to compress...

[Figure: the suffix tree of T = mississippi#, whose leaves, in row order, correspond to the rows of the BWT matrix. Two leaf covers are highlighted: L1, whose nodes spell the 1-long contexts and induce the H1(T) partition of bwt(T), and L2, whose nodes spell the 2-long contexts and induce the H2(T) partition.]

Some leaf covers are “related” to Hk !!!
Goal: find the best BWT-partition induced by a Leaf Cover !!
A compression booster
[Ferragina-Manzini, SODA 04; further results joint with Giancarlo-Sciortino]

 Let A be the compressor we wish to boost
 Let bwt(T) = t1, …, tr be the partition induced by the leaf cover L, and define cost(L, A) = ∑j |A(tj)|

Goal: Find the leaf cover L* of minimum cost
 A post-order visit of the suffix tree suffices, hence linear time
 We have: Cost(L*, A) ≤ Cost(Lk, A) ≈ Hk(T), for every k

Technically, we show that
|c’| ≤ λ |s| Hk(s) + f(|s|) + log₂ |s| + g_k,   for all k ≥ 0

Researchers may now concentrate on the “apparently” simpler task of designing 0-th order compressors.
May we close “the mutual reinforcement cycle” ?

 The Wavelet Tree [Grossi-Gupta-Vitter, Soda 03]
 Using the WT within each piece of the optimal BWT-partition, we get:
» A compressed index that scales well with the alphabet size [joint with Manzini, Makinen, Navarro]
» A reduction of the compression problem to achieving H0 on binary strings [joint with Giancarlo, Manzini, Sciortino]

Interesting issues:
 What about the space needed to construct the BWT ?
 What about these tools for “XML” or “Images” ?
 Other application contexts: bio, db, data mining, network, ...
 From Theory to Technology ! Libraries, Libraries, .... [e.g. LEDA]
A historical perspective

Shannon showed a “narrower” result, for a stationary ergodic source S:
 Idea: compress groups of k chars of the string T
 Result: compression ratio → the entropy of S, for k → ∞
 It has various limitations, which the booster overcomes:
» It works for a source S; the booster works for any string s
» It must modify A’s structure, because of the alphabet change; the booster uses A as a black-box
» For a given string T, the best k is found by trying k = 0, 1, …, |T|, an Ω(|T|²)-time slowdown; the booster runs in O(|s|) time
» k is eventually fixed, which is not an optimal choice !; the booster uses variable-length contexts

Two Key Components: Burrows-Wheeler Transform and Suffix Tree
How do we find the “best” partition (i.e. the best k) ?
 “Approximate” it via MTF [Burrows-Wheeler, ‘94]
» MTF is efficient in practice [bzip2]
» Theory and practice showed that we can aim for more !
 Use Dynamic Programming [Giancarlo-Sciortino, CPM ’03]
» It finds the optimal partition
» Very slow: the time complexity is cubic in |T|
Surprisingly, full-text indexes help in finding the optimal partition in optimal linear time !!
Example: no single k is best

s = (bad)^n (cad)^n (xy)^n (xz)^n

1-long contexts:  s_a = d^2n,  s_x = y^n z^n
2-long contexts:  s_ba = d^n, s_ca = d^n  (no better than s_a)
                  s_yx = y^(n-1), s_zx = z^(n-1)  (strictly better than s_x)

For the ‘a’ contexts, length 1 already yields homogeneous pieces, while the ‘x’ contexts need length 2: the best context length varies within the same string.
Word-based compressed index

What about the word-based occurrences of P ?

T = … bzip … bzip2 … unbzip2 … unbzip …
P = bzip matches as a word, as a prefix, as a substring, and as a suffix of T’s words, so filtering the character-based occurrences in a post-processing phase can be time consuming !

The FM-index can be adapted to support word-based searches:
 Preprocess T and transform it into a “digested” text DT, so that a word-search in T ⇔ a substring-search in DT
 Use the FM-index over the “digested” DT
The WFM-index

Variant of the Huffman algorithm:
 The symbols of the Huffman tree are the words of T
 The Huffman tree has fan-out 128, so each codeword is a sequence of 7-bit units
 Codewords are byte-aligned and tagged: the spare 8th bit marks the first byte of each codeword, so P’s codeword cannot falsely match across codeword boundaries

[Figure: the tagged Huffman tree for T = “bzip or not bzip”, with codewords for the words bzip, or, not; the digested text DT is the concatenation of the byte-aligned codewords, and searching P = bzip in T becomes searching bzip’s codeword in DT.]

WFM-index:
1. Dictionary of words
2. Huffman tree
3. FM-index built on DT

In practice: space ~ 22 %, word search ~ 4 ms
The BW-Transform is invertible

F = # i i i i m p p s s s s,  L = i p s s m # p i s s i i   (first and last columns of the sorted rotation matrix of T = mississippi#)

Two key properties:
1. We can map L’s chars to F’s chars
2. T = .... L[ i ] F[ i ] ...,  i.e. L[i] immediately precedes F[i] in T

Reconstruct T backward:
T = .... i ppi #

Building the BWT ⇔ SA construction
Inverting the BWT ⇔ array visit
...overall O(N) time...
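The backward reconstruction can be sketched via the L-to-F mapping; `inverse_bwt` is an illustrative name.

```python
# BWT inversion using the LF mapping (key property 1): the i-th
# occurrence of a char in L is the i-th occurrence of that char in F,
# so LF[i] gives the row whose first char is L[i].
def inverse_bwt(L, terminator="#"):
    n = len(L)
    # rank[i] = occurrence index of L[i] among equal chars in L[0:i+1]
    seen, rank = {}, []
    for c in L:
        seen[c] = seen.get(c, 0) + 1
        rank.append(seen[c] - 1)
    # first[c] = first row of F (= sorted L) starting with c
    first = {}
    for i, c in enumerate(sorted(L)):
        first.setdefault(c, i)
    LF = [first[c] + rank[i] for i, c in enumerate(L)]
    # Walk backward from row 0, which starts with the terminator;
    # each step emits the char preceding the current row (property 2).
    row, out = 0, []
    for _ in range(n):
        out.append(L[row])
        row = LF[row]
    return "".join(reversed(out[:-1])) + terminator

print(inverse_bwt("ipssm#pissii"))  # mississippi#
```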
The Wavelet Tree
[Grossi-Gupta-Vitter, Soda 03]

mississippi → 00110110110   (root: {i, m} → 0, {p, s} → 1)
 left child:  miiii → 10000   (i → 0, m → 1)
 right child: sssspp → 111100   (p → 0, s → 1)

• Use the WT within each BWT piece
 Alphabet independence
 Binary string compression/indexing
[collaboration: Giancarlo, Manzini, Makinen, Navarro, Sciortino]
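A toy wavelet tree over the slide’s example; `build` and `rank` are illustrative names, and a real implementation would store rank-ready compressed bitvectors instead of Python lists.

```python
# Wavelet tree: each node stores a bitvector marking whether a char
# belongs to the right half of its alphabet; rank queries recurse
# level by level, so only binary rank is ever needed.
def build(s, alphabet):
    if len(alphabet) == 1:
        return None                    # leaf: a single symbol
    mid = len(alphabet) // 2
    left_set = set(alphabet[:mid])
    bits = [0 if c in left_set else 1 for c in s]
    left = "".join(c for c in s if c in left_set)
    right = "".join(c for c in s if c not in left_set)
    return {"bits": bits, "alphabet": alphabet,
            "left": build(left, alphabet[:mid]),
            "right": build(right, alphabet[mid:])}

def rank(node, c, i):
    # occurrences of c in s[0:i], walking root-to-leaf
    if node is None:
        return i
    mid = len(node["alphabet"]) // 2
    go_right = c in node["alphabet"][mid:]
    bit = 1 if go_right else 0
    j = sum(1 for b in node["bits"][:i] if b == bit)
    return rank(node["right"] if go_right else node["left"], c, j)

wt = build("mississippi", "imps")      # sorted alphabet i < m < p < s
print(wt["bits"])                      # [0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
print(rank(wt, "s", 7))                # 4  ('s' occurs 4 times in "mississ")
```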