Data structures: time, I/Os,
entropy, joules
Paolo Ferragina
Dipartimento di Informatica
Università di Pisa
Our driving moral...
Big steps come from theory
... but do NOT forget practice ;-)
Strings... why?
 Ubiquitous: any datum is a sequence of bits, hence a string
 Spur new problems in many areas:
Geometry

String-similarity search

Lower/upper-bounds to indexing  via reductions to geo-problem
 Points in high-dim space and NN-search
Graphs

Doc-doc similarity graph

Query-log graphs

Data compression
 ubiquitous in Text/Web mining
 edge iff 2 queries clicked on the same res-page
 Shortest paths on char-based weighted graphs
[Ferragina et al, SODA 09, ESA 09]
(String-)Dictionary
Problem
Given a dictionary D of K strings, of total length
N, store them in a way that we can efficiently
support prefix-searches for a pattern P.
Exact search  Hashing
Mitzenmacher, ESA invited ‘09
(Compacted)
Dominated the string-matching
[Fredkin, CACM
scene
1960]
in the ‘80s-90s with its suffix-version:
the Suffix Tree
Trie
Performance:
0
• Search ≈ O(|P|) time
• Space ≈ O(K + N)
1
y
2
stile
zyg
(2; 3,5)
5
1
etic
ial
2
3
s
z
aibelyite
2
omo
7
5
czecin
ygy
4
6
systile syzygetic syzygial syzygy szaibelyite szczecin szomo
Timeline:
theory and practice...
What about
Software Engineers ??
(Compacted)
Dominated the string-matching
[Fredkin, CACM
scene
1960]
in the ‘80s-90s with its suffix-version:
the Suffix Tree
Trie
Performance:
0
• Search ≈ O(|P|) time
• Space ≈ O(K + N)
1
y
... But in practice…
2
• Search: random memory accesses stile
• Space: len + pointers + strings
zyg
(2; 3,5)
5
1
etic
ial
2
3
s
z
aibelyite
2
omo
7
5
czecin
ygy
4
6
systile syzygetic syzygial syzygy szaibelyite szczecin szomo
 What did systems implement?
Used the Compacted trie, of course, but with 2 other concerns
because of large data
1° issue: space concern
….systile syzygetic syzygial syzygy….
2
5
3345%
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
Front
Coding
5
0
33
34
38
38
34
35
35
33
42
25
25
38
33
http://checkmate.com/All_Natural/
Applied.html
roma.html
1.html
tic_Art.html
yate.html
er_Soap.html
urvedic_Soap.html
Bath_Salt_Bulk.html
s.html
Essence_Oils.html
Mineral_Bath_Crystals.html
Salt.html
Cream.html
0
http://checkmate.com/All/Natural/Washcloth.html...
http://checkmate.com/All/Natural/Washcloth.html
...
Bender et al., PODS 2006
Ferragina et al., PODS 2008
2° issue: hierarchical memory
track
M
Spatial locality or Temporal locality
caching: less I/Os
Less and Faster I/Os
1
CPU
Internal
Memory
Count I/Os
B
HD
2-level indexing
2 advantages:
• Search ≈ typically 1 I/O
• Space ≈ Front-coding over buckets
Internal
Memory
CT
on a sample
2 limitations:
• Sampling rate ≈ lengths of sampled strings
• Trade-off ≈ speed vs space
(because of bucket size)
systile szaielyite
(Prefix) B-tree
Disk
….0systile 2zygetic 5ial 5y
0szaibelyite 2czecin 2omo….
Timeline:
theory and practice...
Do we need to trade
space by I/Os ?
Space
+
Hierarchical Memory
[Morrison, J.ACM 1968]
An old idea: Patricia Trie
0
1
y
2
stile
zyg
etic
2
3
2
omo
7
5
ygy
ial
z
aibelyite
5
1
s
4
czecin
6
[Ferragina-Grossi, J.ACM 1999]
A new search
0
Search(P):
• Phase 1: tree navigation
• Phase 2: Compute LCP
• Phase 3: tree navigation
y
2
s
z
g < ya
z
5
1
Three-phase search:
P = syzyyea
s
P’s position
2
0
1
o
2
5
c
Only 1 string ise checked
y
i
Trie Space ≈ #strings,
NOT their length
….systile syzygetic syzygial syzygy szaibelyite szczecin szomo….
[Ferragina-Grossi, J.ACM 1999]
> 15 US-patents cite it !!
The String B-tree
[Handbook of Comp. Biology, 2009]
+
Search(P)
•O((p/B) logB K) I/Os
•O(occ/B) I/Os
1 string checked : O(p/B)
PT
29
13 20 18
3
O(logB K) levels
23
It is dynamic...
PT
29
2
PT
26 13
PT
29
1
9
20 25
PT
5
2
26
10
4
PT
6
PT
7 13
Lexicographic position of P
20 16
28
18
3 14
PT
8 25
6
12 15 22 18
21
23
PT
3
27
24 11
PT
14
21
17
23
Knuth, vol 3°, pag. 489: “elegant”
I/O-aware
algorithms & data structures
I/Os
was the
main concern
[CACM 1988]
[2006]
Huge literature !!
Timeline:
theory and practice...
Not just 2 memory levels
Space
CPU
registers
 Parameter-free solutions

L1
L2
Cache
Anywhere, anytime, anyway... I/O-optimal !!
RAM
HD
net
Cache-oblivious Algo. and Data Str.
See chap by Arge, Brodal, Fagerberg
Some precious achievements...
 Cache-oblivious trie
 Static dictionary of strings [Brodal et al, SODA 2006]
 Cache-oblivious String B-tree
 Dynamic dictionary of strings [Bender et al, PODS 2006]
 Cache-oblivious tree mapping
 Patricia Trie
 Split-and-Refine that applies to any B-fixed tree partitioning
[Alstrup et al, manuscript 2003]
 Worst-case solution [Demaine et al, manuscript 2004]
Timeline:
theory and practice...
Not just 2 memory levels
Cache-oblivious
data structures
Compressed
data structures
Space
A challenging question
[Ken Church, AT&T 1995]
Soft. Eng. use many “squeezing heuristics” that
compress data and still support fast access to them
Can we “automate” and “guarantee” the process ?
Aka: Compressed self-indexes
Opportunistic Data Structures with Applications
P. Ferragina, G. Manzini
...now, J.ACM 2005


Space for text + (full-text) index  compressed text ( Hk)
Query/Decompression time  theoretically (quasi-)optimal
[Burrows-Wheeler, 1994]
The big (unconscious) step...
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows
#
i
i
i
i
m
p
p
s
s
s
s
mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m
i
p
s
s
m
#
p
i
s
s
i
i
Can we compress it ?
[Burrows-Wheeler, 1994]
The big (unconscious) step...
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows
bwt(T)
#
i
i
i
i
m
p
p
s
s
s
s
mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m
i
p
s
s
m
#
p
i
s
s
i
i
T
bzip2 = BWT + other simple compressors
[Burrows-Wheeler, 1994]
The big (unconscious) step...
Let us given a text T = mississippi#
mississippi#
ississippi#m
ssissippi#mi
sissippi#mis
issippi#miss
ssippi#missi
sippi#missis
ippi#mississ
ppi#mississi
pi#mississip
i#mississipp
#mississippi
Sort the rows
bwt(T)
#
i
i
i
i
m
p
p
s
s
s
s
mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m
i
p
s
s
m
#
p
i
s
s
i
i
T
Suffix Array
bzip2 = BWT + other simple compressors
From practice to theory...
bwt(T)
#
i
i
i
i
m
p
p
s
s
s
s
mississipp
#mississip
ppi#missis
ssippi#mis
ssissippi#
ississippi
i#mississi
pi#mississ
ippi#missi
issippi#mi
sippi#miss
sissippi#m
i
p
s
s
m
#
p
i
s
s
i
i
FM-index = BWT is searchable
...or
Suffix Array is compressible
• Space = l |T| Hk + o(|T|) bits
• Search(P) = O(p + occ * polylog(|T|))
Nowadays tons of papers: theory & experiments
[Navarro-Makinen, ACM Comp. Surveys 2007]
Compressed & Searchable data formats
Texts
Trees
Graphs
SODA 2002
…
FOCS 2008
STACS 2009
SODA 2002
SODA 2007
ICALP 2007
SWAT 2008
ICALP 2009
SODA 2010
DCC 2001
WWW 2004
ISAAC 2007
ESA 2008
FOCS 2009
Labeled Trees
Functions
Point Sets
SODA 2002
FOCS 2005
WWW 2006
SODA 2007
ICDE 2010
ICALP 2003, 04
SODA 2004
ICALP 2008
ESA 2009
LATIN 2010
SODA 2003
TALG 2007
WADS 2009
SODA 2009
FOCS 2000
SODA 2003, 04
SODA 2007
SPIRE 2007
CPM 2008
CPM 2010
ICALP 2010
Integer Sets
Images
DCC 2008
[December 2003]
[January 2005]
ACM J. on Experimental Algorithmics, 2009
> 103 faster than Smith-W.
>102 faster than SOAP & Maq
What about the Web ?
[Ferragina-Manzini, ACM WSDM 2010]
Where we are nowadays
Cache-oblivious
data structures
Compressed
data structures
Something is known... yet very preliminary
[PODS ‘08, Navarro, Vitter, ...]
Bellazougui et al, this ESA
What else...
[E. Gal, S. Toledo. ACM Comp. Surv., 2005]
[Ajwani et al, WEA 2009]
 Solid-state disks: no mechanical parts
 ... very fast reads, but slow writes & wear leveling
 Self-adjusting or Weighted design
 Time ops depend on some (un/known) distribution
 Challenging : no pointers, self-adjust (perf) vs compression (space)
A bigger challenge:
from micro to macro !
IEEE Computer, 2007
Approach #1
(engineering oriented)
News: Proper system components + specific algorithms
 Sanders & Meyer’s groups, IEEE Conf. on Green Comp. 2010 [SSDisks + Atom + Sort]
Approach #2
(Manage resources)
Goal: Develop on-line algorithms that dynamically manage
power by trading off performance, energy and reliability
 Susanne Albers, Comm. ACM 2010
Approach #3
(models and algorithms)
IEEE Computer, 2009
“Algorithmics offers benefits that extend far
beyond TCS into the design of systems.”
Sometimes energy is a primary resource!
Energy-aware Algo+Ds ?
Memory-level impacts
I/Os and compression
are obviously important
BUT
here there is a new twist
MIPS per Watt ?
Battery life !!
Who cares whether your application:
1. is y% slower than optimal, but it is more energy efficient ?
2. occupies x% more space than optimal, but decompr is faster ?
MIPS per Watt ?
Idea:
Multi-objective optimization
in data-structure design
Battery life !!
Stay tuned:
Algorithm Library
for Mobile Phones
v
Hbase - Hadoop
BigTable, 2006
Cosmos
HyperTable
Cassandra
Real-time search
Q&A social search
Knowledge search
Many ingredients
 Items are graphs, vectors, strings, …
 Number and size are VERY large
 Involve many resources to be optimized:

Time (speed/patience)

Space (#disks/management costs)

Bandwidth (speed/€)

Energy (€)
Multi-objective optimization
in data-structure design!
That’s all !
 Look at my paper in the proceedings
Scarica

Web Algorithmics - Dipartimento di Informatica