Data structures: time, I/Os, entropy, joules Paolo Ferragina Dipartimento di Informatica Università di Pisa Our driving moral... Big steps come from theory ... but do NOT forget practice ;-) Strings... why? Ubiquitous: any datum is a sequence of bits, hence a string Spur new problems in many areas: Geometry String-similarity search Lower/upper-bounds to indexing via reductions to geo-problem Points in high-dim space and NN-search Graphs Doc-doc similarity graph Query-log graphs Data compression ubiquitous in Text/Web mining edge iff 2 queries clicked on the same res-page Shortest paths on char-based weighted graphs [Ferragina et al, SODA 09, ESA 09] (String-)Dictionary Problem Given a dictionary D of K strings, of total length N, store them in a way that we can efficiently support prefix-searches for a pattern P. Exact search Hashing Mitzenmacher, ESA invited ‘09 (Compacted) Dominated the string-matching [Fredkin, CACM scene 1960] in the ‘80s-90s with its suffix-version: the Suffix Tree Trie Performance: 0 • Search ≈ O(|P|) time • Space ≈ O(K + N) 1 y 2 stile zyg (2; 3,5) 5 1 etic ial 2 3 s z aibelyite 2 omo 7 5 czecin ygy 4 6 systile syzygetic syzygial syzygy szaibelyite szczecin szomo Timeline: theory and practice... What about Software Engineers ?? (Compacted) Dominated the string-matching [Fredkin, CACM scene 1960] in the ‘80s-90s with its suffix-version: the Suffix Tree Trie Performance: 0 • Search ≈ O(|P|) time • Space ≈ O(K + N) 1 y ... But in practice… 2 • Search: random memory accesses stile • Space: len + pointers + strings zyg (2; 3,5) 5 1 etic ial 2 3 s z aibelyite 2 omo 7 5 czecin ygy 4 6 systile syzygetic syzygial syzygy szaibelyite szczecin szomo What did systems implement? Used the Compacted trie, of course, but with 2 other concerns because of large data 1° issue: space concern ….systile syzygetic syzygial syzygy…. 2 5 3345% http://checkmate.com/All_Natural/ http://checkmate.com/All_Natural/Applied.html http://checkmate.com/All_Natural/Aroma.html http://checkmate.com/All_Natural/Aroma1.html http://checkmate.com/All_Natural/Aromatic_Art.html http://checkmate.com/All_Natural/Ayate.html http://checkmate.com/All_Natural/Ayer_Soap.html http://checkmate.com/All_Natural/Ayurvedic_Soap.html http://checkmate.com/All_Natural/Bath_Salt_Bulk.html http://checkmate.com/All_Natural/Bath_Salts.html http://checkmate.com/All/Essence_Oils.html http://checkmate.com/All/Mineral_Bath_Crystals.html http://checkmate.com/All/Mineral_Bath_Salt.html http://checkmate.com/All/Mineral_Cream.html Front Coding 5 0 33 34 38 38 34 35 35 33 42 25 25 38 33 http://checkmate.com/All_Natural/ Applied.html roma.html 1.html tic_Art.html yate.html er_Soap.html urvedic_Soap.html Bath_Salt_Bulk.html s.html Essence_Oils.html Mineral_Bath_Crystals.html Salt.html Cream.html 0 http://checkmate.com/All/Natural/Washcloth.html... http://checkmate.com/All/Natural/Washcloth.html ... Bender et al., PODS 2006 Ferragina et al., PODS 2008 2° issue: hierarchical memory track M Spatial locality or Temporal locality caching: less I/Os Less and Faster I/Os 1 CPU Internal Memory Count I/Os B HD 2-level indexing 2 advantages: • Search ≈ typically 1 I/O • Space ≈ Front-coding over buckets Internal Memory CT on a sample 2 limitations: • Sampling rate ≈ lengths of sampled strings • Trade-off ≈ speed vs space (because of bucket size) systile szaielyite (Prefix) B-tree Disk ….0systile 2zygetic 5ial 5y 0szaibelyite 2czecin 2omo…. Timeline: theory and practice... Do we need to trade space by I/Os ? Space + Hierarchical Memory [Morrison, J.ACM 1968] An old idea: Patricia Trie 0 1 y 2 stile zyg etic 2 3 2 omo 7 5 ygy ial z aibelyite 5 1 s 4 czecin 6 [Ferragina-Grossi, J.ACM 1999] A new search 0 Search(P): • Phase 1: tree navigation • Phase 2: Compute LCP • Phase 3: tree navigation y 2 s z g < ya z 5 1 Three-phase search: P = syzyyea s P’s position 2 0 1 o 2 5 c Only 1 string ise checked y i Trie Space ≈ #strings, NOT their length ….systile syzygetic syzygial syzygy szaibelyite szczecin szomo…. [Ferragina-Grossi, J.ACM 1999] > 15 US-patents cite it !! The String B-tree [Handbook of Comp. Biology, 2009] + Search(P) •O((p/B) logB K) I/Os •O(occ/B) I/Os 1 string checked : O(p/B) PT 29 13 20 18 3 O(logB K) levels 23 It is dynamic... PT 29 2 PT 26 13 PT 29 1 9 20 25 PT 5 2 26 10 4 PT 6 PT 7 13 Lexicographic position of P 20 16 28 18 3 14 PT 8 25 6 12 15 22 18 21 23 PT 3 27 24 11 PT 14 21 17 23 Knuth, vol 3°, pag. 489: “elegant” I/O-aware algorithms & data structures I/Os was the main concern [CACM 1988] [2006] Huge literature !! Timeline: theory and practice... Not just 2 memory levels Space CPU registers Parameter-free solutions L1 L2 Cache Anywhere, anytime, anyway... I/O-optimal !! RAM HD net Cache-oblivious Algo. and Data Str. See chap by Arge, Brodal, Fagerberg Some precious achievements... Cache-oblivious trie Static dictionary of strings [Brodal et al, SODA 2006] Cache-oblivious String B-tree Dynamic dictionary of strings [Bender et al, PODS 2006] Cache-oblivious tree mapping Patricia Trie Split-and-Refine that applies to any B-fixed tree partitioning [Alstrup et al, manuscript 2003] Worst-case solution [Demaine et al, manuscript 2004] Timeline: theory and practice... Not just 2 memory levels Cache-oblivious data structures Compressed data structures Space A challenging question [Ken Church, AT&T 1995] Soft. Eng. use many “squeezing heuristics” that compress data and still support fast access to them Can we “automate” and “guarantee” the process ? Aka: Compressed self-indexes Opportunistic Data Structures with Applications P. Ferragina, G. Manzini ...now, J.ACM 2005 Space for text + (full-text) index compressed text ( Hk) Query/Decompression time theoretically (quasi-)optimal [Burrows-Wheeler, 1994] The big (unconscious) step... Let us given a text T = mississippi# mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss ssippi#missi sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi Sort the rows # i i i i m p p s s s s mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missi issippi#mi sippi#miss sissippi#m i p s s m # p i s s i i Can we compress it ? [Burrows-Wheeler, 1994] The big (unconscious) step... Let us given a text T = mississippi# mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss ssippi#missi sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi Sort the rows bwt(T) # i i i i m p p s s s s mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missi issippi#mi sippi#miss sissippi#m i p s s m # p i s s i i T bzip2 = BWT + other simple compressors [Burrows-Wheeler, 1994] The big (unconscious) step... Let us given a text T = mississippi# mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss ssippi#missi sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi Sort the rows bwt(T) # i i i i m p p s s s s mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missi issippi#mi sippi#miss sissippi#m i p s s m # p i s s i i T Suffix Array bzip2 = BWT + other simple compressors From practice to theory... bwt(T) # i i i i m p p s s s s mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missi issippi#mi sippi#miss sissippi#m i p s s m # p i s s i i FM-index = BWT is searchable ...or Suffix Array is compressible • Space = l |T| Hk + o(|T|) bits • Search(P) = O(p + occ * polylog(|T|)) Nowadays tons of papers: theory & experiments [Navarro-Makinen, ACM Comp. Surveys 2007] Compressed & Searchable data formats Texts Trees Graphs SODA 2002 … FOCS 2008 STACS 2009 SODA 2002 SODA 2007 ICALP 2007 SWAT 2008 ICALP 2009 SODA 2010 DCC 2001 WWW 2004 ISAAC 2007 ESA 2008 FOCS 2009 Labeled Trees Functions Point Sets SODA 2002 FOCS 2005 WWW 2006 SODA 2007 ICDE 2010 ICALP 2003, 04 SODA 2004 ICALP 2008 ESA 2009 LATIN 2010 SODA 2003 TALG 2007 WADS 2009 SODA 2009 FOCS 2000 SODA 2003, 04 SODA 2007 SPIRE 2007 CPM 2008 CPM 2010 ICALP 2010 Integer Sets Images DCC 2008 [December 2003] [January 2005] ACM J. on Experimental Algorithmics, 2009 > 103 faster than Smith-W. >102 faster than SOAP & Maq What about the Web ? [Ferragina-Manzini, ACM WSDM 2010] Where we are nowadays Cache-oblivious data structures Compressed data structures Something is known... yet very preliminary [PODS ‘08, Navarro, Vitter, ...] Bellazougui et al, this ESA What else... [E. Gal, S. Toledo. ACM Comp. Surv., 2005] [Ajwani et al, WEA 2009] Solid-state disks: no mechanical parts ... very fast reads, but slow writes & wear leveling Self-adjusting or Weighted design Time ops depend on some (un/known) distribution Challenging : no pointers, self-adjust (perf) vs compression (space) A bigger challenge: from micro to macro ! IEEE Computer, 2007 Approach #1 (engineering oriented) News: Proper system components + specific algorithms Sanders & Meyer’s groups, IEEE Conf. on Green Comp. 2010 [SSDisks + Atom + Sort] Approach #2 (Manage resources) Goal: Develop on-line algorithms that dynamically manage power by trading off performance, energy and reliability Susanne Albers, Comm. ACM 2010 Approach #3 (models and algorithms) IEEE Computer, 2009 “Algorithmics offers benefits that extend far beyond TCS into the design of systems.” Sometimes energy is a primary resource! Energy-aware Algo+Ds ? Memory-level impacts I/Os and compression are obviously important BUT here there is a new twist MIPS per Watt ? Battery life !! Who cares whether your application: 1. is y% slower than optimal, but it is more energy efficient ? 2. occupies x% more space than optimal, but decompr is faster ? MIPS per Watt ? Idea: Multi-objective optimization in data-structure design Battery life !! Stay tuned: Algorithm Library for Mobile Phones v Hbase - Hadoop BigTable, 2006 Cosmos HyperTable Cassandra Real-time search Q&A social search Knowledge search Many ingredients Items are graphs, vectors, strings, … Number and size are VERY large Involve many resources to be optimized: Time (speed/patience) Space (#disks/management costs) Bandwidth (speed/€) Energy (€) Multi-objective optimization in data-structure design! That’s all ! Look at my paper in the proceedings