Compressed Permuterm Index Paolo Ferragina Dipartimento di Informatica, Università di Pisa Paolo Ferragina, Università di Pisa A basic problem Given a dictionary D of strings, having variable length, design a compressed data structure that supports 1) string id 2) Prefix(a): find all s in D that are prefixed by a 3) Suffix(b): find all s in D that are suffixed by b 4) Substring(g): find all s in D that contain g 5) PrefixSuffix(a,b) = Prefix(a) Suffix(b) (Compacted) Trie Two versions: for D and for DR + Intersect answers Need to store D for resolving edge-labels Paolo Ferragina, Università di Pisa A basic problem Given a dictionary D of strings, having variable length, compress them in a way that we can efficiently support 1) string id 2) Prefix(a): find all s in D that are prefixed by a 3) Suffix(b): find all s in D that are suffixed by b 4) Substring(g): find all s in D that contain by g 5) PrefixSuffix(a,b) = Prefix(a) Suffix(b) Permuterm Index (Garfield, 76) Reduce any query to a “prefix query” over a larger dictionary Paolo Ferragina, Università di Pisa Permuterm Index [Garfield, 1976] Take a dictionary D={yahoo,google} 1. Append a special char $ to the end of each string 2. Generate all rotations of these strings Permuterm Dictionary yahoo$ ahoo$y hoo$ya oo$yah o$yaho $yahoo google$ oogle$g ogle$go gle$goo le$goog e$googl $google Paolo Ferragina, Università di Pisa Prefix(ya) = Prefix($ya) Suffix(oo) = Prefix(oo$) Substring(oo) = Prefix(oo) PrefixSuffix(y,o)= Prefix(o$y) Space problems Any query on D reduces to a prefix-query on P[D] The FM-index [Ferragina-Manzini, JACM ‘05] The result: Count(P): O(p) time Locate(P): O(occ * polylog(|T|)) time Display( T[i,i+L] ): O( L + polylog(|T|) ) time Space occupancy: |T| Hk(T) + o(|T| log |S|) bits New concept: The FM-index is an opportunistic data structure The main idea is to reduce substring search to some basic operations over arrays of symbols Paolo Ferragina, Università di Pisa Compressed Permuterm index builds upon the best two features of the FM-index Third ingredient: FM-index substring search P = si unknown occ=2 [lr-fr+1] fr lr Paolo Ferragina, Università di Pisa #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m L i p s s m # p i s s i i Count(P[1,p]): Finds <fr,lr> in O(p) time Compressed Permuterm Index Lexicographically sorted Z = $hat$hip$hop$hot$# Build FM-index to support substring searches Some queries are trivial... Prefix(a) = Substring search($a) within Z Suffix(b) = Substring search(b$) within Z Substr(g) = Substring search(g) within Z Paolo Ferragina, Università di Pisa PrefixSuffix search unknown i=2 Key property: Last char of si is at L[i+1] Cyclic-LF[i] If (i > #D) return LF[i] else return LF[i+1] CLF[2] LF[2] Paolo Ferragina, Università di Pisa PrefixSuffix(ho,p) unknown $ho LF CLF PrefixSuffix(P): Search FM-index of Z using Cyclic-LF instead of LF Paolo Ferragina, Università di Pisa No change in time/space bounds of compressed indexes Rank and Select of strings unknown Z = $hat$hip$hop$hot$# Other queries... Rank(s) = row of $s$ Select(i)= backw from L[i+1] Paolo Ferragina, Università di Pisa A test on URLs Choose your trade-off % dict-size msec/char, and space close to bzip • Time close to Front-Coding (4 msec/char), but <50% of its space • Time of 2060 Paolo Ferragina, Università di Pisa