The Future of Web Search Barcelona, May 2006 XML Compression and Indexing Paolo Ferragina Dipartimento di Informatica, Università di Pisa [Joint with F. Luccio, G. Manzini, S. Muthukrishnan] Paolo Ferragina, Università di Pisa Under patenting by Pisa-Rutgers Univ. Compressed Permuterm Index Paolo Ferragina, Rossano Venturini Dipartimento di Informatica, Università di Pisa Paolo Ferragina, Università di Pisa Under Y!-patenting A basic problem Given a dictionary D of strings, having variable length, design a compressed data structure that supports 1) string id 2) Prefix(a): find all strings in D that are prefixed by a 3) Suffix(b): find all strings in D that are suffixed by b 4) Substring(g): find all strings in D that contain g 5) PrefixSuffix(a,b) = Prefix(a) Suffix(b) IR book of Manning-Raghavan-Schutze Tolerant Retrieval Problem (wildcards) Paolo Ferragina, Università di Pisa Prefix(a) = a* Suffix(b) = *b Substring(g) = *g* PrefixSuffix(a,b) = a*b A basic problem Given a dictionary D of strings, having variable length, design a compressed data structure that supports 1) string id 2) Prefix(a): find all s in D that are prefixed by a 3) Suffix(b): find all s in D that are suffixed by b 4) Substring(g): find all s in D that contain g 5) PrefixSuffix(a,b) = Prefix(a) Suffix(b) Hashing Not exact searches Paolo Ferragina, Università di Pisa A basic problem Given a dictionary D of strings, having variable length, design a compressed data structure that supports 1) string id 2) Prefix(a): find all s in D that are prefixed by a 3) Suffix(b): find all s in D that are suffixed by b 4) Substring(g): find all s in D that contain g 5) PrefixSuffix(a,b) = Prefix(a) Suffix(b) (Compacted) Trie Two versions: for D and for DR + Intersect answers No substring search (unless using Suffix Trie) Need to store D for resolving edge-labels Paolo Ferragina, Università di Pisa A basic problem Given a dictionary D of strings, having variable length, design a compressed data structure that supports 1) string id 2) Prefix(a): find all s in D that are prefixed by a 3) Suffix(b): find all s in D that are suffixed by b 4) Substring(g): find all s in D that contain g 5) PrefixSuffix(a,b) = Prefix(a) Suffix(b) Front coding... Paolo Ferragina, Università di Pisa Front-coding uk-2002 crawl ≈250Mb http://checkmate.com/All_Natural/ http://checkmate.com/All_Natural/Applied.html http://checkmate.com/All_Natural/Aroma.html http://checkmate.com/All_Natural/Aroma1.html http://checkmate.com/All_Natural/Aromatic_Art.html http://checkmate.com/All_Natural/Ayate.html http://checkmate.com/All_Natural/Ayer_Soap.html http://checkmate.com/All_Natural/Ayurvedic_Soap.html http://checkmate.com/All_Natural/Bath_Salt_Bulk.html http://checkmate.com/All_Natural/Bath_Salts.html http://checkmate.com/All/Essence_Oils.html http://checkmate.com/All/Mineral_Bath_Crystals.html http://checkmate.com/All/Mineral_Bath_Salt.html http://checkmate.com/All/Mineral_Cream.html 3035% 0 33 34 38 38 34 35 35 33 42 25 25 38 33 http://checkmate.com/All_Natural/ Applied.html roma.html 1.html tic_Art.html yate.html er_Soap.html urvedic_Soap.html Bath_Salt_Bulk.html s.html Essence_Oils.html Mineral_Bath_Crystals.html Salt.html Cream.html 0 http://checkmate.com/All/Natural/Washcloth.html... http://checkmate.com/All/Natural/Washcloth.html ... bzip ≈ 10% Be back on this, later on! Two versions: for D and for DR + Intersect answers Need some extra data structures for bucket identification No substring search Paolo Ferragina, Università di Pisa A basic problem Given a dictionary D of strings, having variable length, compress them in a way that we can efficiently support 1) string id 2) Prefix(a): find all s in D that are prefixed by a 3) Suffix(b): find all s in D that are suffixed by b 4) Substring(g): find all s in D that contain by g 5) PrefixSuffix(a,b) = Prefix(a) Suffix(b) Permuterm Index (Garfield, 76) Reduce any query to a “prefix query” over a larger dictionary Paolo Ferragina, Università di Pisa Premuterm Index [Garfield, 1976] Take a dictionary D={yahoo,google} 1. Append a special char $ to the end of each string 2. Generate all rotations of these strings Permuterm Dictionary yahoo$ ahoo$y hoo$ya oo$yah o$yaho $yahoo google$ oogle$g ogle$go gle$goo le$goog e$googl $google Paolo Ferragina, Università di Pisa Prefix(ya) = Prefix($ya) Suffix(oo) = Prefix(oo$) Substring(oo) = Prefix(oo) PrefixSuffix(y,o)= Prefix(o$y) Space problems Any query on D reduces to a prefix-query on P[D] Compressed Permuterm Index SIGIR ‘07 It deploys two ingredients: Permuterm index Compressed full-text index Theoretically: Query ops take optimal time: proportional to pattern length Space occupancy is |D| Hk(D) + o(|D| log |S|) bits Technically: A simple reduction step: Permuterm Compressed index Re-use known machinery on compressed indexes Achieve bzip-compression at Front-coding speed Paolo Ferragina, Università di Pisa The Burrows-Wheeler Transform (1994) Take the text T = mississippi# mississippi# ississippi#m ssissippi#mi sissippi#mis issippi#miss ssippi#missi sippi#missis ippi#mississ ppi#mississi pi#mississip i#mississipp #mississippi Paolo Ferragina, Università di Pisa Sort the rows F #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m L i p s s m # p i s s i i T Compressing L is effective Key observation: L is locally homogeneous L is highly compressible Bzip vs. Gzip: 20% vs. 33%, but it is slower in (de)compression ! Paolo Ferragina, Università di Pisa The FM-index [Ferragina-Manzini, JACM ‘05] Survey of Navarro-Makinen contains many other indexes The result: Count(P): O(p) time Locate(P): O(occ * polylog(|T|)) time Display( T[i,i+L] ): O( L + polylog(|T|) ) time Space occupancy: |T| Hk(T) + o(|T| log |S|) bits New concept: The FM-index is an opportunistic data structure The main idea is to reduce substring search to some basic operations over arrays of symbols Paolo Ferragina, Università di Pisa Compressed Permuterm index builds upon the best two features of the FM-index First ingredient: L F # i i i i m p p s s s s unknown mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missi issippi#mi sippi#miss sissippi#m Paolo Ferragina, Università di Pisa L i p s s m # p i s s i i F mapping How do we map L’s onto F’s chars ? ... Need to distinguish equal chars in F... Take two equal L’s chars Rotate rightward their rows Same relative order !! First ingredient: L 1 2 6 7 9 F # i i i i m p p s s s s unknown mississipp #mississip ppi#missis ssippi#mis ssissippi# ississippi i#mississi pi#mississ ippi#missi issippi#mi sippi#miss sissippi#m L i p s s m # p i s s i i F mapping The oracle Rank( s , 9 ) = 3 FM-index is actually Rank ds over BWT O(1) time and Hk-space Paolo Ferragina, Università di Pisa Second ingredient: Backward step F # i i i i m p p s s s s unknown mississipp #mississip ppi#missis ssippi#mis ssissippi# LF ississippi i#mississi LF pi#mississ ippi#missi issippi#mi sippi#miss sissippi#m Paolo Ferragina, Università di Pisa L i p s s m # p i s s i i T scanned backward by using LF-mapping ...s s i... Backward step(i): Return LF[i], in O(1) time Third ingredient: substring search P = si unknown occ=2 [lr-fr+1] fr lr Paolo Ferragina, Università di Pisa #mississipp i#mississip ippi#missis issippi#mis ississippi# mississippi pi#mississi ppi#mississ sippi#missi sissippi#mi ssippi#miss ssissippi#m L i p s s m # p i s s i i Count(P[1,p]): Finds <fr,lr> in O(p) time The Comprressed Permuterm Lexicographically sorted Z = $hat$hip$hop$hot$# Build FM-index to support substring searches Some queries are trivial... Prefix(a) = Substring search($a) within Z Suffix(b) = Substring search(b$) within Z Substr(g) = Substring search(g) within Z Paolo Ferragina, Università di Pisa PrefixSuffix search unknown i=3 Key property: Last char of si is at L[i+1] Cyclic-LF[i] If (i > #D) return LF[i] else return LF[i+1] LF[3] CLF[3] Paolo Ferragina, Università di Pisa PrefixSuffix(ho,p) unknown $ho LF CLF PrefixSuffix(P): Search FM-index of Z using Cyclic-LF instead of LF Paolo Ferragina, Università di Pisa No change in time/space bounds of compressed indexes Rank and Select of strings unknown Z = $hat$hip$hop$hot$# Other queries... Rank(s) = row of $s$ Select(i)= backw from L[i+1] Paolo Ferragina, Università di Pisa Experiments Three dictionaries: Term dictionary: Trec WT10G Host dictionary (reversed): UK-2005 Url dictionary (host reversed): first 190Mb of UK-2005 Term Host Url size 118 Mb 34 Mb 190 Mb # strings 10 Mil 2 Mil 3 Mil FC 40% 45% 30% bzip 33% 25% 10% Paolo Ferragina, Università di Pisa PrefixSuffix search needs *2 Paolo Ferragina, Università di Pisa A test on URLs MRS book says: “one Choose your trade-off % dict-size disadvantage of the PI is that its dictionary becomes quite large, including as it does all rotations of each term”. Now, they mention CPI msec/char, and space close to bzip • Time close to Front-Coding (4 msec/char), but <50% of its space • Time of 2060 Paolo Ferragina, Università di Pisa We proposed an approach for dictionary storage: + Theory: optimal time and entropy-bounds for space + Practice: trades time vs space, thus fitting user needs Paolo Ferragina, Università di Pisa