Algorithms to Search Position Specific Scoring Matrices in Biosequences Cinzia Pizzi Dipartimento di Ingegneria dell’Informazione Università degli Studi di Padova Outline Weighted patterns in Biology The problem of profile matching The look-ahead method Suffix based Algorithms Aho-Corasick Extension (ACE) Look-ahead Filtration Algorithm (LFA) Superalphabet (NS) Some experimental results C.Pizzi, DEI – Univ. Of Padova (Italy) 2 What are Motifs? Motifs are biologically significant elements that are responsible for common structures or functions Motifs are statistically significant substrings in bio-sequences Assumption: if two entities share same function or structure, common overrepresented elements might be responsible for observed similarity C.Pizzi, DEI – Univ. Of Padova (Italy) 3 Motif Discovery Take set of co-expressed genes Compare their promoter regions Common over-represented substrings are good candidates for TFBS Need counted/expected frequency Promoters of co-expressed genes C.Pizzi, DEI – Univ. Of Padova (Italy) 4 Motif Discovery TFBS, DNA motifs Motifs = binding sites = substrings Promoters of co-expressed genes Intrinsic variability of biological sequences Mismatches, indels, wildcards, superalphabets... C.Pizzi, DEI – Univ. Of Padova (Italy) 5 Motif Representation Binding sites of the same factor are not exactly the same in all sequences ACATAC CCGAAT ATGCAT GCCTAC TCCAAA TTCGAA ACGGAC TCCTAT GCCCAC TCGGAA 1 2 3 4 5 6 A C G T Profile -> matrix representation C.Pizzi, DEI – Univ. Of Padova (Italy) Motif Representation Protein classification: each family is modeled by a matrix 1 2 3 4 5 6 ACDEHNPVAC CCDEGAMMAT ATHCATVVST 1 2 3 4 5 6 A A C D ... 1 2 3 4 5 6 A C D ... WVDEHNPVAC C.Pizzi, DEI – Univ. Of Padova (Italy) C D ... Profile Weighted pattern p oflength m defined over alphabet Σ |Σ| x m matrix defines scores 1 A 0.3 C 0.1 G 0.2 T 0.4 2 0.0 0.8 0.0 0.2 3 0.1 0.5 0.4 0.0 4 0.2 0.2 0.3 0.3 5 1.0 0.0 0.0 0.0 C.Pizzi, DEI – Univ. Of Padova (Italy) 6 0.3 0.4 0.0 0.3 Segment Score A C G T 1 0.3 0.1 0.2 0.4 2 0.0 0.8 0.0 0.2 s1 s2 S = s1 s2 … sm 3 0.1 0.5 0.4 0.0 4 0.2 0.2 0.3 0.3 s3 s4 5 1.0 0.0 0.0 0.0 s5 m 6 0.3 0.4 0.0 0.3 s6 Score M [ si , i ] i 1 C.Pizzi, DEI – Univ. Of Padova (Italy) Meaning of the score m f si , i m f f si , i P( S | M ) si , i i 1 Score M [ si , i ] ln ln ln m P( S | B) i 1 i 1 i 1 psi p si p s m m i 1 C.Pizzi, DEI – Univ. Of Padova (Italy) i 10 Segment Score Example A C G T 1 0.3 0.1 0.2 0.4 2 0.0 0.8 0.0 0.2 3 0.1 0.5 0.4 0.0 4 0.2 0.2 0.3 0.3 5 1.0 0.0 0.0 0.0 6 0.3 0.4 0.0 0.3 G T A C A C Score = 2.1 C.Pizzi, DEI – Univ. Of Padova (Italy) Profile Matching Problem Text T of length n defined over Σ Profile p (|Σ| x m) Score threshold th Score Si of the segment of length m starting at position i Find all positions i in T where Si ≥ th C.Pizzi, DEI – Univ. Of Padova (Italy) Example: th = 2 CGTACACTCGGTA Score = 0.6 Not a match! 1 2 3 4 6 A 0.3 0.0 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 C.Pizzi, DEI – Univ. Of Padova (Italy) 0.1 0.2 5 1.0 0.3 Example: th = 2 CGTACACTCGGTA Score = 2.1 Match at pos 2! 1 2 3 4 6 A 0.3 0.0 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 C.Pizzi, DEI – Univ. Of Padova (Italy) 0.1 0.2 5 1.0 0.3 Example: th = 2 CGTACACTCGGTA Score = 1.4 Not a match! 1 2 3 4 6 A 0.3 0.0 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 C.Pizzi, DEI – Univ. Of Padova (Italy) 0.1 0.2 5 1.0 0.3 Example: th = 2 CGTACACTCGGTA Score = 1.8 Not a match! 1 2 3 4 6 A 0.3 0.0 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 C.Pizzi, DEI – Univ. Of Padova (Italy) 0.1 0.2 5 1.0 0.3 Example: th = 2 CGTACACTCGGTA Score = 0.9 Not a match! 1 2 3 4 6 A 0.3 0.0 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 C.Pizzi, DEI – Univ. Of Padova (Italy) 0.1 0.2 5 1.0 0.3 Example: th = 2 CGTACACTCGGTA Score = 1.3 Not a match! 1 2 3 4 6 A 0.3 0.0 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 C.Pizzi, DEI – Univ. Of Padova (Italy) 0.1 0.2 5 1.0 0.3 Example: th = 2 CGTACACTCGGTA Score = 1.4 Not a match! 1 2 3 4 6 A 0.3 0.0 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 C.Pizzi, DEI – Univ. Of Padova (Italy) 0.1 0.2 5 1.0 0.3 Example: th = 2 CGTACACTCGGTA Score = 2.2 Match at pos 8! 1 2 3 4 6 A 0.3 0.0 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 C.Pizzi, DEI – Univ. Of Padova (Italy) 0.1 0.2 5 1.0 0.3 Scenarios of applications Online Algorithms (no indexing) Database of profile matrices (e.g. TRANSFAC, JASPAR for TFBS) Input sequence to be searched Offline algorithms (indexing) Sequence or set of sequences Input matrix to search for matches C.Pizzi, DEI – Univ. Of Padova (Italy) Summary of current methods Look-ahead method LA (Wu et al,00) Offline methods based on LA: Suffix-tree (Dorohonceanu et al, 00) Suffix-array (Beckstette et al, 04,06) Truncated Suffix Tree (Pizzi and Favaretto, 10) Online methods based on LA: Aho-Corasick,Filtering(Pizzi et al. 07,09) C.Pizzi, DEI – Univ. Of Padova (Italy) Summary of current methods Pattern Matching Shift-Add (Salmela e Tarhio, 08) KMP (Liefoghee et al, 09) Matrix partitioning (Liefhooghe et al.,06, Pizzi et al., 07, 09) FFT based (Rajasekaran et al., 02) Compression based(Freschi et al., 05) C.Pizzi, DEI – Univ. Of Padova (Italy) The look-ahead approach 1 2 3 4 5 6 A 0.3 0.0 0.1 0.2 1.0 0.3 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 3.0 2.2 1.7 1.4 0.4 -1.0 -0.2 0.3 0.6 1.6 2.0 max Pth C.Pizzi, DEI – Univ. Of Padova (Italy) m max[ i ] M [ si , i ] k i Pth [i ] th max[ i 1] The look-ahead approach 1 2 3 4 5 6 A 0.3 0.0 0.1 0.2 1.0 0.3 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 3.0 2.2 1.7 1.4 0.4 -1.0 -0.2 0.3 0.6 1.6 2.0 T A C A max Pth C 0.1 G C.Pizzi, DEI – Univ. Of Padova (Italy) The look-ahead approach 1 2 3 4 5 6 A 0.3 0.0 0.1 0.2 1.0 0.3 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 3.0 2.2 1.7 1.4 0.4 -1.0 -0.2 0.3 0.6 1.6 2.0 T A C A max Pth C 0.1 G 0.1 C.Pizzi, DEI – Univ. Of Padova (Italy) The look-ahead approach 1 2 3 4 5 6 A 0.3 0.0 0.1 0.2 1.0 0.3 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 3.0 2.2 1.7 1.4 0.4 -1.0 -0.2 0.3 0.6 1.6 2.0 T 0.1 A C A max Pth C 0.1 G 0.1 Don’t need to compare these ones! C.Pizzi, DEI – Univ. Of Padova (Italy) The suffix tree of T data structure suffix tree, Tree(T), is compacted trie that represents all the suffixes of string T linear size: |Tree(T)| = O(|T|) can be constructed in linear time O(|T|) C.Pizzi, DEI – Univ. Of Padova (Italy) Suffix trie and suffix tree abaab baab aab ab b Trie(abaab) a b a a b Tree(abaab) a b a a b baab a b C.Pizzi, DEI – Univ. Of Padova (Italy) ab baab Tree(T) is of linear size only the internal branching nodes and the leaves represented explicitly edges labeled by substrings of T v = node(α) if the path from root to v spells α one-to-one correspondence of leaves and suffixes |T| leaves, hence < |T| internal nodes C.Pizzi, DEI – Univ. Of Padova (Italy) 30 Tree(hattivatti) vatti hattivatti t i attivatti ttivatti tivatti hattivatti ti atti vatti i i vatti ivatti vatti atti tti ti vatti vatti tti atti hattivatti attivatti ttivatti i C.Pizzi, DEI – Univ. Of Padova (Italy) vatti ti tivatti ivatti Tree(hattivatti) vatti hattivatti 3,3 attivatti ttivatti tivatti hattivatti 2,5 4,5 ivatti vatti tti ti i vatti 6,10 6,10 atti 1 7 2 6 10 9 8 3 hattivatti C.Pizzi, DEI – Univ. Of Padova (Italy) 4 5 Tree(T) is full text index Tree(T) P occurs in T at locations 8, 31, … P 31 8 P occurs in T P is a prefix of some suffix of T Path for P exists in Tree(T) All occurrences of P in time O(|P| + #occ) C.Pizzi, DEI – Univ. Of Padova (Italy) LA over a Suffix Tree CG T TCC G Score(CG)=0.2 > -0.2 = Th(2) Score(CGT)=0.2 < 0.3 = Th(3) : Skip the subtree C.Pizzi, DEI – Univ. Of Padova (Italy) LA over a Suffix Tree CG T TCC G Score(TCC)=1.9 > 0.3 = Th(3) Score(TCCG)=2.2 > 2 = Th(6) : Match, all the subtree C.Pizzi, DEI – Univ. Of Padova (Italy) Suffix array: example hattivatti ε 11 attivatti atti 7 ttivatti attivatti 2 tivatti hattivatti 1 ivatti i 10 vatti ivatti 5 atti ti 9 tti tivatti 4 ti tti 8 i ttivatti 3 ε vatti 6 C.Pizzi, DEI – Univ. Of Padova (Italy) suffix array = lexicographic order of the suffixes Suffix array suffix array SA(T) = an array giving the lexicographic order of the suffixes of T practitioners like suffix arrays (simplicity, space efficiency) theoreticians like suffix trees (explicit structure) C.Pizzi, DEI – Univ. Of Padova (Italy) 37 LA over a Suffix Array In terms of suffix trees, skp[i] is the lexicographically next leaf that does not occur in the subtree below the branching node corresponding to the longest common prefix of Ssuf[i-1] and Ssuf[i]. skp[i] = min({n + 1} U [ j in [i + 1; n] | lcp[i] > lcp[j]) C.Pizzi, DEI – Univ. Of Padova (Italy) LA over Truncated ST Build TST with truncation factor h L = max length of a matrix in the DB if h=L, simply work as ST if h<L, filtering if a leaf is reached take corresponding positions (p1, p2, …, pt) For each pi check positions pi+j, h<j<=m with lookahead C.Pizzi, DEI – Univ. Of Padova (Italy) 39 LA over Truncated ST h L p1 p2 p3 p1 p1 + h L-h p2 p2 +h L-h C.Pizzi, DEI – Univ. Of Padova (Italy) p3 p3 +h L-h 40 Space OccupationTruST C.Pizzi, DEI – Univ. Of Padova (Italy) 41 Running Time TruST C.Pizzi, DEI – Univ. Of Padova (Italy) 42 Online Profile Matching Aho-Corasick Expansion (ACE) Lookahead Filtration Algorithm(LFA) Pattern matching + LA Score for fixed length prefix as a filter + LA Naive Superalphabet (NS) Encode k-mers in superalphabet symbol C.Pizzi, DEI – Univ. Of Padova (Italy) The Aho-Corasick Algorithm A trie for D = {he, she, his, hers} C.Pizzi, DEI – Univ. Of Padova (Italy) The Aho-Corasick algorithm Add failure links his -- she Time O(n+m) Space depends on D m = sum of word lengths C.Pizzi, DEI – Univ. Of Padova (Italy) The Fast Aho-Corasick h h,s h 0 r h e,i 1 e 2 s s i s Time O(n) Space depends on D and Σ e,i,r 6 3 s C.Pizzi, DEI – Univ. Of Padova (Italy) h 4 r s e 8 7 5 s 9 AC and profile matching Build AC automaton for all the words that are a match for the matrix LA partial threshold limits the number of words to those that actually match O(|D||Σ|m + m|Σ|) pre-processing |D|≤|Σ|m depends on matrix and threshold Search the text with AC automaton O(n) search C.Pizzi, DEI – Univ. Of Padova (Italy) AC-Extension by LA First position [C,0.1] 1 2 3 4 5 6 A 0.3 0.0 0.1 0.2 1.0 0.3 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 -1.0 -0.2 0.3 0.6 1.6 2.0 Pth C.Pizzi, DEI – Univ. Of Padova (Italy) [G,0.2] [T,0.4] [A,0.3] AC-Extension by LA Second position [C,0.1] 1 2 3 4 5 6 A 0.3 0.0 0.1 0.2 1.0 0.3 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 -1.0 -0.2 0.3 0.6 1.6 2.0 Pth [G,0.2] [T,0.4] [A,0.3] [A,0.1] [G,0.1] C.Pizzi, DEI – Univ. Of Padova (Italy) [T,0.3] [C,0.9] AC-Extension by LA Third position [C,0.1] 1 2 3 4 5 6 A 0.3 0.0 0.1 0.2 1.0 0.3 C 0.1 0.8 0.5 0.2 0.0 0.4 G 0.2 0.0 0.4 0.3 0.0 0.0 T 0.4 0.2 0.0 0.3 0.0 0.3 -1.0 -0.2 0.3 0.6 1.6 2.0 Pth [G,0.2] [T,0.4] [A,0.3] [A,0.1] [G,0.1] [G,0.5] [C,0.6] C.Pizzi, DEI – Univ. Of Padova (Italy) [T,0.3] [C,0.9] ACE Example 1 CGTACACTCGGTA t g c a t g a t c a c g c C.Pizzi, DEI – Univ. Of Padova (Italy) ACE Example 2 CGTACACTCGGTA t g c a t g a t c a c g c C.Pizzi, DEI – Univ. Of Padova (Italy) ACE Example 3 CGTACACTCGGTA t g c a t g a t c a c g c C.Pizzi, DEI – Univ. Of Padova (Italy) ACE Example 4 CGTACACTCGGTA t g c a t g a t c a c g c C.Pizzi, DEI – Univ. Of Padova (Italy) ACE Example 5 CGTACACTCGGTA t g c a t g a t c a c g c C.Pizzi, DEI – Univ. Of Padova (Italy) ACE Example 6 CGTACACTCGGTA t g c a t g a t c a c g c C.Pizzi, DEI – Univ. Of Padova (Italy) ACE Example 7 CGTACACTCGGTA Match at p-m+1 = 7-6+1=2 t g c a t g a t c a c g c C.Pizzi, DEI – Univ. Of Padova (Italy) Minimum Gain for ACE Dual Concept of look-ahead Compute for every prefix the minimum contribution of the remaining positions in the pattern If current_score(i) + min_gain(i) > Th Report a match Adv: in the automaton save a full subtree of height m-i C.Pizzi, DEI – Univ. Of Padova (Italy) Example: M0003, MSS=0.85 [G,18500] C.Pizzi, DEI – Univ. Of Padova (Italy) Example: M0003, MSS=0.85 [G,18500] [C,37000] C.Pizzi, DEI – Univ. Of Padova (Italy) Example: M0003, MSS=0.85 [G,18500] [C,37000] [C,55500] • GCC is sufficient to detect a match C.Pizzi, DEI – Univ. Of Padova (Italy) h=3 Example: M0003, MSS=0.85 [G,18500] [C,37000] h=3 [C,55500] • Save 5464 nodes out of 5468 C.Pizzi, DEI – Univ. Of Padova (Italy) mh i | A | i 1 Minimum Gain ACE C.Pizzi, DEI – Univ. Of Padova (Italy) Look-ahead Filtration Compute the scores for all words of fixed length k and store them O(|Σ|k) pre-processing Sliding window of size k When score ≥ Pth[k], check remaining symbols with LA (up to m-k) O(n + (m -k)r) search; k is the prefix length, r is avg number of full scoring C.Pizzi, DEI – Univ. Of Padova (Italy) Lookahaed Filtration Example |Σ|k entries K=3 SCORE AAA 0.4 ... ... ATT 0.5 CAA 0.2 ... ... CGT 0.1 CTT 0.3 GAA 0.3 ... ... GTA 0.5 ... ... GTT 0.4 TAA 0.5 ... ... TTT 0.6 Pth[3]=0.3 CGTACACTCGGTA Score(CGT) = 0.1 < Pth[3] Shift and concatenate to obtain the next 3-mer C.Pizzi, DEI – Univ. Of Padova (Italy) Filtered Lookahaed Example |Σ|k entries K=3 SCORE AAA 0.4 ... ... ATT 0.5 CAA 0.2 ... ... CGT 0.1 CTT 0.3 GAA 0.3 ... ... GTA 0.5 ... ... GTT 0.4 TAA 0.5 ... ... TTT 0.6 Pth[3]=0.3 CGTACACTCGGTA Score(GTA) = 0.5 > Pth[3] Check at most m-k remaining symbols Score(GTAC) = 0.7 > Pth[4] Score(GTACA) = 1.7 > Pth[5] Score(GTACAC) = 2.1 > th Match! C.Pizzi, DEI – Univ. Of Padova (Italy) More on ACE and LF It is possible to combine both methods Automaton build on qualifying prefixes only Multi-matrix version C.Pizzi, DEI – Univ. Of Padova (Italy) 67 Super-Alphabet Code words of length k to superalphabet symbols |Σ|k symbols are needed Code the matrix M into matrix M’ (|Σ|k x m/k) Run the naive algorithm on the sequence O(nm/k) C.Pizzi, DEI – Univ. Of Padova (Italy) SuperAlphabet Example K=2 SCORE 1-2 SCORE 3-4 SCORE 5-6 AA 0.3 0.3 1.3 AC 1.1 0.3 1.4 AG 0.3 0.4 1.0 AT 0.3 0.4 1.3 |Σ|k CA 0.1 0.7 0.3 entries CC 0.9 0.7 0.4 CG 0.1 0.8 0.0 CT 0.3 0.8 0.3 GA 0.2 0.6 0.3 GC 1.0 0.6 0.4 GG 0.2 0.7 0.0 GT 0.4 0.7 0.3 TA 0.4 0.2 0.3 TC 1.2 0.2 0.4 TG 0.4 0.3 0.0 TT 0.6 0.3 0.3 C.Pizzi, DEI – Univ. Of Padova (Italy) CGTACACTCGGTA Score = 0.6 < Th SuperAlphabet Example K=2 SCORE 1-2 SCORE 3-4 SCORE 5-6 AA 0.3 0.3 1.3 AC 1.1 0.3 1.4 AG 0.3 0.4 1.0 AT 0.3 0.4 1.3 |Σ|k CA 0.1 0.7 0.3 entries CC 0.9 0.7 0.4 CG 0.1 0.8 0.0 CT 0.3 0.8 0.3 GA 0.2 0.6 0.3 GC 1.0 0.6 0.4 GG 0.2 0.7 0.0 GT 0.4 0.7 0.3 TA 0.4 0.2 0.3 TC 1.2 0.2 0.4 TG 0.4 0.3 0.0 TT 0.6 0.3 0.3 C.Pizzi, DEI – Univ. Of Padova (Italy) CGTACACTCGGTA Score = 2.1 match! Experiments Jaspar Database: 123 TFBS matrices (DNA), PRINTS database (proteins) Test sequence about 50M bases P-value defines threshold 3 GHz Intel Pentium IV processor with 2 gigabytes of main memory, running under Linux. C.Pizzi, DEI – Univ. Of Padova (Italy) DNA – avg running times per matrix C.Pizzi, DEI – Univ. Of Padova (Italy) 72 DNA- matrix length C.Pizzi, DEI – Univ. Of Padova (Italy) 73 DNA – window width C.Pizzi, DEI – Univ. Of Padova (Italy) 74 Proteins, avg time per matrix C.Pizzi, DEI – Univ. Of Padova (Italy) 75 Proteins - matrix length C.Pizzi, DEI – Univ. Of Padova (Italy) 76 MOODS – Motif Occurrence Detection Suite C.Pizzi, DEI – Univ. Of Padova (Italy) 77 Conclusions Searching matrix is a core step for many bioinformatics applications (searching, discovery, classification…) Several approaches have been developed in recent years Online methods based on filtering are currently the most efficient C.Pizzi, DEI – Univ. Of Padova (Italy) 78 References C.Pizzi, P.Rastas, E.Ukkonen Fast Search Algorithms for Position Specific Scoring Matrices In Proc. of the 1st Conference on Bioinformatics Research and Development (BIRD 07), Berlin, Germany, March 2007, LNCS/LCBI 4414 pp 239--250 C.Pizzi, E.Ukkonen Fast Profile Matching Algorithms - a survey Theoretical Computer Science, 395(2-3), 2008, pp 137--157, Special Issue SAIL: String Algorithms, Information and Learning C.Pizzi, P.Rastas, E.Ukkonen Finding significant matches of position weight matrices in linear time Accepted for publication by IEEE Transaction on Computational Biology and Bioinformatics, 2009 J.Korhonen, P.Martinmaki, C.Pizzi, P.Rastas, E.Ukkonen MOODS: fast search for position weight matrix matches in DNA sequences Bioinformatics 2009 25(23):3181-3182 C.Pizzi, DEI – Univ. Of Padova (Italy) 79 Thanks C.Pizzi, DEI – Univ. Of Padova (Italy) 80 Acknowledgements Esko Ukkonen, Pasi Rastas, Janne Korhonen, P.Martinmaki Academy of Finland grant “From Data to knowledge” EU Project “Regulatory Networks” Premio di Ricerca `Avere Trent’Anni’ Univ.Padova, Parco Scientifico Galileo, Il Mattino, Giovani Confindustria, Scuola Galileiana di Studi Superiori C.Pizzi, DEI – Univ. Of Padova (Italy) Length 100 • 13 patterns obtained by concateneting Jaspar matrices •MSS: Matrix Similarity Score (% of maximal score) NA = Naïve Algorithm LSA = Look-ahead Search Algorithm LFA = Look-ahead Filter Algorithm (k=7) NS = Naïve Superalphabet (k=7) C.Pizzi, DEI – Univ. Of Padova (Italy) Multiple Matrices Search C.Pizzi, DEI – Univ. Of Padova (Italy) Running Time per matrix C.Pizzi, DEI – Univ. Of Padova (Italy) Length 0 to 15 (108 matrices) NA = Naïve Algorithm LSA = Look-ahead Search Algorithm ACE = Aho-Corasick Expansion LFA = Look-ahead Filtration Algorithm (k=7) NS = Naïve Super-alphabet (k=7) C.Pizzi, DEI – Univ. Of Padova (Italy) Running Time per matrix C.Pizzi, DEI – Univ. Of Padova (Italy) Length 16 to 30 (15 matrices) NA = Naïve Algorithm LSA = Look-ahead Search Algorithm LFA = Look-ahead Filtration Algorithm NS = Naïve Super-alphabet C.Pizzi, DEI – Univ. Of Padova (Italy) Length 100 • 13 patterns obtained by concateneting Jaspar matrices P=10-5 P=10-4 P=10-3 P=10-2 NA 10.234 10.244 10.434 11.080 LSA 11.835 12.675 13.335 15.118 LFA 9.955 10.347 11.096 12.965 NS 3.576 3.677 4.593 9.918 NA = Naïve Algorithm LSA = Look-ahead Search Algorithm LFA = Look-ahead Filter Algorithm (k=7) NS = Naïve Superalphabet (k=7) C.Pizzi, DEI – Univ. Of Padova (Italy) Motif Representation Istances of a biological signal are different ACATAC CCGAAT ATGCAT GCCTAC TCCAAA TTCGAA ACGGAC TCCTAT GCCCAC TCGGAA TCC(G|T)AC Consensus -> pattern representation 1 2 3 4 5 6 A C G T Profile -> matrix representation C.Pizzi, DEI – Univ. Of Padova (Italy)