Algorithms to Search Position Specific
Scoring Matrices in Biosequences
Cinzia Pizzi
Dipartimento di Ingegneria dell’Informazione
Università degli Studi di Padova
Outline



Weighted patterns in Biology
The problem of profile matching
The look-ahead method





Suffix based Algorithms
Aho-Corasick Extension (ACE)
Look-ahead Filtration Algorithm (LFA)
Superalphabet (NS)
Some experimental results
C.Pizzi, DEI – Univ. Of Padova (Italy)
2
What are Motifs?



Motifs are biologically significant elements
that are responsible for common
structures or functions
Motifs are statistically significant
substrings in bio-sequences
Assumption: if two entities share same
function or structure, common overrepresented elements might be responsible
for observed similarity
C.Pizzi, DEI – Univ. Of Padova (Italy)
3
Motif Discovery




Take set of co-expressed genes
Compare their promoter regions
Common over-represented substrings
are good candidates for TFBS
Need counted/expected frequency
Promoters of
co-expressed
genes
C.Pizzi, DEI – Univ. Of Padova (Italy)
4
Motif Discovery


TFBS, DNA motifs
Motifs = binding sites = substrings
Promoters of
co-expressed
genes

Intrinsic variability of biological sequences

Mismatches, indels, wildcards, superalphabets...
C.Pizzi, DEI – Univ. Of Padova (Italy)
5
Motif Representation

Binding sites of the same factor are
not exactly the same in all sequences
ACATAC
CCGAAT
ATGCAT
GCCTAC
TCCAAA
TTCGAA
ACGGAC
TCCTAT
GCCCAC
TCGGAA
1 2 3 4 5 6
A
C
G
T
Profile
-> matrix representation
C.Pizzi, DEI – Univ. Of Padova (Italy)
Motif Representation

Protein classification: each family is
modeled by a matrix
1 2 3 4 5 6
ACDEHNPVAC
CCDEGAMMAT
ATHCATVVST
1 2 3 4 5 6
A
A
C
D
...
1 2 3 4 5 6
A
C
D
...
WVDEHNPVAC
C.Pizzi, DEI – Univ. Of Padova (Italy)
C
D
...
Profile


Weighted pattern p oflength m
defined over alphabet Σ
|Σ| x m matrix defines scores
1
A 0.3
C 0.1
G 0.2
T 0.4
2
0.0
0.8
0.0
0.2
3
0.1
0.5
0.4
0.0
4
0.2
0.2
0.3
0.3
5
1.0
0.0
0.0
0.0
C.Pizzi, DEI – Univ. Of Padova (Italy)
6
0.3
0.4
0.0
0.3
Segment Score
A
C
G
T
1
0.3
0.1
0.2
0.4
2
0.0
0.8
0.0
0.2
s1
s2
S = s1 s2 … sm
3
0.1
0.5
0.4
0.0
4
0.2
0.2
0.3
0.3
s3
s4
5
1.0
0.0
0.0
0.0
s5
m
6
0.3
0.4
0.0
0.3
s6
Score   M [ si , i ]
i 1
C.Pizzi, DEI – Univ. Of Padova (Italy)
Meaning of the score
m
f si , i

m  f
 f si , i 

P( S | M )
si , i
i 1




Score   M [ si , i ]   ln
 ln 
 ln m





P( S | B)
i 1
i 1
i 1  psi 
 p si 
p
 s
m
m
i 1
C.Pizzi, DEI – Univ. Of Padova (Italy)
i
10
Segment Score Example
A
C
G
T
1
0.3
0.1
0.2
0.4
2
0.0
0.8
0.0
0.2
3
0.1
0.5
0.4
0.0
4
0.2
0.2
0.3
0.3
5
1.0
0.0
0.0
0.0
6
0.3
0.4
0.0
0.3
G
T
A
C
A
C
Score = 2.1
C.Pizzi, DEI – Univ. Of Padova (Italy)
Profile Matching Problem





Text T of length n defined over Σ
Profile p (|Σ| x m)
Score threshold th
Score Si of the segment of length m
starting at position i
Find all positions i in T where Si ≥ th
C.Pizzi, DEI – Univ. Of Padova (Italy)
Example: th = 2
CGTACACTCGGTA
Score = 0.6
Not a match!
1
2
3
4
6
A
0.3 0.0
C
0.1 0.8 0.5 0.2 0.0 0.4
G
0.2 0.0 0.4 0.3 0.0 0.0
T
0.4 0.2 0.0 0.3 0.0 0.3
C.Pizzi, DEI – Univ. Of Padova (Italy)
0.1 0.2
5
1.0 0.3
Example: th = 2
CGTACACTCGGTA
Score = 2.1
Match at pos 2!
1
2
3
4
6
A
0.3 0.0
C
0.1 0.8 0.5 0.2 0.0 0.4
G
0.2 0.0 0.4 0.3 0.0 0.0
T
0.4 0.2 0.0 0.3 0.0 0.3
C.Pizzi, DEI – Univ. Of Padova (Italy)
0.1 0.2
5
1.0 0.3
Example: th = 2
CGTACACTCGGTA
Score = 1.4
Not a match!
1
2
3
4
6
A
0.3 0.0
C
0.1 0.8 0.5 0.2 0.0 0.4
G
0.2 0.0 0.4 0.3 0.0 0.0
T
0.4 0.2 0.0 0.3 0.0 0.3
C.Pizzi, DEI – Univ. Of Padova (Italy)
0.1 0.2
5
1.0 0.3
Example: th = 2
CGTACACTCGGTA
Score = 1.8
Not a match!
1
2
3
4
6
A
0.3 0.0
C
0.1 0.8 0.5 0.2 0.0 0.4
G
0.2 0.0 0.4 0.3 0.0 0.0
T
0.4 0.2 0.0 0.3 0.0 0.3
C.Pizzi, DEI – Univ. Of Padova (Italy)
0.1 0.2
5
1.0 0.3
Example: th = 2
CGTACACTCGGTA
Score = 0.9
Not a match!
1
2
3
4
6
A
0.3 0.0
C
0.1 0.8 0.5 0.2 0.0 0.4
G
0.2 0.0 0.4 0.3 0.0 0.0
T
0.4 0.2 0.0 0.3 0.0 0.3
C.Pizzi, DEI – Univ. Of Padova (Italy)
0.1 0.2
5
1.0 0.3
Example: th = 2
CGTACACTCGGTA
Score = 1.3
Not a match!
1
2
3
4
6
A
0.3 0.0
C
0.1 0.8 0.5 0.2 0.0 0.4
G
0.2 0.0 0.4 0.3 0.0 0.0
T
0.4 0.2 0.0 0.3 0.0 0.3
C.Pizzi, DEI – Univ. Of Padova (Italy)
0.1 0.2
5
1.0 0.3
Example: th = 2
CGTACACTCGGTA
Score = 1.4
Not a match!
1
2
3
4
6
A
0.3 0.0
C
0.1 0.8 0.5 0.2 0.0 0.4
G
0.2 0.0 0.4 0.3 0.0 0.0
T
0.4 0.2 0.0 0.3 0.0 0.3
C.Pizzi, DEI – Univ. Of Padova (Italy)
0.1 0.2
5
1.0 0.3
Example: th = 2
CGTACACTCGGTA
Score = 2.2
Match at pos 8!
1
2
3
4
6
A
0.3 0.0
C
0.1 0.8 0.5 0.2 0.0 0.4
G
0.2 0.0 0.4 0.3 0.0 0.0
T
0.4 0.2 0.0 0.3 0.0 0.3
C.Pizzi, DEI – Univ. Of Padova (Italy)
0.1 0.2
5
1.0 0.3
Scenarios of applications

Online Algorithms (no indexing)



Database of profile matrices (e.g.
TRANSFAC, JASPAR for TFBS)
Input sequence to be searched
Offline algorithms (indexing)


Sequence or set of sequences
Input matrix to search for matches
C.Pizzi, DEI – Univ. Of Padova (Italy)
Summary of current methods


Look-ahead method LA (Wu et al,00)
Offline methods based on LA:




Suffix-tree (Dorohonceanu et al, 00)
Suffix-array (Beckstette et al, 04,06)
Truncated Suffix Tree (Pizzi and
Favaretto, 10)
Online methods based on LA:

Aho-Corasick,Filtering(Pizzi et al. 07,09)
C.Pizzi, DEI – Univ. Of Padova (Italy)
Summary of current methods

Pattern Matching





Shift-Add (Salmela e Tarhio, 08)
KMP (Liefoghee et al, 09)
Matrix partitioning (Liefhooghe et al.,06,
Pizzi et al., 07, 09)
FFT based (Rajasekaran et al., 02)
Compression based(Freschi et al., 05)
C.Pizzi, DEI – Univ. Of Padova (Italy)
The look-ahead approach
1
2
3
4
5
6
A
0.3
0.0
0.1
0.2
1.0
0.3
C
0.1
0.8
0.5
0.2
0.0
0.4
G
0.2
0.0
0.4
0.3
0.0
0.0
T
0.4
0.2
0.0
0.3
0.0
0.3
3.0
2.2
1.7
1.4
0.4
-1.0 -0.2
0.3
0.6
1.6
2.0
max
Pth
C.Pizzi, DEI – Univ. Of Padova (Italy)
m
max[ i ]   M [ si , i ]
k i
Pth [i ]  th  max[ i  1]
The look-ahead approach
1
2
3
4
5
6
A
0.3
0.0
0.1
0.2
1.0
0.3
C
0.1
0.8
0.5
0.2
0.0
0.4
G
0.2
0.0
0.4
0.3
0.0
0.0
T
0.4
0.2
0.0
0.3
0.0
0.3
3.0
2.2
1.7
1.4
0.4
-1.0 -0.2
0.3
0.6
1.6
2.0
T
A
C
A
max
Pth
C
0.1
G
C.Pizzi, DEI – Univ. Of Padova (Italy)
The look-ahead approach
1
2
3
4
5
6
A
0.3
0.0
0.1
0.2
1.0
0.3
C
0.1
0.8
0.5
0.2
0.0
0.4
G
0.2
0.0
0.4
0.3
0.0
0.0
T
0.4
0.2
0.0
0.3
0.0
0.3
3.0
2.2
1.7
1.4
0.4
-1.0 -0.2
0.3
0.6
1.6
2.0
T
A
C
A
max
Pth
C
0.1
G
0.1
C.Pizzi, DEI – Univ. Of Padova (Italy)
The look-ahead approach
1
2
3
4
5
6
A
0.3
0.0
0.1
0.2
1.0
0.3
C
0.1
0.8
0.5
0.2
0.0
0.4
G
0.2
0.0
0.4
0.3
0.0
0.0
T
0.4
0.2
0.0
0.3
0.0
0.3
3.0
2.2
1.7
1.4
0.4
-1.0 -0.2
0.3
0.6
1.6
2.0
T
0.1
A
C
A
max
Pth
C
0.1
G
0.1
Don’t need to compare these ones!
C.Pizzi, DEI – Univ. Of Padova (Italy)
The suffix tree of T



data structure suffix tree, Tree(T),
is compacted trie that represents all
the suffixes of string T
linear size: |Tree(T)| = O(|T|)
can be constructed in linear time
O(|T|)
C.Pizzi, DEI – Univ. Of Padova (Italy)
Suffix trie and suffix tree
abaab
baab
aab
ab
b
Trie(abaab)
a
b
a
a
b
Tree(abaab)
a
b
a
a
b
baab
a
b
C.Pizzi, DEI – Univ. Of Padova (Italy)
ab
baab
Tree(T) is of linear size





only the internal branching nodes and
the leaves represented explicitly
edges labeled by substrings of T
v = node(α) if the path from root to v
spells α
one-to-one correspondence of leaves
and suffixes
|T| leaves, hence < |T| internal nodes
C.Pizzi, DEI – Univ. Of Padova (Italy)
30
Tree(hattivatti)
vatti
hattivatti
t i
attivatti
ttivatti
tivatti
hattivatti
ti
atti
vatti
i
i
vatti
ivatti
vatti
atti
tti
ti
vatti
vatti
tti
atti
hattivatti
attivatti
ttivatti
i
C.Pizzi, DEI – Univ. Of Padova (Italy)
vatti
ti
tivatti
ivatti
Tree(hattivatti)
vatti
hattivatti
3,3
attivatti
ttivatti
tivatti
hattivatti
2,5 4,5
ivatti
vatti
tti
ti
i
vatti
6,10
6,10
atti
1
7
2
6
10
9
8
3
hattivatti
C.Pizzi, DEI – Univ. Of Padova (Italy)
4
5
Tree(T) is full text index
Tree(T)
P occurs in T at
locations 8, 31, …
P
31
8
P occurs in T  P is a prefix of some suffix of
T  Path for P exists in Tree(T)
All occurrences of P in time O(|P| + #occ)
C.Pizzi, DEI – Univ. Of Padova (Italy)
LA over a Suffix Tree
CG
T
TCC
G
Score(CG)=0.2 > -0.2 = Th(2)
Score(CGT)=0.2 < 0.3 = Th(3) : Skip the subtree
C.Pizzi, DEI – Univ. Of Padova (Italy)
LA over a Suffix Tree
CG
T
TCC
G
Score(TCC)=1.9 > 0.3 = Th(3)
Score(TCCG)=2.2 > 2 = Th(6) : Match, all the subtree
C.Pizzi, DEI – Univ. Of Padova (Italy)
Suffix array: example
hattivatti
ε
11
attivatti
atti
7
ttivatti
attivatti
2
tivatti
hattivatti
1
ivatti
i
10
vatti
ivatti
5
atti
ti
9
tti
tivatti
4
ti
tti
8
i
ttivatti
3
ε
vatti
6
C.Pizzi, DEI – Univ. Of Padova (Italy)

suffix array =
lexicographic
order of the
suffixes
Suffix array



suffix array SA(T) = an array giving
the lexicographic order of the
suffixes of T
practitioners like suffix arrays
(simplicity, space efficiency)
theoreticians like suffix trees
(explicit structure)
C.Pizzi, DEI – Univ. Of Padova (Italy)
37
LA over a Suffix Array
In terms of suffix trees, skp[i] is the lexicographically next leaf that does not
occur in the subtree below the branching node corresponding to the longest
common prefix of Ssuf[i-1] and Ssuf[i].
skp[i] = min({n + 1} U [ j in [i + 1; n] | lcp[i] > lcp[j])
C.Pizzi, DEI – Univ. Of Padova (Italy)
LA over Truncated ST




Build TST with truncation factor h
L = max length of a matrix in the DB
if h=L, simply work as ST
if h<L, filtering


if a leaf is reached take corresponding
positions (p1, p2, …, pt)
For each pi check positions pi+j, h<j<=m
with lookahead
C.Pizzi, DEI – Univ. Of Padova (Italy)
39
LA over Truncated ST
h
L
p1 p2 p3
p1
p1 + h
L-h
p2
p2 +h
L-h
C.Pizzi, DEI – Univ. Of Padova (Italy)
p3
p3 +h
L-h
40
Space OccupationTruST
C.Pizzi, DEI – Univ. Of Padova (Italy)
41
Running Time TruST
C.Pizzi, DEI – Univ. Of Padova (Italy)
42
Online Profile Matching

Aho-Corasick Expansion (ACE)


Lookahead Filtration Algorithm(LFA)


Pattern matching + LA
Score for fixed length prefix as a filter
+ LA
Naive Superalphabet (NS)

Encode k-mers in superalphabet symbol
C.Pizzi, DEI – Univ. Of Padova (Italy)
The Aho-Corasick Algorithm

A trie for D = {he, she, his, hers}
C.Pizzi, DEI – Univ. Of Padova (Italy)
The Aho-Corasick algorithm

Add failure links

his -- she
Time O(n+m)
Space depends on D
m = sum of word lengths
C.Pizzi, DEI – Univ. Of Padova (Italy)
The Fast Aho-Corasick
h
h,s
h
0
r
h e,i
1 e 2
s
s i
s
Time O(n)
Space depends on D and Σ
e,i,r
6
3
s
C.Pizzi, DEI – Univ. Of Padova (Italy)
h
4
r
s
e
8
7
5
s
9
AC and profile matching

Build AC automaton for all the words that
are a match for the matrix




LA partial threshold limits the number of words
to those that actually match
O(|D||Σ|m + m|Σ|) pre-processing
|D|≤|Σ|m depends on matrix and threshold
Search the text with AC automaton

O(n) search
C.Pizzi, DEI – Univ. Of Padova (Italy)
AC-Extension by LA

First position
[C,0.1]
1
2
3
4
5
6
A
0.3
0.0
0.1
0.2
1.0
0.3
C
0.1
0.8
0.5
0.2
0.0
0.4
G
0.2
0.0
0.4
0.3
0.0
0.0
T
0.4
0.2
0.0
0.3
0.0
0.3
-1.0 -0.2
0.3
0.6
1.6
2.0
Pth
C.Pizzi, DEI – Univ. Of Padova (Italy)
[G,0.2]
[T,0.4]
[A,0.3]
AC-Extension by LA

Second position
[C,0.1]
1
2
3
4
5
6
A
0.3
0.0
0.1
0.2
1.0
0.3
C
0.1
0.8
0.5
0.2
0.0
0.4
G
0.2
0.0
0.4
0.3
0.0
0.0
T
0.4
0.2
0.0
0.3
0.0
0.3
-1.0 -0.2
0.3
0.6
1.6
2.0
Pth
[G,0.2]
[T,0.4]
[A,0.3]
[A,0.1]
[G,0.1]
C.Pizzi, DEI – Univ. Of Padova (Italy)
[T,0.3]
[C,0.9]
AC-Extension by LA

Third position
[C,0.1]
1
2
3
4
5
6
A
0.3
0.0
0.1
0.2
1.0
0.3
C
0.1
0.8
0.5
0.2
0.0
0.4
G
0.2
0.0
0.4
0.3
0.0
0.0
T
0.4
0.2
0.0
0.3
0.0
0.3
-1.0 -0.2
0.3
0.6
1.6
2.0
Pth
[G,0.2]
[T,0.4]
[A,0.3]
[A,0.1]
[G,0.1]
[G,0.5] [C,0.6]
C.Pizzi, DEI – Univ. Of Padova (Italy)
[T,0.3]
[C,0.9]
ACE Example
1
CGTACACTCGGTA
t
g
c
a
t
g
a
t
c
a
c
g
c
C.Pizzi, DEI – Univ. Of Padova (Italy)
ACE Example
2
CGTACACTCGGTA
t
g
c
a
t
g
a
t
c
a
c
g
c
C.Pizzi, DEI – Univ. Of Padova (Italy)
ACE Example
3
CGTACACTCGGTA
t
g
c
a
t
g
a
t
c
a
c
g
c
C.Pizzi, DEI – Univ. Of Padova (Italy)
ACE Example
4
CGTACACTCGGTA
t
g
c
a
t
g
a
t
c
a
c
g
c
C.Pizzi, DEI – Univ. Of Padova (Italy)
ACE Example
5
CGTACACTCGGTA
t
g
c
a
t
g
a
t
c
a
c
g
c
C.Pizzi, DEI – Univ. Of Padova (Italy)
ACE Example
6
CGTACACTCGGTA
t
g
c
a
t
g
a
t
c
a
c
g
c
C.Pizzi, DEI – Univ. Of Padova (Italy)
ACE Example
7
CGTACACTCGGTA
Match at p-m+1 = 7-6+1=2
t
g
c
a
t
g
a
t
c
a
c
g
c
C.Pizzi, DEI – Univ. Of Padova (Italy)
Minimum Gain for ACE



Dual Concept of look-ahead
Compute for every prefix the
minimum contribution of the
remaining positions in the pattern
If current_score(i) + min_gain(i) > Th


Report a match
Adv: in the automaton save a full
subtree of height m-i
C.Pizzi, DEI – Univ. Of Padova (Italy)
Example: M0003, MSS=0.85
[G,18500]
C.Pizzi, DEI – Univ. Of Padova (Italy)
Example: M0003, MSS=0.85
[G,18500]
[C,37000]
C.Pizzi, DEI – Univ. Of Padova (Italy)
Example: M0003, MSS=0.85
[G,18500]
[C,37000]
[C,55500]
• GCC is sufficient to detect a match
C.Pizzi, DEI – Univ. Of Padova (Italy)
h=3
Example: M0003, MSS=0.85
[G,18500]
[C,37000]
h=3
[C,55500]
• Save 5464 nodes out of 5468
C.Pizzi, DEI – Univ. Of Padova (Italy)
mh
i
|
A
|

i 1
Minimum Gain ACE
C.Pizzi, DEI – Univ. Of Padova (Italy)
Look-ahead Filtration

Compute the scores for all words of fixed length k
and store them




O(|Σ|k) pre-processing
Sliding window of size k
When score ≥ Pth[k], check remaining symbols with
LA (up to m-k)
O(n + (m -k)r) search;

k is the prefix length, r is avg number of full scoring
C.Pizzi, DEI – Univ. Of Padova (Italy)
Lookahaed Filtration Example
|Σ|k
entries
K=3
SCORE
AAA
0.4
...
...
ATT
0.5
CAA
0.2
...
...
CGT
0.1
CTT
0.3
GAA
0.3
...
...
GTA
0.5
...
...
GTT
0.4
TAA
0.5
...
...
TTT
0.6
Pth[3]=0.3
CGTACACTCGGTA
Score(CGT) = 0.1 < Pth[3]
Shift and concatenate to obtain the
next 3-mer
C.Pizzi, DEI – Univ. Of Padova (Italy)
Filtered Lookahaed Example
|Σ|k
entries
K=3
SCORE
AAA
0.4
...
...
ATT
0.5
CAA
0.2
...
...
CGT
0.1
CTT
0.3
GAA
0.3
...
...
GTA
0.5
...
...
GTT
0.4
TAA
0.5
...
...
TTT
0.6
Pth[3]=0.3
CGTACACTCGGTA
Score(GTA) = 0.5 > Pth[3]
Check at most m-k remaining symbols
Score(GTAC) = 0.7 > Pth[4]
Score(GTACA) = 1.7 > Pth[5]
Score(GTACAC) = 2.1 > th
Match!
C.Pizzi, DEI – Univ. Of Padova (Italy)
More on ACE and LF

It is possible to combine both
methods


Automaton build on qualifying prefixes
only
Multi-matrix version
C.Pizzi, DEI – Univ. Of Padova (Italy)
67
Super-Alphabet

Code words of length k to superalphabet symbols



|Σ|k symbols are needed
Code the matrix M into matrix M’
(|Σ|k x m/k)
Run the naive algorithm on the
sequence O(nm/k)
C.Pizzi, DEI – Univ. Of Padova (Italy)
SuperAlphabet Example
K=2
SCORE 1-2
SCORE 3-4
SCORE 5-6
AA
0.3
0.3
1.3
AC
1.1
0.3
1.4
AG
0.3
0.4
1.0
AT
0.3
0.4
1.3
|Σ|k
CA
0.1
0.7
0.3
entries
CC
0.9
0.7
0.4
CG
0.1
0.8
0.0
CT
0.3
0.8
0.3
GA
0.2
0.6
0.3
GC
1.0
0.6
0.4
GG
0.2
0.7
0.0
GT
0.4
0.7
0.3
TA
0.4
0.2
0.3
TC
1.2
0.2
0.4
TG
0.4
0.3
0.0
TT
0.6
0.3
0.3
C.Pizzi, DEI – Univ. Of Padova (Italy)
CGTACACTCGGTA
Score = 0.6 < Th
SuperAlphabet Example
K=2
SCORE 1-2
SCORE 3-4
SCORE 5-6
AA
0.3
0.3
1.3
AC
1.1
0.3
1.4
AG
0.3
0.4
1.0
AT
0.3
0.4
1.3
|Σ|k
CA
0.1
0.7
0.3
entries
CC
0.9
0.7
0.4
CG
0.1
0.8
0.0
CT
0.3
0.8
0.3
GA
0.2
0.6
0.3
GC
1.0
0.6
0.4
GG
0.2
0.7
0.0
GT
0.4
0.7
0.3
TA
0.4
0.2
0.3
TC
1.2
0.2
0.4
TG
0.4
0.3
0.0
TT
0.6
0.3
0.3
C.Pizzi, DEI – Univ. Of Padova (Italy)
CGTACACTCGGTA
Score = 2.1 match!
Experiments




Jaspar Database: 123 TFBS matrices
(DNA), PRINTS database (proteins)
Test sequence about 50M bases
P-value defines threshold
3 GHz Intel Pentium IV processor
with 2 gigabytes of main memory,
running under Linux.
C.Pizzi, DEI – Univ. Of Padova (Italy)
DNA – avg running times per matrix
C.Pizzi, DEI – Univ. Of Padova (Italy)
72
DNA- matrix length
C.Pizzi, DEI – Univ. Of Padova (Italy)
73
DNA – window width
C.Pizzi, DEI – Univ. Of Padova (Italy)
74
Proteins, avg time per matrix
C.Pizzi, DEI – Univ. Of Padova (Italy)
75
Proteins - matrix length
C.Pizzi, DEI – Univ. Of Padova (Italy)
76
MOODS – Motif Occurrence Detection Suite
C.Pizzi, DEI – Univ. Of Padova (Italy)
77
Conclusions



Searching matrix is a core step for
many bioinformatics applications
(searching, discovery, classification…)
Several approaches have been
developed in recent years
Online methods based on filtering are
currently the most efficient
C.Pizzi, DEI – Univ. Of Padova (Italy)
78
References




C.Pizzi, P.Rastas, E.Ukkonen
Fast Search Algorithms for Position Specific Scoring Matrices
In Proc. of the 1st Conference on Bioinformatics Research and Development
(BIRD 07), Berlin, Germany, March 2007, LNCS/LCBI 4414 pp 239--250
C.Pizzi, E.Ukkonen
Fast Profile Matching Algorithms - a survey
Theoretical Computer Science, 395(2-3), 2008, pp 137--157, Special Issue
SAIL: String Algorithms, Information and Learning
C.Pizzi, P.Rastas, E.Ukkonen
Finding significant matches of position weight matrices in linear time
Accepted for publication by IEEE Transaction on Computational Biology and
Bioinformatics, 2009
J.Korhonen, P.Martinmaki, C.Pizzi, P.Rastas, E.Ukkonen
MOODS: fast search for position weight matrix matches in DNA
sequences
Bioinformatics 2009 25(23):3181-3182
C.Pizzi, DEI – Univ. Of Padova (Italy)
79
Thanks
C.Pizzi, DEI – Univ. Of Padova (Italy)
80
Acknowledgements

Esko Ukkonen, Pasi Rastas, Janne
Korhonen, P.Martinmaki



Academy of Finland grant “From Data to
knowledge”
EU Project “Regulatory Networks”
Premio di Ricerca `Avere Trent’Anni’

Univ.Padova, Parco Scientifico Galileo, Il
Mattino, Giovani Confindustria, Scuola
Galileiana di Studi Superiori
C.Pizzi, DEI – Univ. Of Padova (Italy)
Length 100
• 13 patterns obtained by concateneting Jaspar matrices
•MSS: Matrix Similarity Score (% of maximal score)
NA = Naïve Algorithm
LSA = Look-ahead Search Algorithm
LFA = Look-ahead Filter Algorithm (k=7)
NS = Naïve Superalphabet (k=7)
C.Pizzi, DEI – Univ. Of Padova (Italy)
Multiple Matrices Search
C.Pizzi, DEI – Univ. Of Padova (Italy)
Running Time per matrix
C.Pizzi, DEI – Univ. Of Padova (Italy)
Length 0 to 15 (108 matrices)
NA = Naïve Algorithm
LSA = Look-ahead Search Algorithm
ACE = Aho-Corasick Expansion
LFA = Look-ahead Filtration Algorithm (k=7)
NS = Naïve Super-alphabet (k=7)
C.Pizzi, DEI – Univ. Of Padova (Italy)
Running Time per matrix
C.Pizzi, DEI – Univ. Of Padova (Italy)
Length 16 to 30 (15 matrices)
NA = Naïve Algorithm
LSA = Look-ahead Search Algorithm
LFA = Look-ahead Filtration Algorithm
NS = Naïve Super-alphabet
C.Pizzi, DEI – Univ. Of Padova (Italy)
Length 100
• 13 patterns obtained by concateneting Jaspar matrices
P=10-5
P=10-4
P=10-3
P=10-2
NA
10.234
10.244
10.434
11.080
LSA
11.835
12.675
13.335
15.118
LFA
9.955
10.347
11.096
12.965
NS
3.576
3.677
4.593
9.918
NA = Naïve Algorithm
LSA = Look-ahead Search Algorithm
LFA = Look-ahead Filter Algorithm (k=7)
NS = Naïve Superalphabet (k=7)
C.Pizzi, DEI – Univ. Of Padova (Italy)
Motif Representation

Istances of a biological signal are
different
ACATAC
CCGAAT
ATGCAT
GCCTAC
TCCAAA
TTCGAA
ACGGAC
TCCTAT
GCCCAC
TCGGAA
TCC(G|T)AC
Consensus
-> pattern representation
1 2 3 4 5 6
A
C
G
T
Profile
-> matrix representation
C.Pizzi, DEI – Univ. Of Padova (Italy)
Scarica

cgtacactcggta