Neural networks and
protein structure
prediction
Rita Casadio
Interdepartmental Centre for Biotechnological Research
University of Bologna, Italy
The “omics” era: complete genomes
•Archaea:
16 species/33 in progress
•Bacteria:
83 species
•Eukaryotes: 17 species (242 chromosomes)
www.ncbi.nlm.nih.gov
Draft of the human genome
•Nature (2/15/01) Human Genome Issue
http://www.ncbi.nlm.nih.gov/genome/guide/human
http://www.ensembl.org/
•Science (2/16/01) Human Genome Issue
http://public.celera.com/index.cfm
From Sequence to Function
Functional genomics, proteomics and
interactomics
> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Genes
Protein sequences
Protein structures
Function
BASIC PRINCIPLES OF PROTEIN
STRUCTURE
Levels of structural organization
Primary
Secondary
Tertiary
Quaternary
BASIC PRINCIPLES OF PROTEIN
STRUCTURE
The elements of secondary structure
β-sheet
α-helix
C
Protein Folding Prediction
The folding process
Folding kinetics:
The native
protein
The chain
The initiation sites
Databases of Biological Sequences
and Structures
NCBI:
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
18,197,119 sequences
22,616,937,182 nucleotides
Swiss-Prot:
113,470 sequences
41,413,223 residues
PDB:
17,510 structures
August 2002
From the PDB we can extract about 1500
examples of chains with known tertiary
structure, in order to derive non-redundant
information on the relationship between
sequence and:
Secondary structure
Structural and functional motifs
Tertiary (3D) structure
Protein Folding
T T C C P S I V A R S N F N V C R L P G T P E A L C A T
Y T G C I I I P G A T C P G D Y A N
Characteristics of Structure Prediction
from Protein Sequences
A large set of data for which the solution to the problem
is known
It is difficult (or impossible) to formulate an analytical
solution to the problem
The databases are updated continuously
(large data volumes, need to operate in real
time)
General non-linear functional mapping
X space: x1 x2 ... xn
Y space: y1 y2 ... yn
Machine-learning tools:
Neural Networks
Training
Prediction
Set from the database
New sequence
General
rules
Known mapping
Prediction
The input window
The properties of residue R depend both on
local interactions (window W) and on non-local ones
(context C)
Context C
Window W
Residue R
Neural Network
Outputs: O_α, O_non-α
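As an illustration of the input window, the following is a minimal sketch of how a window of residues centred on R could be turned into a network input. The one-hot (orthogonal) code, the 20-letter alphabet order, and the names `encode_window` and `half_width` are assumptions made for this example; the slides do not specify the encoding actually used.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet order

def encode_window(sequence, center, half_width):
    """One-hot encode the 2*half_width+1 residues centred on `center`.

    Window positions that fall outside the chain are encoded as all zeros.
    """
    vector = []
    for pos in range(center - half_width, center + half_width + 1):
        one_hot = [0.0] * len(AMINO_ACIDS)
        if 0 <= pos < len(sequence):
            idx = AMINO_ACIDS.find(sequence[pos])
            if idx >= 0:  # unknown symbols stay all-zero
                one_hot[idx] = 1.0
        vector.extend(one_hot)
    return vector

seq = "TTCCPSIVARSNFNVCRLPG"
x = encode_window(seq, center=0, half_width=2)  # window W of 5 residues
```

Sliding `center` along the chain yields one input vector per residue R; the window width is one of the tunable parameters of the method.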
Input based on Evolutionary Information
Multiple Sequence Alignment (MSA)
Position along the sequence: 1 2 3 4 5 6 7 8 9 10 11 12 13
MVKGPGLYTDIGKKARDLLYKDYHS--DKKFTISTYSPTGVAITSSGTKKGEL--FLGDV
MAKGPGLYTDIGKKARDLLYRDYQT--DQKFSITTYSPTGVAITSSGTKKGDL--FLADV
MVKGPGLYSDIGKRARDLLYRDYQS--DHKFTLTTYTANGVAITSTGTKKGEL--FLADV
MVKGPGLYSDIGKKARDLLYRDYVS--DHKFTVTTYSTTGVAITASGLKKGEL--FLADV
MVKGPGLYTEIGKKARDLLYRDYQG--DQKFSVTTYSSTGVAITTTGTNKGSL--FLGDV
MVVAVGLYTDIGKKTRDLLYKDYNT--HQKFCLTTSSPNGVAITAAGTRKNES--IFGEL
-MGGPGLYSGIGKKAKDLLYRDYQT--DHKFTLTTYTANGPAITATSTKKADL--TVGEI
AVVRPYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVTGSLEI
--AVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVTGSL
-MAVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVNGSL
--AVPPSYADLGKSARDIFNKGYGFG-LVKLDVKTKSATGVEFTTSGTSNTDSGKVNGSL
--MAPPSYSDLGKQARDIFSKGYNFG-LWKLDLKTKTSSGIEFNTAGHSNQESGKVFGSL
--MAVPAFSDIAKSANDLLNKDFYHLAAGTIEVKSNTPNNVAFKVTGKSTHDK-VTSGAL
Aligned sequences
Input window
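A sketch of how evolutionary information could be fed to the network: each alignment column is summarised by the relative frequency of the 20 residues, and the columns inside the window are concatenated. The function names, the handling of gaps (ignored), and the absence of sequence weighting are assumptions of this example, not details given in the slides.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed alphabet order

def column_profile(alignment, column):
    """Relative frequency of each residue type in one MSA column (gaps ignored)."""
    counts = {aa: 0 for aa in AMINO_ACIDS}
    total = 0
    for row in alignment:
        if column < len(row) and row[column] in counts:
            counts[row[column]] += 1
            total += 1
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]

def window_profile(alignment, center, half_width):
    """Concatenate the column profiles of the window centred on `center`."""
    length = max(len(row) for row in alignment)
    features = []
    for col in range(center - half_width, center + half_width + 1):
        if 0 <= col < length:
            features.extend(column_profile(alignment, col))
        else:
            features.extend([0.0] * len(AMINO_ACIDS))  # beyond the chain ends
    return features

msa = ["MVKGPGLYTD", "MAKGPGLYTD", "-MGGPGLYSG"]  # first columns of three aligned rows
first_column = column_profile(msa, 0)
```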
Artificial Neural Networks
Single-layer perceptron
Inputs: x_1, ..., x_d, plus the bias input x_0
Outputs: z_1, ..., z_m
$$a = \sum_{i=0}^{d} w_i x_i, \qquad z = g(a)$$
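The forward pass of a single-layer perceptron can be sketched as below. The logistic sigmoid chosen for the activation g and the convention that the bias input x_0 is fixed at 1 are assumptions of this example; the slide only states that z = g(a).

```python
import math

def sigmoid(a):
    """Logistic activation, used here as an assumed choice for g."""
    return 1.0 / (1.0 + math.exp(-a))

def perceptron_output(weights, inputs):
    """Compute z_j = g(sum_i w_ji * x_i) for every output unit j.

    weights[j] holds [w_0, w_1, ..., w_d] for output j; x_0 = 1 is the bias input.
    """
    x = [1.0] + list(inputs)
    outputs = []
    for w in weights:
        a = sum(w_i * x_i for w_i, x_i in zip(w, x))
        outputs.append(sigmoid(a))
    return outputs

weights = [[0.0, 1.0, -1.0],   # output unit 1
           [0.5, 0.0, 0.0]]    # output unit 2 (bias only)
z = perceptron_output(weights, [2.0, 2.0])
```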
The Error Function
Y_i(X_q) = output of the network
D_iq = expected value
The Training Algorithm: Back Propagation
(gradient descent; Rumelhart et al. 1986)
Correction to the weights
μ = learning rate
η = momentum term
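The formulas themselves did not survive on these slides, so the sketch below uses the standard textbook forms as an assumption: a sum-of-squares error over outputs i and patterns q, and a weight correction equal to minus the learning rate times the gradient plus the momentum term times the previous correction. The function names and the toy quadratic problem are hypothetical.

```python
def sum_of_squares_error(outputs, targets):
    """E = 1/2 * sum over patterns q and outputs i of (Y_i(X_q) - D_iq)^2."""
    return 0.5 * sum((y - d) ** 2
                     for y_row, d_row in zip(outputs, targets)
                     for y, d in zip(y_row, d_row))

def momentum_step(weights, gradients, previous_deltas, learning_rate, momentum):
    """One correction: delta(t) = -learning_rate * dE/dw + momentum * delta(t-1)."""
    deltas = [-learning_rate * g + momentum * p
              for g, p in zip(gradients, previous_deltas)]
    new_weights = [w + dw for w, dw in zip(weights, deltas)]
    return new_weights, deltas

# Toy problem: minimise E(w) = 1/2 * (w - 3)^2, whose gradient is (w - 3).
w, dw = [0.0], [0.0]
for _ in range(200):
    grad = [w[0] - 3.0]
    w, dw = momentum_step(w, grad, dw, learning_rate=0.1, momentum=0.5)
```

In back propagation the gradients for the hidden layers are obtained by propagating the output error backwards through the network; that step is omitted here.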
Variable parameters of Neural Networks
• The input code
• The width of the sliding window
• The architecture: the number of nodes (neurons) and of layers
of neurons
• The learning rate
The Neural Networks in Bologna predict:
• The secondary structure of proteins
• The initiation sites of protein folding
• The topology of all-alpha and all-beta membrane
proteins (ISMB BEST PAPER AWARD 2002)
• The presence of signal peptides
• The bonding state of cysteines and the topology of
disulfide bridges
• The contact maps of proteins (BEST PREDICTOR
of the CATEGORY at CASP4)
• Protein-protein interaction surfaces
www.biocomp.unibo.it
General scheme of the predictors
available at our web site
Neural-network-based predictors
Towards 3D structure prediction:
Prediction of contact maps
Prediction of contacts between residues
Contacts in Proteins
F 156
F 297
V 299
I 269
V 238
V 271
I 240
Computation of Contact Maps
From 3D Structure
F 156
F 297
I 269
V 238
V 299
V 271
I 240
To Contact Map
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
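Going from a 3D structure to a contact map can be sketched as follows. The 8 Å distance threshold, the use of one representative atom per residue, and the `min_separation` parameter are assumptions for illustration only; the slides do not state which definition of contact was adopted.

```python
import math

def contact_map(coords, threshold=8.0, min_separation=1):
    """Binary, symmetric contact map from one 3D point per residue.

    Residues i and j are in contact when their distance is below `threshold`
    (in Angstrom) and they are at least `min_separation` apart in sequence.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap

# Four residues placed 3.8 Angstrom apart along a straight line (toy coordinates).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
cmap = contact_map(coords)
```

The map is an L x L matrix for a chain of L residues, which is why predicting it from sequence is a step towards recovering the 3D structure.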
3-D Modelling through Contact Maps
Bacteriorhodopsin
Model
1QHJ (1.9 Å)
N
Contact map
MARC
C
RMSD = 2.5 Å
Machine Learning Tools
Neural Networks learn the mapping from the sequence
to the contact map
Training
Prediction
Database set
Sequence
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN
General
rules
Known mapping
Prediction of the contact
map
T0087: 310 residues
A=20 % (FR/NF)
C
N
T0110: 128 residues
A=30% (NF)
N
C
Neural-network-based predictors
Towards 3D structure prediction:
Prediction of disulfide bridges
Protein Folding
RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTF
VYGGCRAKRNNFKSAEDCMRTCGGA
Disulfide bonds between cysteines in proteins
[Figure: disulfide bridge between two cysteines; atom labels S, Cα, C]
2 -SH → -SS- + 2H⁺ + 2e⁻
S-S distance ≈ 2.2 Å
Torsion angle C-S-S-C ≈ 90°
Bond energy ≈ 3 kcal/mol
Neural Networks for the Prediction of the
disulfide-bonding state of cysteines in proteins
Bonding
Non bonding
1 2 3 4 5 6 7 8 9 10 11 12 13
MVKGPGLYTDIGKKARDLLYKDYHS--DKKFTISTYSCTGVAITSSGTKKGEL--FLGDV
SAKGPGLYTDIGKKARDLLYRDYQT--DQKFSITTYSCTGVAITSSGTKKGDL--FLADV
MVKGPGLYSDIGKRARDLLYRDYQS--DHKFTLTTYTCNGVAITSTGTKKGEL--FLADV
MVKGPGLYSDIGKKARDLLYRDYVS--DHKFTVTTYSCTGVAITASGLKKGEL--FLADV
MVKGPGLYTEIGKKARDLLYRDYQG--DQKFSVTTYSCTGVAITTTGTNKGSL--FLGDV
MVVAVGLYTDIGKKTRDLLYKDYNT--HQKFCLTTSSCNGVAITAAGTRKNES--IFGEL
-MGGPGLYSGIGKKAKDLLYRDYQT--DHKFTLTTYTCNGPAITATSTKKADL--TVGEI
AVVRPYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVTGSLEI
--AVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSGNGLEFTSSGSANTETTKVTGSL
-MAVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSGNGLEFTSSGSANTETTKVNGSL
--AVPPSYADLGKSARDIFNKGYGFG-LVKLDVKTKSCTGVEFTTSGTSNTDSGKVNGSL
--MAPPSYSDLGKQARDIFSKGYNFG-LWKLDLKTKTCSGIEFNTAGHSNQESGKVFGSL
--MAVPAFSDIAKSANDLLNKDFYHLAAGTIEVKSNTCNNVAFKVTGKSTHDK-VTSGAL
W1
W2
W3
MYSFPNSFRFGWSQAGFQCEMSTPGSEDPNTDWYKWVHDPENMAAGLCSGDLPENGPGYWGNYKTFHDNAQKMCLKIARLNVEWSRIFPNP...
P(B|W1), P(F|W1)
P(B|W2), P(F|W2)
P(B|W3), P(F|W3)
Begin
Cysteine free states
Cysteine bonding states
End
Most probable path through the states
Prediction of the bonding and non-bonding states of all the cysteines of the sequence
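The idea of choosing the most probable path through bonding (B) and free (F) states can be sketched with a small dynamic program. This is a simplified stand-in, not the model shown on the slide: it assumes that the per-window network outputs P(B|W) are the only scores, that all transitions are equally likely, and that bonded cysteines must occur in even number because each bridge pairs two of them. The function name is hypothetical.

```python
import math

def most_probable_states(p_bonded):
    """Best B/F labelling of the cysteines with an even number of B labels.

    p_bonded[k] is P(B|W_k) for the k-th cysteine; P(F|W_k) = 1 - P(B|W_k).
    """
    neg_inf = float("-inf")
    # best[parity] = (log-probability, labels) of the best partial path whose
    # number of bonded cysteines so far has that parity (0 = even, 1 = odd).
    best = {0: (0.0, ""), 1: (neg_inf, "")}
    for p in p_bonded:
        log_b = math.log(p) if p > 0.0 else neg_inf
        log_f = math.log(1.0 - p) if p < 1.0 else neg_inf
        new_best = {}
        for parity in (0, 1):
            stay_score, stay_path = best[parity]      # label F keeps the parity
            flip_score, flip_path = best[1 - parity]  # label B flips the parity
            candidates = [(stay_score + log_f, stay_path + "F"),
                          (flip_score + log_b, flip_path + "B")]
            new_best[parity] = max(candidates, key=lambda c: c[0])
        best = new_best
    return best[0][1]  # the path must end with an even number of B

labels = most_probable_states([0.9, 0.4, 0.3])
```

Taken one at a time, the three cysteines would be labelled B, F, F (an odd number of bonded cysteines); the constrained path instead returns a globally consistent assignment.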
The hybrid system
Accuracy per cysteine: 88%; per protein: 84%
[Figure: correctly predicted proteins (%) as a function of the number of cysteines per protein (1 to 35), for the NN-based and the HNN-based predictors; the number of proteins in each class is reported below the axis]
Protein Science, in press
Output
Input
www.prion.biocomp.unibo.it/cyspred.html
VGDKLIPLKITYDYYVCNNHMDTDTSYERWPALG
TYRPLNGRDCVMNNHKLAASDRWECDQREPLYTC
MCNKDLPTKAAGPLMNTRPILNLSREEWLLPLLT
HMNVVAGLCKLP
Disulfide bonding cysteine
Free cysteine
CAN THE PREDICTORS BE
USED TO DISCOVER NEW
PROTEINS?
Escherichia coli K12, complete genome
Completed: Oct 13, 1998.
Total Bases: 4,639,221 bp
NCBI
(www.ncbi.nlm.nih.gov)
Protein coding genes: 4,289
Structural RNAs: 115
EcoGene/EcoProt
(bmb.med.miami.edu/EcoGene)
Protein coding genes: 4,173
Structural RNAs : 120
EcoGene/SwissProt functional annotation
Keywords of SwissProt entries (where they exist) are extracted:
2160 ANNOTATED PROTEINS (52 %)
421 Inner membrane proteins
35 Outer membrane proteins
1704 Globular proteins
760 PARTIALLY ANNOTATED PROTEINS (18 %)
proteins annotated as “Hypothetical proteins” and with
other functional annotations
352 Inner membrane proteins
18 Outer membrane proteins
390 Globular proteins
1253 NON ANNOTATED PROTEINS (30 %)
137 proteins have no SwissProt entry
1116 proteins have no functional annotation in SwissProt
Outer membrane proteins
(all-β transmembrane proteins)
Inner membrane proteins
(all-α transmembrane proteins)
PROTEOME
HUNTER
Signal peptide
All-α TM
All-α TM
All-β TM
Globular
all-α TM
Globular
all-β TM
all-α TM
Predicting globular, inner and outer membrane proteins in
genomes of Gram-negative bacteria with Hunter

Organism                        Outer membrane   Inner membrane   Globular        Total
Escherichia coli K12            65 (1.6%)        907 (21.7%)      3201 (76.7%)    4173
  New*                          18               136              1099            1253
Escherichia coli O157:H7        78 (1.5%)        1034 (19.3%)     4249 (79.2%)    5361
  New                           10               327              1564            1901
Chlamydia pneumoniae CWL029     12 (1.1%)        290 (27.6%)      750 (71.3%)     1052
  New                           2                181              236             419
Salmonella typhimurium LT2      70 (1.6%)        1002 (22.5%)     3379 (75.9%)    4451
  New                           0                2                21              23
Neisseria meningitidis MC58     34 (1.7%)        372 (18.4%)      1619 (80.0%)    2025
  New                           6                176              662             844
Helicobacter pylori 26695       36 (2.3%)        352 (22.5%)      1178 (75.2%)    1566
  New                           10               141              445             596
Haemophilus influenzae Rd       23 (1.3%)        348 (20.4%)      1338 (78.3%)    1709
  New                           5                121              430             556
Thermotoga maritima             18 (1.0%)        370 (20.0%)      1458 (79.0%)    1846
  New                           11               203              559             773
Pseudomonas aeruginosa          131 (2.4%)       1292 (23.2%)     4142 (74.4%)    5565
  New                           62               616              1867            2545

* the number of new proteins predicted in the class with Hunter, out of the
non-annotated region
http://www.biocomp.unibo.it
Welcome to the CIRB Biocomputing Group home
page
This is the Biocomputing unit of the CIRB Centro Interdipartimentale di Ricerche Biotecnologiche
Group Main Research Fields.
Group Publications
Technology provider for the DRUG consortium of the
NOTSOMAD TTN initiative.
BIOCOMPUTING GROUP
Group leader : Rita Casadio
Group members:
Piero Fariselli
Mario Compiani
Pier Luigi Martelli
Emidio Capriotti
Ivan Rossi
Gianluca Tasco
Collaborations
Italy
L.Masotti, Biochemistry, Bologna
M.Rossi, IBPE/CNR, Napoli
G.Mita, IIGB/CNR, Napoli
G.Irace, Biochemistry, Napoli
D.Boraschi, CNR, Pisa
P.Arrigo, ICE/CNR, Genova
P.Mariani, Physics, Ancona
G.Campadelli-Fiume, Pathology, Bologna
S.Prosperi, Veterinary, Bologna
F.Bernardi, Chemistry, Bologna
S.Ciurli, Agricultural Chemistry, Bologna
C.Bergamini, Biochemistry, Ferrara
Abroad
B.Rost, Columbia University, New York
A.Valencia, Protein Design Group, Cantoblanco, Madrid
P.Baldi, Genomics and Bioinformatics, Irvine, California
A.Krogh, University of Copenhagen, Copenhagen
N.Ben Tal, Israel Institute of Technology, Tel Aviv
The cross validation procedure
Protein set
Training set
Testing set
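The cross validation procedure can be sketched as a k-fold split of the protein set: each subset is used once for testing while the remaining ones are used for training. The number of folds and the round-robin assignment of proteins to folds are assumptions of this example; the slide does not give them.

```python
def k_fold_splits(proteins, k):
    """Yield (training set, testing set) pairs, one per fold."""
    folds = [proteins[i::k] for i in range(k)]  # round-robin assignment (assumed)
    for i in range(k):
        testing = folds[i]
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        yield training, testing

protein_set = ["p1", "p2", "p3", "p4", "p5", "p6"]  # hypothetical identifiers
splits = list(k_fold_splits(protein_set, 3))
```

Every protein is tested exactly once by a network that never saw it during training, so the averaged scores estimate the performance on new sequences.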
Evaluation of the performance
$$Q_2 = \frac{\text{correct predictions}}{\text{total predictions}} = \frac{p+n}{N}$$
$$Q(x) = \frac{\text{correct predictions in class } x}{\text{total observations in class } x} = \frac{p}{p+u}$$
$$P(x) = \frac{\text{correct predictions in class } x}{\text{total predictions in class } x} = \frac{p}{p+o}$$
Correlation index:
$$C = \frac{p \cdot n - o \cdot u}{[(p+o) \cdot (p+u) \cdot (n+o) \cdot (n+u)]^{1/2}}$$
Legend:
                   Observed x   Observed non-x
Predicted x        p            o
Predicted non-x    u            n
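The four indices can be computed directly from the counts p, n, o and u of the legend. The sketch below takes N as p + n + o + u (all predictions) and returns 0 when a denominator vanishes, which is a convention assumed here rather than one stated on the slide.

```python
import math

def binary_scores(p, n, o, u):
    """Return (Q2, Q(x), P(x), C) from the confusion counts.

    p: correct predictions of class x     o: over-predictions of x
    n: correct predictions of non-x       u: under-predictions of x
    """
    total = p + n + o + u
    q2 = (p + n) / total
    coverage = p / (p + u) if (p + u) else 0.0     # Q(x)
    precision = p / (p + o) if (p + o) else 0.0    # P(x)
    denom = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    correlation = (p * n - o * u) / denom if denom else 0.0
    return q2, coverage, precision, correlation

q2, qx, px, c = binary_scores(p=40, n=40, o=10, u=10)
```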
Evaluation of the efficiency of contact map predictions
1) Accuracy:
$$A = N_{cp}^{*} / N_{cp}$$
where N_cp* and N_cp are the number of correctly assigned contacts and the total number of predicted contacts, respectively.
2) Improvement over a random predictor:
$$R = A / (N_c / N_p)$$
where N_c/N_p is the accuracy of a random predictor; N_c is the number of real contacts in the protein of length L_p, and N_p is the number of all possible contacts.
3) Difference in the distribution of the inter-residue distances in the 3D structure for predicted pairs compared with all pair distances in the structure (Pazos et al., 1997):
$$X_d = \sum_{i=1}^{n} \frac{P_{ic} - P_{ia}}{n \, d_i}$$
where n is the number of bins of the distance distribution (15 equally distributed bins from 4 to 60 Å cluster all the possible distances of residue pairs observed in the protein structure); d_i is the upper limit (normalised to 60 Å) for each bin, e.g. 8 Å for the 4 to 8 Å bin; P_ic and P_ia are the percentage of predicted contact pairs (with distance between d_i and d_i-1) and that of all possible pairs, respectively.
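The three measures can be sketched as below. For X_d the example assumes 15 bins of 4 Å with upper limits 4, 8, ..., 60 Å, and places distances beyond 60 Å in the last bin; both choices are one reading of the definition, and the function names are hypothetical.

```python
def accuracy(n_correct, n_predicted):
    """A = Ncp* / Ncp."""
    return n_correct / n_predicted

def improvement_over_random(acc, n_real_contacts, n_possible_pairs):
    """R = A / (Nc / Np)."""
    return acc / (n_real_contacts / n_possible_pairs)

def xd(predicted_distances, all_distances, n_bins=15, max_dist=60.0):
    """Xd = sum_i (Pic - Pia) / (n * d_i), with d_i normalised to max_dist."""
    width = max_dist / n_bins

    def percentages(distances):
        counts = [0] * n_bins
        for d in distances:
            counts[min(int(d // width), n_bins - 1)] += 1
        return [100.0 * c / len(distances) for c in counts]

    p_c = percentages(predicted_distances)  # predicted contact pairs
    p_a = percentages(all_distances)        # all residue pairs
    total = 0.0
    for i in range(n_bins):
        d_i = (i + 1) * width / max_dist    # normalised upper limit of bin i
        total += (p_c[i] - p_a[i]) / (n_bins * d_i)
    return total

a = accuracy(20, 100)
r = improvement_over_random(a, n_real_contacts=50, n_possible_pairs=1000)
shift = xd([5.0, 6.0], [5.0, 50.0])
```

A positive X_d means the predicted pairs are shifted towards short distances compared with all pairs, which is the desired behaviour even when the exact contacts are missed.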
The cross validation procedure
Protein set
Testing set 1
Training set 1
BASIC PRINCIPLES OF PROTEIN
STRUCTURE
The building blocks of the primary structure
Amino acids
Protein backbone