Neural networks and protein structure prediction
Rita Casadio
Interdepartmental Centre for Biotechnological Research, University of Bologna, Italy

The “omics” era: complete genomes
• Archaea: 16 species / 33 in progress
• Bacteria: 83 species
• Eukaryotes: 17 species (242 chromosomes)
www.ncbi.nlm.nih.gov

Draft of the human genome
• Nature (2/15/01) Human Genome Issue
  http://www.ncbi.nlm.nih.gov/genome/guide/human
  http://www.ensembl.org/
• Science (2/16/01) Human Genome Issue
  http://public.celera.com/index.cfm

From sequence to function
Functional genomics, proteomics and interactomics
> RICIN GLYCOSIDASE
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
Genes → Protein sequences → Protein structures → Function

BASIC PRINCIPLES OF PROTEIN STRUCTURE
Levels of structural organization: primary, secondary, tertiary, quaternary
Secondary structure elements: β-sheet, α-helix, C

Protein folding prediction
The folding process
Folding kinetics: the native protein, the chain, the initiation sites

Databases of biological sequences and structures
NCBI:
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
18,197,119 sequences, 22,616,937,182 nucleotides
Swiss-Prot: 113,470 sequences, 41,413,223 residues
PDB: 17,510 structures
(August 2002)

About 1500 chains of known tertiary structure can be extracted from the PDB to derive non-redundant information on the relationship between sequence and:
• Secondary structure
• Structural and functional motifs
• Tertiary (3D) structure

Protein folding
TTCCPSIVARSNFNVCRLPGTPEALCATYTGCIIIPGATCPGDYAN

Characteristics of structure prediction from protein sequences
• A large data set is available for which the solution to the problem is known
• An analytical solution of the problem is difficult (impossible) to formulate
• The databases are updated continuously (large volume of data, need to operate in real time)

General non-linear functional mapping
X space: X = (x1, x2, ..., xn) → Y space: Y = (y1, y2, ..., yn)

Tools derived from machine learning: neural networks
• Training: a set from the database (known mapping) yields general rules
• Prediction: a new sequence yields a prediction

The input window
The properties of residue R depend both on local interactions (window W) and on non-local ones (context C).
[Figure: context C, window W and residue R feed the neural network, whose outputs are O_α and O_non-α]

Input based on evolutionary information
Multiple Sequence Alignment (MSA)
Position along the sequence; aligned sequences 1 to 13:
MVKGPGLYTDIGKKARDLLYKDYHS--DKKFTISTYSPTGVAITSSGTKKGEL--FLGDV
MAKGPGLYTDIGKKARDLLYRDYQT--DQKFSITTYSPTGVAITSSGTKKGDL--FLADV
MVKGPGLYSDIGKRARDLLYRDYQS--DHKFTLTTYTANGVAITSTGTKKGEL--FLADV
MVKGPGLYSDIGKKARDLLYRDYVS--DHKFTVTTYSTTGVAITASGLKKGEL--FLADV
MVKGPGLYTEIGKKARDLLYRDYQG--DQKFSVTTYSSTGVAITTTGTNKGSL--FLGDV
MVVAVGLYTDIGKKTRDLLYKDYNT--HQKFCLTTSSPNGVAITAAGTRKNES--IFGEL
-MGGPGLYSGIGKKAKDLLYRDYQT--DHKFTLTTYTANGPAITATSTKKADL--TVGEI
AVVRPYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVTGSLEI
--AVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVTGSL
-MAVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVNGSL
--AVPPSYADLGKSARDIFNKGYGFG-LVKLDVKTKSATGVEFTTSGTSNTDSGKVNGSL
--MAPPSYSDLGKQARDIFSKGYNFG-LWKLDLKTKTSSGIEFNTAGHSNQESGKVFGSL
--MAVPAFSDIAKSANDLLNKDFYHLAAGTIEVKSNTPNNVAFKVTGKSTHDK-VTSGAL
[Figure labels: aligned sequences, input window]

Artificial Neural Networks
Single-layer perceptron
Inputs x1, ..., xd plus the bias x0; outputs z1, ..., zm
$a = \sum_{i=0}^{d} w_i x_i$, $z = g(a)$

The error function
$Y_i(X_q)$ = output of the network
$D_{iq}$ = expected value

The training algorithm: back-propagation (gradient descent; Rumelhart et al., 1986)
Correction to the weights: μ = learning rate, η = momentum term

Variable parameters of neural networks
• The input code
• The width of the sliding window
• The architecture: the number of nodes (neurons) and of neuron layers
• The learning rate

The neural networks in Bologna predict:
• Protein secondary structure
• Protein folding initiation sites
• The topology of all-alpha and all-beta membrane proteins (ISMB BEST PAPER AWARD 2002)
• The presence of signal peptides
• The bonding state of cysteines and the topology of disulfide bridges
• Protein contact maps (BEST PREDICTOR of the CATEGORY at CASP4)
• Protein-protein interaction surfaces
www.biocomp.unibo.it

General scheme of the predictors available at our web site

Neural network-based predictors
Towards 3D structure prediction: prediction of contact maps

Prediction of contacts between residues
Contacts in proteins
[Figure labels: F 156, F 297, V 299, I 269, V 238, V 271, I 240]

Computation of Contact Maps
From 3D structure to contact map
[Figure labels: F 156, F 297, I 269, V 238, V 299, V 271, I 240; both axes of the contact map carry the sequence below]
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

3-D Modelling through Contact Maps
Bacteriorhodopsin: model compared with 1QHJ (1.9 Å)
Contact map → MARC → model, RMSD = 2.5 Å

Machine learning tools
Neural networks learn the mapping from the sequence to the contact map
• Training: a set from the database (known mapping) yields general rules
• Prediction: a sequence yields a prediction
TTCCPSIVARSNFNVCRLPGTPEAICATYTGCIIIPGATCPGDYAN

Prediction of the contact map
• T0087: 310 residues, A = 20% (FR/NF)
• T0110: 128 residues, A = 30% (NF)

Neural network-based predictors
Towards 3D structure prediction: prediction of disulfide bridges

Protein folding
RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGGCRAKRNNFKSAEDCMRTCGGA

Disulfide bonds between cysteines in proteins
[Figure labels: S, Cα, C]
2 -SH → -SS- + 2H+ + 2e−
• S-S distance: 2.2 Å
• Torsion angle C-S-S-C: 90°
• Bond energy: 3 kcal/mol

Neural Networks for the Prediction of the disulfide-bonding state of cysteines in proteins
Outputs: bonding, non bonding
Input: aligned sequences 1 to 13:
MVKGPGLYTDIGKKARDLLYKDYHS--DKKFTISTYSCTGVAITSSGTKKGEL--FLGDV
SAKGPGLYTDIGKKARDLLYRDYQT--DQKFSITTYSCTGVAITSSGTKKGDL--FLADV
MVKGPGLYSDIGKRARDLLYRDYQS--DHKFTLTTYTCNGVAITSTGTKKGEL--FLADV
MVKGPGLYSDIGKKARDLLYRDYVS--DHKFTVTTYSCTGVAITASGLKKGEL--FLADV
MVKGPGLYTEIGKKARDLLYRDYQG--DQKFSVTTYSCTGVAITTTGTNKGSL--FLGDV
MVVAVGLYTDIGKKTRDLLYKDYNT--HQKFCLTTSSCNGVAITAAGTRKNES--IFGEL
-MGGPGLYSGIGKKAKDLLYRDYQT--DHKFTLTTYTCNGPAITATSTKKADL--TVGEI
AVVRPYADLGKSARDVFTKGYGFG-LIKLDLKTKSENGLEFTSSGSANTETTKVTGSLEI
--AVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSGNGLEFTSSGSANTETTKVTGSL
-MAVPPTYADLGKSARDVFTKGYGFG-LIKLDLKTKSGNGLEFTSSGSANTETTKVNGSL
--AVPPSYADLGKSARDIFNKGYGFG-LVKLDVKTKSCTGVEFTTSGTSNTDSGKVNGSL
--MAPPSYSDLGKQARDIFSKGYNFG-LWKLDLKTKTCSGIEFNTAGHSNQESGKVFGSL
--MAVPAFSDIAKSANDLLNKDFYHLAAGTIEVKSNTCNNVAFKVTGKSTHDK-VTSGAL

Windows W1, W2, W3 along the query sequence:
MYSFPNSFRFGWSQAGFQCEMSTPGSEDPNTDWYKWVHDPENMAAGLCSGDLPENGPGYWGNYKTFHDNAQKMCLKIARLNVEWSRIFPNP...
P(B|W1), P(F|W1); P(B|W2), P(F|W2); P(B|W3), P(F|W3)
[Figure: state diagram running from Begin to End through cysteine free states and cysteine bonding states; the most probable path through the states is selected]
Prediction of the bonding and non-bonding states of all the cysteines of the sequence

The hybrid system
Accuracy per cysteine: 88%; per protein: 84%
[Figure: bar chart of correctly predicted proteins (%) versus number of cysteines per protein (1 to 35), comparing the NN-based predictor and the HNN-based predictor]
Number of proteins for 1 to 35 cysteines per protein: 187, 207, 106, 144, 71, 80, 35, 55, 18, 16, 4, 16, 0, 7, 1, 8, 2, 4, 0, 0, 0, 1, 1, 3, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1
Protein Science, in press

www.prion.biocomp.unibo.it/cyspred.html
Input:
VGDKLIPLKITYDYYVCNNHMDTDTSYERWPALG
TYRPLNGRDCVMNNHKLAASDRWECDQREPLYTC
MCNKDLPTKAAGPLMNTRPILNLSREEWLLPLLT
HMNVVAGLCKLP
Output: the same sequence, with each cysteine marked either as a disulfide-bonding cysteine or as a free cysteine

CAN THE PREDICTORS BE USED TO DISCOVER NEW PROTEINS?

Escherichia coli K12, complete genome
Completed: Oct 13, 1998.
Total bases: 4,639,221 bp
NCBI (www.ncbi.nlm.nih.gov): protein coding genes: 4,289; structural RNAs: 115
EcoGene/EcoProt (bmb.med.miami.edu/EcoGene): protein coding genes: 4,173; structural RNAs: 120

EcoGene/SwissProt functional annotation
Keywords of SwissProt entries (if they exist) are extracted:
• 2160 ANNOTATED PROTEINS (52%): 421 inner membrane proteins, 35 outer membrane proteins, 1704 globular proteins
• 760 PARTIALLY ANNOTATED PROTEINS (18%), i.e. proteins annotated as “Hypothetical proteins” together with other functional annotations: 352 inner membrane proteins, 18 outer membrane proteins, 390 globular proteins
• 1253 NON-ANNOTATED PROTEINS (30%): 137 proteins have no SwissProt entry, 1116 proteins have no functional annotation in SwissProt

Outer membrane proteins (all-β transmembrane proteins)
Inner membrane proteins (all-α transmembrane proteins)

PROTEOME HUNTER
[Figure: decision scheme combining the signal peptide, all-α TM and all-β TM predictors to assign each protein as globular, all-α TM or all-β TM]

Predicting globular, inner and outer membrane proteins in genomes of Gram-negative bacteria with Hunter

| Organism | Outer membrane | New* | Inner membrane | New* | Globular | New* | Total | New* |
| Escherichia coli K12 | 65 (1.6%) | 18 | 907 (21.7%) | 136 | 3201 (76.7%) | 1099 | 4173 | 1253 |
| Escherichia coli O157:H7 | 78 (1.5%) | 10 | 1034 (19.3%) | 327 | 4249 (79.2%) | 1564 | 5361 | 1901 |
| Chlamydia pneumoniae CWL029 | 12 (1.1%) | 2 | 290 (27.6%) | 181 | 750 (71.3%) | 236 | 1052 | 419 |
| Salmonella typhimurium LT2 | 70 (1.6%) | 0 | 1002 (22.5%) | 2 | 3379 (75.9%) | 21 | 4451 | 23 |
| Neisseria meningitidis MC58 | 34 (1.7%) | 6 | 372 (18.4%) | 176 | 1619 (80.0%) | 662 | 2025 | 844 |
| Helicobacter pylori 26695 | 36 (2.3%) | 10 | 352 (22.5%) | 141 | 1178 (75.2%) | 445 | 1566 | 596 |
| Haemophilus influenzae Rd | 23 (1.3%) | 5 | 348 (20.4%) | 121 | 1338 (78.3%) | 430 | 1709 | 556 |
| Thermotoga maritima | 18 (1.0%) | 11 | 370 (20.0%) | 203 | 1458 (79.0%) | 559 | 1846 | 773 |
| Pseudomonas aeruginosa | 131 (2.4%) | 62 | 1292 (23.2%) | 616 | 4142 (74.4%) | 1867 | 5565 | 2545 |

* The number of new proteins predicted in the class with Hunter, out of the non-annotated region.

http://www.biocomp.unibo.it
Welcome to the CIRB Biocomputing Group home page
This is the Biocomputing unit of the CIRB (Centro Interdipartimentale di Ricerche Biotecnologiche).
Technology provider for the DRUG consortium of the NOTSOMAD TTN initiative.

BIOCOMPUTING GROUP
Group leader: Rita Casadio
Group members: Piero Fariselli, Mario Compiani, Pier Luigi Martelli, Emidio Capriotti, Ivan Rossi, Gianluca Tasco

Collaborations
Italy
• L. Masotti, Biochemistry, Bologna
• M. Rossi, IBPE/CNR, Napoli
• G. Mita, IIGB/CNR, Napoli
• G. Irace, Biochemistry, Napoli
• D. Boraschi, CNR, Pisa
• P. Arrigo, ICE/CNR, Genova
• P. Mariani, Physics, Ancona
• G. Campadelli-Fiume, Pathology, Bologna
• S. Prosperi, Veterinary, Bologna
• F. Bernardi, Chemistry, Bologna
• S. Ciurli, Agricultural Chemistry, Bologna
• C. Bergamini, Biochemistry, Ferrara
Abroad
• B. Rost, Columbia University, New York
• A. Valencia, Protein Design Group, Cantoblanco, Madrid
• P. Baldi, Genomics and Bioinformatics, Irvine, California
• A. Krogh, University of Copenhagen, Copenhagen
• N. Ben Tal, Israel Institute of Technology, Tel Aviv

The cross validation procedure
[Figure: the protein set is split into a training set and a testing set]

Evaluation of the performance
$Q_2 = \frac{\text{correct predictions}}{\text{total predictions}} = \frac{p+n}{N}$
$Q(x) = \frac{\text{correct predictions in class } x}{\text{total observations in class } x} = \frac{p}{p+u}$
$P(x) = \frac{\text{correct predictions in class } x}{\text{total predictions in class } x} = \frac{p}{p+o}$
Correlation index: $C = \frac{p \cdot n - o \cdot u}{[(p+o)(p+u)(n+o)(n+u)]^{1/2}}$

Legend:
| | Observed x | Observed non-x |
| Predicted x | p | o |
| Predicted non-x | u | n |

Evaluation of the efficiency of contact map predictions
1) Accuracy: $A = N_{cp}^{*} / N_{cp}$, where $N_{cp}^{*}$ and $N_{cp}$ are the number of correctly assigned contacts and the total number of predicted contacts, respectively.
2) Improvement over a random predictor: $R = A / (N_c / N_p)$, where $N_c / N_p$ is the accuracy of a random predictor; $N_c$ is the number of real contacts in the protein of length $L_p$, and $N_p$ is the number of all possible contacts.
3) Difference in the distribution of the inter-residue distances in the 3D structure for predicted pairs compared with all pair distances in the structure (Pazos et al., 1997): $X_d = \sum_{i=1}^{n} \frac{P_{ic} - P_{ia}}{n \, d_i}$, where $n$ is the number of bins of the distance distribution (15 equally distributed bins from 4 to 60 Å cluster all the possible distances of residue pairs observed in the protein structure); $d_i$ is the upper limit (normalised to 60 Å) of each bin, e.g. 8 Å for the 4 to 8 Å bin; $P_{ic}$ and $P_{ia}$ are the percentage of predicted contact pairs (with distance between $d_{i-1}$ and $d_i$) and that of all possible pairs, respectively.

The cross validation procedure
[Figure: the protein set is split into testing set 1 and training set 1]

BASIC PRINCIPLES OF PROTEIN STRUCTURE
The building blocks of the primary structure: amino acids, protein backbone
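
As a worked illustration of the evaluation measures defined in the slides above (Q2, Q(x), P(x), the correlation index C, and the contact-map accuracy A with its improvement R over a random predictor), a minimal Python sketch follows. The function names, the representation of contacts as sets of residue-index pairs, and the choice of Np = Lp(Lp-1)/2 for "all possible contacts" are assumptions made for this example; they are not taken from the group's software.

```python
import math


def binary_scores(p, n, o, u):
    """Two-class scores from the legend counts.

    p: correct predictions in class x      n: correct predictions in non-x
    o: over-predictions (predicted x, observed non-x)
    u: under-predictions (predicted non-x, observed x)
    """
    total = p + n + o + u
    q2 = (p + n) / total                        # overall accuracy
    q_x = p / (p + u) if (p + u) else 0.0       # coverage of observed class x
    p_x = p / (p + o) if (p + o) else 0.0       # reliability of predicted class x
    denom = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    c = (p * n - o * u) / denom if denom else 0.0
    return {"Q2": q2, "Q(x)": q_x, "P(x)": p_x, "C": c}


def contact_map_scores(predicted, real, length):
    """Accuracy A and improvement R over a random predictor.

    predicted, real: sets of (i, j) residue-index pairs with i < j.
    length: protein length Lp.
    Assumption: Np counts every residue pair, Lp * (Lp - 1) / 2.
    """
    if not predicted or not real:
        raise ValueError("both contact sets must be non-empty")
    accuracy = len(predicted & real) / len(predicted)   # A = Ncp* / Ncp
    n_possible = length * (length - 1) // 2              # Np (assumed)
    random_accuracy = len(real) / n_possible             # Nc / Np
    return accuracy, accuracy / random_accuracy          # A, R


if __name__ == "__main__":
    # Toy counts, invented for the example.
    print(binary_scores(p=40, n=50, o=5, u=5))
    # Toy 5-residue protein with three real contacts and two predicted ones.
    real = {(0, 2), (0, 3), (1, 4)}
    predicted = {(0, 2), (1, 3)}
    print(contact_map_scores(predicted, real, length=5))
```

In practice a minimum sequence separation is often imposed before counting possible pairs; that would change Np and therefore R, so the value of R from this sketch is only comparable across predictors evaluated with the same convention.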