Bioinformatics
genomic databases + algorithms*
• Sequence alignment
• Gene finding
• Genome reconstruction from fragments
• Alignment of protein structures
• Protein structure prediction
• Gene expression prediction
• Prediction of protein-protein interactions
* algorithm = a procedure that solves a given problem through a finite number of logical or computational steps.

The first DNA-based genome to be sequenced in its entirety was that of bacteriophage Φ-X174 (5,386 bp), sequenced by Frederick Sanger in 1977.

There are several things to notice in this plot. First, the genome is circular. The density of each of the four nucleotides is plotted in the four outermost circles. This density is not evenly distributed: although all four scales range from 0% (min., no colour) to 40% (max. colour intensity), it is easy to see that the sequence is dominated by T's (red circle), that there are relatively few G's (outermost turquoise circle) and C's (pink circle), and that there are a few A-rich regions (green second circle). Many genes overlap (the genes are indicated in the "annotation circle", the fifth circle from the outside, with the blue bands representing genes in the forward direction).
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/P/PhiX.html

GC Skew = (G - C)/(G + C)
AT Skew = (A - T)/(A + T)

http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html

Circular maps of the chromosome and plasmids of enteropathogenic E. coli (from Iguchi A et al., J. Bacteriol. 2009)
• CDS = CoDing Sequence: region of nucleotides that corresponds to the sequence of amino acids in the predicted protein
• PP = prophage: a phage (viral) genome inserted and integrated into the circular bacterial DNA chromosome
• IE = integrative elements

Circular maps of the chromosome and plasmids of EPEC strain E2348/69.
(A) EPEC strain E2348/69 chromosome. From the outside in: the first circle shows the locations of PPs and IEs (purple, lambda-like PPs; light blue, other PPs; green, IEs and the LEE element); the second circle shows the nucleotide sequence positions (in Mbp); the third and fourth circles show CDSs transcribed clockwise and anticlockwise, respectively (gray, conserved in all eight other sequenced E. coli strains; red, conserved only in the B2 phylogroup; yellow, variable distribution; blue, E2348/69 specific); the fifth circle shows the tRNA genes (red); the sixth circle shows the rRNA operons (blue); the seventh circle shows the G+C content; and the eighth circle shows the GC skew.
(B) EPEC strain E2348/69 plasmids. The boxes in the outer and inner circles represent CDSs transcribed clockwise and anticlockwise, respectively. Pseudogenes are indicated by black boxes, and other CDSs are indicated by the colors described above for panel A.

Bioinformatics
genomes, transcriptomes, proteomes
• A genome is the entirety of an organism's hereditary information. It is encoded either in DNA or, for some viruses, in RNA.
• A transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA, and other non-coding RNA, produced in one cell or in a population of cells.
• A proteome is the entire complement of proteins expressed by a genome, cell, tissue or organism. More specifically, it is the set of proteins expressed at a given time under defined conditions.

Databases in Bioinformatics
Type of data
• nucleotide sequences
• protein sequences
• protein sequence patterns or motifs
• macromolecular 3D structure
• gene expression data
• metabolic pathways
Data entry and quality control
• Scientists (teams) deposit data directly
• Appointed curators add and update data
• Are erroneous data removed or marked?
• Type and degree of error checking
• Consistency, redundancy, conflicts, updates
Technical design
• Flat-files
• Relational database (SQL)
• Object-oriented database
• Exchange/publication technologies (FTP, HTML, CORBA, XML,...)
Maintainer status
• Large, public institution funded by government (EMBL, NCBI)
• Quasi-academic institute (Swiss Institute of Bioinformatics, TIGR)
• Academic group or scientist
• Commercial company
Primary or derived data
• Primary databases: experimental results directly into database
• Secondary databases: results of analysis of primary databases
• Aggregate of many databases: links to other data items, combination of data, consolidation of data
Availability
• Publicly available, no restrictions
• Available, but with copyright
• Accessible, but not downloadable
• Academic, but not freely available
• Proprietary, commercial; possibly free for academics

Structural genomics
[Figure: DNA sequence data and NMR data are passed to an algorithm that yields a protein structure, shown as a per-residue table (THR, THR, CYS, PRO) with three numeric values for each residue.]

Secondary structure prediction
CHOU & FASMAN
Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
Number of residues in window: 6

PSIPRED is a simple and reliable secondary structure prediction method, incorporating two feed-forward neural networks which analyse the output obtained from PSI-BLAST (Position Specific Iterated BLAST).

Sequence alignment
Isolated sequences (gene or protein) carry only a limited amount of information about their own function. The same sequences can reveal the regions that are critical for their structure and function once they are aligned with sequences that show high levels of similarity.
“One or two homologous sequences whisper; a multiple alignment speaks out loud” (Arthur Lesk)

Sequence alignment
Genes and proteins tend to evolve and diverge from common ancestors, progressively accumulating mutations that usually spare the functional regions.
To align two sequences we use algorithms, that is, computational methods that apply a finite number of rules and operations to obtain a result. An algorithm must then be translated into a program, using a suitable programming language, for example C or Perl.
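As an illustration of what such an algorithm can look like once translated into a program, here is a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch approach), written in Python. The function name global_alignment and the scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for the example, not values taken from the lecture; real tools use substitution matrices and more elaborate gap penalties.

```python
def global_alignment(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b).

    The scoring scheme is a simple illustrative assumption:
    +1 for a match, -1 for a mismatch, -1 for each gap position.
    """
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing residue a[i-1] with residue b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill the table: each cell is the best of three possible moves.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # a[i-1] against a gap
                score[i][j - 1] + gap,            # b[j-1] against a gap
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    s, top, bottom = global_alignment("ACGT", "ACT")
    print(s)
    print(top)
    print(bottom)
```

The number of steps is finite and fixed by the two sequence lengths (the table has (n+1) x (m+1) cells), which is what makes this a procedure in the sense of the definition of algorithm given earlier.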
Tertiary structure prediction

Amino Acid Code
Code  Meaning
A     Alanine
B     Aspartic acid or Asparagine
C     Cysteine
D     Aspartic acid
E     Glutamic acid
F     Phenylalanine
G     Glycine
H     Histidine
I     Isoleucine
K     Lysine
L     Leucine
M     Methionine
N     Asparagine
O     Pyrrolysine
P     Proline
Q     Glutamine
R     Arginine
S     Serine
T     Threonine
U     Selenocysteine
V     Valine
W     Tryptophan
Y     Tyrosine
Z     Glutamic acid or Glutamine
X     any
*     translation stop
-     gap of indeterminate length

Dotplot
The dotplot is a simple picture that gives an overview of the similarities between two sequences. Less obvious is its close relationship to alignments. The dotplot is a table or matrix: the rows correspond to the residues of one sequence and the columns to the residues of the other sequence. In its simplest form, a position in the dotplot is left blank if the two residues are different and filled if they match. Stretches of similar residues show up as diagonals running from upper left to lower right (Northwest-Southeast).

Dotplot showing identities between a repetitive sequence (ABRACADABRACADABRA) and itself. The repeats appear on several subsidiary diagonals parallel to the main diagonal.

Letters corresponding to isolated matches are shown in non-bold type. The longest matching regions, shown in boldface, are the first and last names DOROTHY and HODGKIN. Shorter matching regions, such as the OTH of dorOTHy and crowfoOTHodgkin, or the RO of doROthy and cROwfoot, are noise.
From Introduction to Bioinformatics by Arthur M. Lesk

Dotplot showing identities between the palindromic sequence MAX I STAY AWAY AT SIX AM and itself. The palindrome reveals itself as a stretch of matches perpendicular to the main diagonal.
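The simplest form of the dotplot described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names dotplot, render and diagonal_runs, and the min_len threshold used to separate repeats from isolated matches, are assumptions made for the example.

```python
def dotplot(seq_rows, seq_cols):
    """Return a matrix of booleans: True where the two residues are identical."""
    return [[r == c for c in seq_cols] for r in seq_rows]


def render(seq_rows, seq_cols, filled="*", blank=" "):
    """Draw the dotplot as text, with seq_cols across the top and seq_rows down the side."""
    matrix = dotplot(seq_rows, seq_cols)
    lines = ["  " + seq_cols]
    for residue, row in zip(seq_rows, matrix):
        lines.append(residue + " " + "".join(filled if hit else blank for hit in row))
    return "\n".join(lines)


def diagonal_runs(seq_rows, seq_cols, min_len=3):
    """List (row_start, col_start, length) for every Northwest-Southeast run
    of consecutive matches that is at least min_len residues long."""
    matrix = dotplot(seq_rows, seq_cols)
    n_rows, n_cols = len(seq_rows), len(seq_cols)
    runs = []
    for i in range(n_rows):
        for j in range(n_cols):
            # A run starts here only if the cell up and to the left is not a match.
            starts_run = matrix[i][j] and (i == 0 or j == 0 or not matrix[i - 1][j - 1])
            if not starts_run:
                continue
            length = 0
            while (i + length < n_rows and j + length < n_cols
                   and matrix[i + length][j + length]):
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


if __name__ == "__main__":
    seq = "ABRACADABRACADABRA"
    print(render(seq, seq))
    print(diagonal_runs(seq, seq))
```

For ABRACADABRACADABRA compared with itself, the runs found are the main diagonal plus the subsidiary diagonals offset by 7 and 14 positions on either side, which correspond to the repeats of ABRACAD; the isolated matches between single A's fall below the length threshold and are dropped.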
The BLOSUM62 matrix does an excellent job of detecting similarities in distant sequences, and it is the matrix used by default in most recent alignment applications such as BLAST.

Mutation probability matrix for the evolutionary distance of 250 PAMs

SUMMARY TABLE OF THE DIFFERENCES BETWEEN FASTA AND BLAST
                            FASTA                                      BLAST
HOMOLOGY                    Global; local (LFASTA)                     Local
USE OF THE SCORING MATRIX   During the 2nd phase (extension)           Scanning phase; extension phase
K-TUPLE                     1-2 aa / 4-6 nt                            3 aa / 11-12 nt
GAPS                        Allowed in the 4th phase                   Never allowed
SPEED                       From 1/2 to 1/5 that of BLAST              From 2 to 5 times that of FASTA
SPECIFICITY                 Better for comparing protein sequences     Better for comparing nucleotide sequences

W. Pearson, the author of FASTA, recommends using, in this order:
1. BLAST
2. FASTA with ktup=2
3. FASTA with ktup=1
4. Programs based on the dynamic programming algorithm (e.g. SSEARCH)

Tertiary structure prediction

Protein folding
ab initio calculations of protein structure

Fragment assembly method (ROSETTA)
• Split the sequence into fragments: MSSPQAPEDGQGCGDRGDPPGDLRSVLVTTV
• Fragments of 9 aa
• Select the structures of the 25 closest sequences
• Optimization and assembly (knowledge-based potential)

Rosetta Fragment Libraries
• 25-200 fragments for each 3 and 9 residue sequence window
• Selected from database of known structures: > 2.5Å resolution, < 50% sequence identity
• Ranked by sequence similarity and similarity of predicted and known secondary structure

PYMOL
• go to www.sienabiografix.it/edu, under Lezioni a.a. 2012-2013:
• download and install python 2.7
• download and install pymol 1.4
• launch with C:\Python27\PyMOL\pymol.cmd
• help by e-mail from Edoardo Morandi: [email protected]