Bioinformatics
genomic databases + algorithm*
•Sequence alignment
•Gene finding
•Genome reconstruction from fragments
•Protein structure alignment
•Protein structure prediction
•Gene expression prediction
•Protein-protein interaction prediction
* algorithm = a procedure that solves a given problem
through a finite number of logical or computational steps.
The first DNA-based genome to be sequenced in its
entirety was that of bacteriophage Φ-X174 (5,386 bp),
sequenced by Frederick Sanger in 1977.
There are several things to notice in this plot. First, the genome is circular.
The densities of the four nucleotides are plotted in the four outermost circles.
This density is not evenly distributed; although all four of the scales range
from 0% (min., no colour) to 40% (max colour intensity), it can be easily
seen that the sequence is dominated by T's (red circle), and that there are
relatively few G's (outermost turquoise circle) and C's (pink circle), and a
few A-rich regions (green 2nd circle).
There are many genes which overlap (the genes are indicated in the
"annotation circle", which is the fifth circle from the outside - with the blue
bands representing genes in the forward direction).
http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/P/PhiX.html
GC Skew = (G - C)/(G + C)
AT Skew = (A - T)/(A + T)
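The two skew formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the method used to draw the plot: the function names, the window and step parameters, and the example sequence are all assumptions made for the example.

```python
from collections import Counter


def skew(seq, x, y):
    """Return (x - y) / (x + y) for the counts of bases x and y in seq."""
    counts = Counter(seq.upper())
    total = counts[x] + counts[y]
    if total == 0:
        return 0.0  # assumption: report 0 when neither base occurs
    return (counts[x] - counts[y]) / total


def gc_skew(seq):
    """GC Skew = (G - C) / (G + C)."""
    return skew(seq, "G", "C")


def at_skew(seq):
    """AT Skew = (A - T) / (A + T)."""
    return skew(seq, "A", "T")


def windowed_gc_skew(seq, window, step):
    """GC skew in successive windows along a sequence treated as linear."""
    return [gc_skew(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, step)]


example = "GGGCATTTTA"  # made-up sequence, not phage data
print(gc_skew(example))  # (3 - 1) / (3 + 1) = 0.5
print(at_skew(example))  # (2 - 4) / (2 + 4)
```

For a circular genome, the windows would also have to wrap around the origin; that step is left out here to keep the sketch short.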
http://www.ncbi.nlm.nih.gov/genbank/genbankstats.html
Circular maps of the chromosome and plasmids of enteropathogenic E. coli
(from Iguchi A et al., J. Bacteriol. 2009)
CDS = coding sequence: the region of nucleotides that corresponds to
the sequence of amino acids in the predicted protein
PP = prophage: a phage (viral) genome inserted and integrated into the
circular bacterial DNA chromosome
IE = integrative element
Circular maps of the chromosome and plasmids of EPEC strain E2348/69. (A) EPEC strain E2348/69 chromosome. From the outside
in, the first circle shows the locations of PPs and IEs (purple, lambda-like PPs; light blue, other PPs; green, IEs and the LEE element),
the second circle shows the nucleotide sequence positions (in Mbp), the third and fourth circles show CDSs transcribed clockwise and
anticlockwise, respectively (gray, conserved in all eight other sequenced E. coli strains; red, conserved only in the B2 phylogroup;
yellow, variable distribution; blue, E2348/69 specific), the fifth circle shows the tRNA genes (red), the sixth circle shows the rRNA
operons (blue), the seventh circle shows the G+C content, and the eighth circle shows the GC skew. (B) EPEC strain E2348/69
plasmids. The boxes in the outer and inner circles represent CDSs transcribed clockwise and anticlockwise, respectively. Pseudogenes
are indicated by black boxes, and other CDSs are indicated by the colors described above for panel A.
Bioinformatics
genomes, transcriptomes, proteomes
A genome is the entirety of an organism's hereditary information. It is encoded
either in DNA or, for some viruses, in RNA.
A transcriptome is the set of all RNA molecules, including mRNA, rRNA, tRNA, and
other non-coding RNA, produced in one cell or in a population of cells.
A proteome is the entire complement of proteins expressed by a genome, cell,
tissue or organism. More specifically, it is the set of proteins expressed at a
given time under defined conditions.
Databases in Bioinformatics
Type of data
•nucleotide sequences
•protein sequences
•protein sequence patterns or motifs
•macromolecular 3D structure
•gene expression data
•metabolic pathways
Data entry and quality control
•Scientists (teams) deposit data directly
•Appointed curators add and update data
•Are erroneous data removed or marked?
•Type and degree of error checking
•Consistency, redundancy, conflicts, updates
Technical design
•Flat-files
•Relational database (SQL)
•Object-oriented database
•Exchange/publication technologies (FTP, HTML, CORBA, XML,...)
Maintainer status
•Large, public institution funded by government (EMBL, NCBI)
•Quasi-academic institute (Swiss Institute of Bioinformatics, TIGR)
•Academic group or scientist
•Commercial company
Primary or derived data
•Primary databases: experimental results directly into database
•Secondary databases: results of analysis of primary databases
•Aggregate of many databases: links to other data items, combination of data, consolidation of data
Availability
•Publicly available, no restrictions
•Available, but with copyright
•Accessible, but not downloadable
•Academic, but not freely available
•Proprietary, commercial; possibly free for academics
Structural genomics
[Figure: NMR / DNA data (shown as a bit stream) → Algorithm → per-residue table → Protein Structure]
Residue
THR      0.0    147.7    172.9
THR    107.2   -125.3    187.4
CYS    123.4     63.6    103.7
PRO     60.3     83.9   -116.7
secondary structure prediction
CHOU & FASMAN
Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
# residues in window: 6
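The slide only states the window size (6 residues). As a rough sketch of how such a window can be scanned along a sequence, the code below flags windows that contain at least 4 helix-forming residues, which is the helix-nucleation rule as it is commonly described for Chou & Fasman. The function name, the threshold of 4, the reduced set of helix formers (A, E, L, M) and the test sequence are assumptions made for illustration, not values taken from the slides; a real implementation uses the full propensity tables of the paper.

```python
def nucleation_windows(seq, formers, window=6, min_formers=4):
    """Return the start index of every window of `window` residues that
    contains at least `min_formers` residues belonging to `formers`."""
    hits = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        if sum(res in formers for res in segment) >= min_formers:
            hits.append(start)
    return hits


# Assumption: a reduced set of strong helix formers, for illustration only.
HELIX_FORMERS = frozenset("AELM")

sequence = "GPAELMAGGP"  # made-up peptide
print(nucleation_windows(sequence, HELIX_FORMERS))  # [0, 1, 2, 3]
```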
PSIPRED is a simple and reliable secondary structure prediction
method, incorporating two feed-forward neural networks which
perform an analysis on output obtained from PSI-BLAST
(Position-Specific Iterated BLAST).
Sequence alignment
Isolated sequences (gene or protein) carry only a limited amount of
information about their own function.
The same sequences can reveal the regions critical for their structure
and function once they are aligned with sequences that show high
levels of similarity.
“One or two homologous sequences whisper; a full multiple alignment
shouts out loud” (Arthur Lesk)
Genes and proteins tend to evolve and diverge from common ancestors, progressively
accumulating mutations that usually spare functional regions.
To align two sequences one uses algorithms, that is, computational methods that
apply a finite number of rules and operations to obtain a result.
An algorithm must then be translated into a program, using a suitable programming
language, for example C or Perl.
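As an illustration of an alignment algorithm turned into a program, here is a minimal sketch of global alignment scoring with the Needleman-Wunsch recurrence. Python is used here for brevity. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for the example; the slides do not prescribe them.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of sequences a and b.

    Needleman-Wunsch dynamic programming with a linear gap penalty,
    keeping only two rows of the score matrix in memory.
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a against empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


print(global_alignment_score("ACGT", "AGT"))  # 3 matches + 1 gap = 2
```

Recovering the alignment itself, and not only its score, requires keeping the full matrix and tracing back from the last cell.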
tertiary structure prediction
Amino Acid Code   Meaning
A                 Alanine
B                 Aspartic acid or Asparagine
C                 Cysteine
D                 Aspartic acid
E                 Glutamic acid
F                 Phenylalanine
G                 Glycine
H                 Histidine
I                 Isoleucine
K                 Lysine
L                 Leucine
M                 Methionine
N                 Asparagine
O                 Pyrrolysine
P                 Proline
Q                 Glutamine
R                 Arginine
S                 Serine
T                 Threonine
U                 Selenocysteine
V                 Valine
W                 Tryptophan
Y                 Tyrosine
Z                 Glutamic acid or Glutamine
X                 any
*                 translation stop
-                 gap of indeterminate length
terminates with >
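The code table can be used to check a protein sequence before feeding it to an alignment program. The sketch below is a hypothetical helper (the names are not from any library) that reports every character not listed in the table.

```python
# One-letter codes listed in the table, plus the special symbols X, * and -.
AMINO_ACID_CODES = set("ABCDEFGHIKLMNOPQRSTUVWYZ") | set("X*-")


def invalid_symbols(sequence):
    """Return, in sorted order, the characters of `sequence` that do not
    appear in the code table (comparison is case-insensitive)."""
    return sorted(set(sequence.upper()) - AMINO_ACID_CODES)


print(invalid_symbols("MSSPQAPEDGQ"))  # []
print(invalid_symbols("MKJ1*"))        # ['1', 'J']
```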
dotplot
The dotplot is a simple picture that gives an overview of the similarities between two sequences. Less obvious is
its close relationship to alignments.
The dotplot is a table or matrix. The rows correspond to the residues of one sequence and the columns to the
residues of the other sequence. In its simplest form, the positions in the dotplot are left blank if the residues are
different, and filled if they match. Stretches of similar residues show up as diagonals in the upper left-lower
right (Northwest-Southeast) direction.
Dotplot showing identities between a repetitive
sequence (ABRACADABRACADABRA) and itself.
The repeats appear on several subsidiary
diagonals parallel to the main diagonal.
Letters corresponding to isolated matches are shown in non-bold
type. The longest matching regions, shown in boldface, are the
first and last names DOROTHY and HODGKIN. Shorter matching
regions, such as the OTH of dorOTHy and crowfoOTHodgkin, or
the RO of doROthy and cROwfoot, are noise.
From Introduction to Bioinformatics
by Arthur M. Lesk
Dotplot showing identities between the
palindromic sequence MAX I STAY
AWAY AT SIX AM and itself. The
palindrome reveals itself as a stretch of
matches perpendicular to the main
diagonal.
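The simplest form of the dotplot described above (a cell is filled when the two residues are identical) can be sketched as follows. The function names and the text rendering are assumptions made for the example; real dotplot programs usually add window filtering to suppress isolated matches.

```python
def dotplot(rows_seq, cols_seq):
    """Return a matrix of booleans: cell [i][j] is True when residue i of
    rows_seq is identical to residue j of cols_seq."""
    return [[r == c for c in cols_seq] for r in rows_seq]


def render(rows_seq, cols_seq):
    """Draw the dotplot as text, with '*' for a match and ' ' otherwise."""
    lines = ["  " + cols_seq]
    for residue, row in zip(rows_seq, dotplot(rows_seq, cols_seq)):
        lines.append(residue + " " + "".join("*" if hit else " " for hit in row))
    return "\n".join(lines)


seq = "ABRACADABRACADABRA"
print(render(seq, seq))
```

Plotting the repetitive sequence against itself fills the main diagonal and, because the sequence repeats with period 7, the parallel diagonals offset by 7 and 14 positions.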
The BLOSUM62 matrix does an excellent job of detecting similarities between
distant sequences, and it is the default matrix in most recent alignment
applications, such as BLAST.
Mutation probability matrix for the evolutionary distance of 250 PAMs
SUMMARY TABLE OF THE DIFFERENCES BETWEEN FASTA AND BLAST

                        FASTA                               BLAST
HOMOLOGY                Global; local (LFASTA)              Local
USE OF SCORING MATRIX   During the 2nd phase (extension)    Scanning phase; extension phase
K-TUPLE                 1-2 aa / 4-6 nt                     3 aa / 11-12 nt
GAPS                    Allowed in the 4th phase            Never allowed
SPEED                   1/2 to 1/5 that of BLAST            2 to 5 times faster than FASTA

SPECIFICITY
Better for comparing protein sequences
Better for comparing nucleotide sequences
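Both programs start from short exact word matches (the K-TUPLE row). A minimal sketch of that first step, assuming a simple lookup table of all words of length k in one sequence, is shown below. The function names and the toy sequences are assumptions; neither FASTA nor BLAST is implemented exactly like this.

```python
from collections import defaultdict


def index_ktuples(seq, k):
    """Map every word of length k in seq to the list of its start positions."""
    table = defaultdict(list)
    for i in range(len(seq) - k + 1):
        table[seq[i:i + k]].append(i)
    return table


def shared_ktuples(query, subject, k):
    """Return (query_pos, subject_pos) pairs where the two sequences share
    an identical word of length k."""
    table = index_ktuples(subject, k)
    hits = []
    for i in range(len(query) - k + 1):
        for j in table.get(query[i:i + k], []):
            hits.append((i, j))
    return hits


print(shared_ktuples("ACGTAC", "TTACGT", 3))  # [(0, 2), (1, 3), (3, 1)]
```

Hits with the same difference between the two positions lie on the same diagonal of a dotplot; the later phases of both programs extend or join such hits. A larger k gives fewer hits and a faster but less sensitive search.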
W. Pearson, author of FASTA, recommends using, in this order:
1. BLAST
2. FASTA with ktup=2
3. FASTA with ktup=1
4. Programs based on the dynamic programming algorithm (e.g. SSEARCH)
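SSEARCH performs a rigorous Smith-Waterman local alignment. The sketch below shows the core of that dynamic programming recurrence, computing only the best local score. The scoring values (match +2, mismatch -1, gap -1), the linear gap penalty and the function name are assumptions for illustration; real programs use a substitution matrix and affine gap penalties.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b (Smith-Waterman recurrence).

    Scores never drop below zero, so an alignment can start anywhere;
    the result is the maximum over all cells of the matrix.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0,
                        prev[j - 1] + pair,   # extend along the diagonal
                        prev[j] + gap,        # gap in b
                        curr[j - 1] + gap)    # gap in a
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


print(local_alignment_score("XXXACGTYYY", "ZZACGTZZ"))  # 4 matches * 2 = 8
```

Every cell of the matrix is evaluated, which is why this approach is the slowest and most sensitive option in the list.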
Protein folding
ab initio calculations of protein structure
Fragment assembly method:
Dividing the sequence into fragments
MSSPQAPEDGQGCGDRGDPPGDLRSVLVTTV
ROSETTA
Fragments of 9 aa
Selects the structures of the 25 closest sequences
Optimization and assembly
(knowledge-based potential)
Rosetta Fragment Libraries
•25-200 fragments for each 3- and 9-residue sequence window
•Selected from a database of known structures
  resolution better than 2.5 Å
  < 50% sequence identity
•Ranked by sequence similarity and by similarity of predicted and known
secondary structure
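The 3- and 9-residue sequence windows can be enumerated with a simple sliding window. The sketch below covers only this first step, using the example sequence from the fragment assembly slide; the function name is hypothetical, and fragment selection and ranking against a structure database are not shown.

```python
def sequence_windows(seq, size):
    """Return (start, subsequence) for every window of `size` residues."""
    return [(i, seq[i:i + size]) for i in range(len(seq) - size + 1)]


target = "MSSPQAPEDGQGCGDRGDPPGDLRSVLVTTV"  # 31 residues

nine_mers = sequence_windows(target, 9)
three_mers = sequence_windows(target, 3)

print(len(nine_mers), nine_mers[0], nine_mers[-1])
# 23 (0, 'MSSPQAPED') (22, 'LRSVLVTTV')
print(len(three_mers))  # 29
```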
PYMOL
Go to www.sienabiografix.it/edu, under Lezioni a.a. 2012-2013:
download and install Python 2.7
download and install PyMOL 1.4
launch with
C:\Python27\PyMOL\pymol.cmd
help by e-mail from Edoardo Morandi: [email protected]