An Introduction to the Use of
Bayesian Network to Analyze Gene
Expression Data
Cristina Manfredotti
Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co.)
Università degli Studi Milano-Bicocca
[email protected]
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Introduction
• A central goal of molecular biology is to
understand the regulation of protein
synthesis.
• DNA microarray experiments can
measure thousands of gene expression
levels simultaneously.
• An important challenge is to develop
methodologies that are both statistically
sound and computationally tractable.
• Bayesian network learning.
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Biological Background
• DNA
DNA is a double-stranded molecule
Hereditary information is encoded
• Gene
Gene is a segment of DNA
Contain the information required
to make a protein
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Motivations
• Each gene encodes a protein and proteins are
the functional units of life
• Every gene is present in every cell, but only a
fraction of the genes are expressed at any time
• Many diseases result from the interaction
between genes
• Understanding the mechanisms that determine
which genes are expressed, and when they are
expressed, is the key to the development of
new treatments of diseases
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Bayesian Networks
• Prior work— Clustering of expression
data
Groups together genes with similar expression pattern
Disadvantage: does not reveal structural relations
between genes
• Big challenge
Extract meaningful information from the expression data
Discover interactions between genes based on the
measurements
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Bayesian Networks
• A Bayesian Network (BN) is a graphical
representation of a probability distribution
Compact & intuitive representation
Useful for describing processes composed of locally
interacting components
Have a good statistical foundation
Efficient model learning algorithm
Capture causal relationships
Deals with noisy data
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Representing Distributions
• A Bayesian networks is a representation of a
joint probability distribution.
• A Bayesian network has two components.
– G: a directed-acyclic graph structure
– Θ: a set of parameters for conditional
distribution of each variable
• The joint probability distribution of {X1, …, Xn} is
represented by Bayesian Network as follows:
n
P ( X 1 ,..., X n ) = ∏i =1 P( X i | Pa G ( X i )),
– where PaG(Xi) is the set of parents of Xi given
the graph G
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
An Example of a Simple BN
Gene E
- Gene B and Gene D are independent
Gene A
given Gene A.
- Gene B asserts dependency between
Gene A and Gene E.
Gene B
Gene D
- Gene A and Gene C are independent
given Gene B.
P ( A, B, C , D, E )
Gene C
= P( A) P( B | A) P(C | A, B ) P ( D | A, B, C ) P( E | A, B, C , D)
= P( A) P( B | A, E ) P(C | B) P( D | A) P ( E )
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Learning Bayesian Networks
• Given a training set D = {x1, …, xN} of
independent instances of X, find a network B =
<G, Θ> that best matches D.
• The score function for a network is defined as,
P( D | G ) P(G )
S (G : D) = P(G | D) =
P( D)
where
P ( D | G ) = ∫ P( D | G, Θ) P (Θ | G ) dΘ
is the marginal likelihood which averages the
probability of the data over all possible
parameter assignments to G.
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Learning Bayesian Networks
Directed-acyclic graph structure G:
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Learning Bayesian Networks
Directed-acyclic graph structure G:
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
A simple example
• We want to construct a BN of a system
composed of 3 genes (A, B and C) that can
be ON or OFF
• Given the training set D
• Fix a number of iteration M
• Choose (randomly) M structures GJ (binarysquared matrix)
• Learn the Conditional Probability Table
• Choose the graph that has the max score.
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
A simple example
D:
A B C
0 0 0
0 1 1
0 1 1
0 0 1 ………
0 0 1 1 1 0
0 0 1 1 1 0
1 1 0 1 1 1
1 1 0 1 1 1
……… 1 1 1
Cristina Manfredotti
|D| = 13
M=6
D.I.S.Co. Università di Milano - Bicocca
Structures:
G1
Gj:
AA AB AC BA BB BC CA CB CC
0
1
1
0
0
1
0
0
0
0
0
1
1
0
1
0
0
0
0
1
1
0
0
0
0
1
0
0
0
1
1
0
0
0
1
0
0
0
0
1
0
1
1
0
0
0
1
1
0
0
1
0
0
0
G5
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
P(A=0) = 6/13
G 1:
A
B
P(A=1) = 7/13
C
B\A
0
1
0
4/13
0
1
2/13
7/13
Cristina Manfredotti
C\A,B
00
01
10
11
0
1/13
0
0
4/13
1
0
5/13
0
3/13
D.I.S.Co. Università di Milano - Bicocca
G 5:
A
B
A\B,C
00
01
10
11
0
1/13
3/13
0
2/13
1
0
0
4/13
3/13
C
C\B
0
1
0
1/13
4/13
1
3/13
5/13
P(B=0) = 4/13
P(B=1) = 9/13
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
A simple example
D:
A B C
0 0 0
0 1 1
0 1 1
0 0 1
0 0 1
0 0 1
1 1 0
1 1 0
………
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
G 1:
P([0 0 0]|G1)P(G1) =
6/13*4/13*1/13*2/6
A
P([0 1 1]|G1)P(G1) =
6/13*2/13*5/13*2/6
B
C
P([0 0 1]|G1)P(G1) =
6/13*4/13*0*2/6
…
Score = 1/n∑
∑ P(Di|G1)
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
G 5:
P([0 0 0]|G5)P(G5) =
1/13*4/13*1/13*1/6
A
P([0 1 1]|G5)P(G5) =
2/13*9/13*5/13*1/6
P([0 0 1]|G5)P(G5) =
3/13*4/13*3/13*1/6
B
C
…
Score = 1/n∑
∑ P(Di|G5)
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Analyzing Expression Data
• Practical problem — Small data sets
variables — hundreds of or thousands of genes
samples — just tens of microarray experiments
• On the positive side, genetic regulation
networks are sparse!!!
• Characterize and learn features that are
common to most of these networks
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Analyzing Expression Data:
• The first feature — Markov relations
Symmetric relation: Y is in X’s Markov blanket
iff there is either an edge between them, or both
are parents of another variable (Pearl 98).
Biological interpretation: a Markov relation indicates
that the two genes are related in some joint biological
interaction or process
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Analyzing Expression Data:
• The second feature — order relations
Global property: A is an ancestor of B in all the
equivalent Bayesian networks learned
Biological interpretation: an order relation indicates
that the transcription of one gene is a direct cause of
the transcription of another gene
A
B
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Estimating Statistical Confidence in
Features
• To what extent does the data support a given feature?
• effective and relatively simple approach for estimating
confidence: bootstrap method.
• For i = 1, …, m
– Re-sample with replacement N instances from D.
Denote by Di the resulting dataset.
– Apply the learning procedure on Di to induce a
network structure G.
• For each feature f of interest calculate
conf ( f ) = 1
m ∑i =1
– where f(G) is 1 if f is a feature in G, and 0 otherwise.
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
m
f (Gi )
How to collect data:
•
•
•
•
•
•
Gene knock down
Gene knock out
Compound
Tessue microarray
Time course
…
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Where are we going
Cristina Manfredotti
D.I.S.Co. Università di Milano - Bicocca
Scarica

G - Dipartimento di Informatica Sistemistica e Comunicazione