An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data Cristina Manfredotti Dipartimento di Informatica, Sistemistica e Comunicazione (D.I.S.Co.) Università degli Studi Milano-Bicocca [email protected] Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Introduction • A central goal of molecular biology is to understand the regulation of protein synthesis. • DNA microarray experiments can measure thousands of gene expression levels simultaneously. • An important challenge is to develop methodologies that are both statistically sound and computationally tractable. • Bayesian network learning. Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Biological Background • DNA DNA is a double-stranded molecule Hereditary information is encoded • Gene Gene is a segment of DNA Contain the information required to make a protein Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Motivations • Each gene encodes a protein and proteins are the functional units of life • Every gene is present in every cell, but only a fraction of the genes are expressed at any time • Many diseases result from the interaction between genes • Understanding the mechanisms that determine which genes are expressed, and when they are expressed, is the key to the development of new treatments of diseases Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Bayesian Networks • Prior work— Clustering of expression data Groups together genes with similar expression pattern Disadvantage: does not reveal structural relations between genes • Big challenge Extract meaningful information from the expression data Discover interactions between genes based on the measurements Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Bayesian Networks • A Bayesian Network (BN) is a graphical representation of a probability distribution Compact & intuitive representation Useful for describing processes composed of locally interacting components Have a good statistical foundation Efficient model learning algorithm Capture causal relationships Deals with noisy data Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Representing Distributions • A Bayesian networks is a representation of a joint probability distribution. • A Bayesian network has two components. – G: a directed-acyclic graph structure – Θ: a set of parameters for conditional distribution of each variable • The joint probability distribution of {X1, …, Xn} is represented by Bayesian Network as follows: n P ( X 1 ,..., X n ) = ∏i =1 P( X i | Pa G ( X i )), – where PaG(Xi) is the set of parents of Xi given the graph G Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca An Example of a Simple BN Gene E - Gene B and Gene D are independent Gene A given Gene A. - Gene B asserts dependency between Gene A and Gene E. Gene B Gene D - Gene A and Gene C are independent given Gene B. P ( A, B, C , D, E ) Gene C = P( A) P( B | A) P(C | A, B ) P ( D | A, B, C ) P( E | A, B, C , D) = P( A) P( B | A, E ) P(C | B) P( D | A) P ( E ) Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Learning Bayesian Networks • Given a training set D = {x1, …, xN} of independent instances of X, find a network B = <G, Θ> that best matches D. • The score function for a network is defined as, P( D | G ) P(G ) S (G : D) = P(G | D) = P( D) where P ( D | G ) = ∫ P( D | G, Θ) P (Θ | G ) dΘ is the marginal likelihood which averages the probability of the data over all possible parameter assignments to G. Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Learning Bayesian Networks Directed-acyclic graph structure G: Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Learning Bayesian Networks Directed-acyclic graph structure G: Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca A simple example • We want to construct a BN of a system composed of 3 genes (A, B and C) that can be ON or OFF • Given the training set D • Fix a number of iteration M • Choose (randomly) M structures GJ (binarysquared matrix) • Learn the Conditional Probability Table • Choose the graph that has the max score. Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca A simple example D: A B C 0 0 0 0 1 1 0 1 1 0 0 1 ……… 0 0 1 1 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 ……… 1 1 1 Cristina Manfredotti |D| = 13 M=6 D.I.S.Co. Università di Milano - Bicocca Structures: G1 Gj: AA AB AC BA BB BC CA CB CC 0 1 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 1 1 0 0 0 0 1 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 1 1 0 0 0 1 1 0 0 1 0 0 0 G5 Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca P(A=0) = 6/13 G 1: A B P(A=1) = 7/13 C B\A 0 1 0 4/13 0 1 2/13 7/13 Cristina Manfredotti C\A,B 00 01 10 11 0 1/13 0 0 4/13 1 0 5/13 0 3/13 D.I.S.Co. Università di Milano - Bicocca G 5: A B A\B,C 00 01 10 11 0 1/13 3/13 0 2/13 1 0 0 4/13 3/13 C C\B 0 1 0 1/13 4/13 1 3/13 5/13 P(B=0) = 4/13 P(B=1) = 9/13 Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca A simple example D: A B C 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 0 0 1 1 1 0 1 1 0 ……… Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca G 1: P([0 0 0]|G1)P(G1) = 6/13*4/13*1/13*2/6 A P([0 1 1]|G1)P(G1) = 6/13*2/13*5/13*2/6 B C P([0 0 1]|G1)P(G1) = 6/13*4/13*0*2/6 … Score = 1/n∑ ∑ P(Di|G1) Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca G 5: P([0 0 0]|G5)P(G5) = 1/13*4/13*1/13*1/6 A P([0 1 1]|G5)P(G5) = 2/13*9/13*5/13*1/6 P([0 0 1]|G5)P(G5) = 3/13*4/13*3/13*1/6 B C … Score = 1/n∑ ∑ P(Di|G5) Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Analyzing Expression Data • Practical problem — Small data sets variables — hundreds of or thousands of genes samples — just tens of microarray experiments • On the positive side, genetic regulation networks are sparse!!! • Characterize and learn features that are common to most of these networks Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Analyzing Expression Data: • The first feature — Markov relations Symmetric relation: Y is in X’s Markov blanket iff there is either an edge between them, or both are parents of another variable (Pearl 98). Biological interpretation: a Markov relation indicates that the two genes are related in some joint biological interaction or process Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Analyzing Expression Data: • The second feature — order relations Global property: A is an ancestor of B in all the equivalent Bayesian networks learned Biological interpretation: an order relation indicates that the transcription of one gene is a direct cause of the transcription of another gene A B Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Estimating Statistical Confidence in Features • To what extent does the data support a given feature? • effective and relatively simple approach for estimating confidence: bootstrap method. • For i = 1, …, m – Re-sample with replacement N instances from D. Denote by Di the resulting dataset. – Apply the learning procedure on Di to induce a network structure G. • For each feature f of interest calculate conf ( f ) = 1 m ∑i =1 – where f(G) is 1 if f is a feature in G, and 0 otherwise. Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca m f (Gi ) How to collect data: • • • • • • Gene knock down Gene knock out Compound Tessue microarray Time course … Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca Where are we going Cristina Manfredotti D.I.S.Co. Università di Milano - Bicocca