Supplementary material for our paper "Improvements in the reconstruction of time-varying gene regulatory networks: dynamic programming and regularization by information sharing among genes" (submitted to Bioinformatics, 2010)

Authors: Marco Grzegorczyk and Dirk Husmeier

This ReadMe.txt file describes how our Matlab implementation of the regularized cpBGe model can be used. The cpBGe model is a 'class 2' model, which was proposed in Grzegorczyk and Husmeier (NIPS 22, 2009, 682-690). Details are given in our supplementary paper. The implementation here is the one proposed in our Bioinformatics main paper, with two methodological improvements based on dynamic programming and regularization by information sharing among genes.

This ReadMe.txt file consists of two parts:
Part 1 deals with the input and output arguments.
Part 2 gives a worked toy example.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% PART 1 - OUR IMPLEMENTATION
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Note that our Matlab implementation of the regularized cpBGe model requires the 'Statistics Toolbox' for Matlab.

There is only one function; it has the following syntax:

[Run] = PROC_cpBGe_regularized(data, steps, p, k);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

Input arguments:

1) 'data' is the data matrix, where each row corresponds to a network node (gene) and each column corresponds to a time point.
E.g.: data(i,t) is the realisation of node X_i at time point t.

2) 'steps' is the number of Gibbs sampling steps to be performed.

3) 'p' and 'k' are the hyperparameters of the negative binomial distribution underlying the point process prior for changepoints (0<p<1).

Output argument:

'Run' is a Matlab structure that contains the states sampled during the Gibbs sampling simulation. How can we access the output of the i-th (i>1) step? We have to look at:

LOG_SCORE = Run.log_score(i-1); % see 1)
DAG = Run.dag{i-1}; % see 2)
NODE_VEC = Run.Node_Vec{i-1}; % see 3)
MATRIX = Run.matrix{i-1}; % see 4)

1) LOG_SCORE is the log-score (i.e.
log(likelihood*prior)) of the state (DAG,NODE_VEC,MATRIX), where

2) DAG is an N-by-N matrix, where N is the number of nodes.
DAG(k,j)=1 means that there was an edge from X_k to X_j, and
DAG(k,j)=0 means that there was no edge from X_k to X_j
in the i-th sampled graph.

3) NODE_VEC is a vector of length N, where N is the number of nodes. NODE_VEC(j) is the cluster to which the j-th node has been allocated.

4) MATRIX is the N-by-(m-1) allocation vector matrix, where N is the number of nodes and m is the number of observations (time points).

Please note that in a dynamic Bayesian network (with a time lag of 1) the first time point t=1 cannot be scored, since there are no realisations of the potential parent nodes at t=0. That is why MATRIX has (m-1) columns only. The first column corresponds to the second time point t=2, and the last column corresponds to the last time point t=m.

The i-th row of MATRIX gives the allocation vector of the i-th cluster.
E.g.: MATRIX(i,t)=k means that the (t+1)-th time point has been allocated to the k-th mixture component (segment) for the i-th cluster.

To find out which nodes belong to the k-th cluster, one can use:

nodes_in_cluster_k = find(NODE_VEC==k);

Please note that 'nodes_in_cluster_k' may be an empty set if there is no node in cluster k. This happens if the number of clusters c is lower than the number of nodes, symbolically c<N.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% PART 2 - A WORKED TOY EXAMPLE
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% We assume that the true network has N=3 nodes
% and two edges:
% X1 -> X2
% X1 -> X3
% and that we have sampled (X1,X2,X3) at
% m=20 time points.
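To make the output conventions from Part 1 concrete, here is a small illustrative sketch of how changepoint locations can be read off from a sampled NODE_VEC and MATRIX. It is written in Python for illustration only (the toolbox itself is Matlab), and all values below are hypothetical toy values, not toolbox output.

```python
# Illustrative post-processing (not part of the Matlab toolbox):
# recover the changepoint locations of node j from NODE_VEC and MATRIX.

def changepoints_for_node(node_vec, matrix, j):
    """Return the time points after which node j's segment changes.

    node_vec[j-1] is the cluster of node j (1-based, as in Matlab);
    matrix[c-1] is the allocation vector of cluster c, whose t-th
    entry (0-based) refers to time point t+2, because the first
    column of MATRIX corresponds to time point t=2.
    """
    alloc = matrix[node_vec[j - 1] - 1]  # allocation vector of node j's cluster
    cps = []
    for t in range(1, len(alloc)):
        if alloc[t] != alloc[t - 1]:
            # entry t-1 is time point t+1, entry t is time point t+2,
            # so the segment changes after time point t+1
            cps.append(t + 1)
    return cps

# Hypothetical sampled state: N=3 nodes in c=2 clusters,
# m=20 time points -> 19 allocation columns per cluster.
node_vec = [1, 1, 2]
matrix = [
    [1]*9 + [2]*10,         # cluster 1: one changepoint after t=10
    [1]*5 + [2]*9 + [3]*5,  # cluster 2: changepoints after t=6 and t=15
]
print(changepoints_for_node(node_vec, matrix, 1))  # -> [10]
print(changepoints_for_node(node_vec, matrix, 3))  # -> [6, 15]
```

This is the same bookkeeping (the "(t+1)-th time point" offset) that the worked example below performs via co-allocation heatmaps.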
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% To make the results reproducible:
rand('state',13234);

% First, we generate some 'class 1' data:
% We assume that there is one single changepoint after t=10
% that is common to all three nodes X1, X2, and X3
% ('class 1' data).

data_1 = zeros(3,20);

% Data for X1 (no parents):
data_1(1,:) = [randn(1,10)-2, randn(1,10)+2];

data_1(2,1) = randn(1);
data_1(3,1) = randn(1);

for t=2:10
    data_1(2,t) = (-1)*data_1(1,t-1) + 0.3*randn(1);
    data_1(3,t) = (-2)*data_1(1,t-1) + 0.3*randn(1);
end

% changepoint after t=10 (simply switch the signs of the coefficients)
for t=11:20
    data_1(2,t) = (+1)*data_1(1,t-1) + 0.3*randn(1);
    data_1(3,t) = (+2)*data_1(1,t-1) + 0.3*randn(1);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% We also generate some 'class 2' data:
% We assume that
% (i) there is a changepoint after t=10 that is common to X1 and X2, and
% (ii) there are two changepoints for X3, after t=6 and after t=15
% ('class 2' data).

data_2 = zeros(3,20);

% Data for X1 (no parents):
data_2(1,:) = [randn(1,10)+1, randn(1,10)-1];

data_2(2,1) = randn(1);
data_2(3,1) = randn(1);

% Data for X2:
for t=2:10
    data_2(2,t) = (-1)*data_2(1,t-1) + 0.3*randn(1);
end

% For X2 there is a changepoint after t=10:
for t=11:20
    data_2(2,t) = (+1)*data_2(1,t-1) + 0.3*randn(1);
end

% Data for X3:
% there are two changepoints, after t=6 and after t=15
for t=2:6
    data_2(3,t) = (+2)*data_2(1,t-1) + 0.3*randn(1);
end
for t=7:15
    data_2(3,t) = (-2)*data_2(1,t-1) + 0.3*randn(1);
end
for t=16:20
    data_2(3,t) = (+2)*data_2(1,t-1) + 0.3*randn(1);
end

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Set the parameters of the negative binomial...
p = 0.0001;
k = 2;

% ...and let's perform 10 Gibbs sampling steps:
steps = 10;

% Run the Gibbs sampler for data_1 and data_2:
[Run_1] = PROC_cpBGe_regularized(data_1, steps, p, k);
[Run_2] = PROC_cpBGe_regularized(data_2, steps, p, k);

% During the MCMC simulation the step index i can be seen
% on the screen (i=1 to i=steps+1).
% The MCMC simulations for data_1 and data_2 should
% finish within a few minutes.

% We look at the marginal edge posterior probabilities:
[N,m] = size(data_1);

DAG_1 = zeros(N,N);
DAG_2 = zeros(N,N);

for i=2:(steps+1)
    DAG_1 = DAG_1 + Run_1.dag{i};
    DAG_2 = DAG_2 + Run_2.dag{i};
end

DAG_1 = DAG_1/steps;
DAG_2 = DAG_2/steps;

DAG_1
DAG_2

% It can be seen that the correct edges X1->X2 and X1->X3
% have obtained the highest posterior probabilities.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% There are three plot types that may be helpful:
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% First, we monitor the log-scores:
figure(1)
clf
subplot(2,1,1)
plot(Run_1.log_score)
subplot(2,1,2)
plot(Run_2.log_score)

% For both data sets a plateau is reached. There is no indication
% AGAINST convergence.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Second, we look at a heatmap of the connectivity structure
% between nodes. That is, we want to visualise the clusters of
% nodes:

C_1 = zeros(N,N);
C_2 = zeros(N,N);

for i=2:(steps+1)
    VEC_1 = Run_1.Node_Vec{i};
    VEC_2 = Run_2.Node_Vec{i};
    for j=1:N
        for k=1:N
            if (VEC_1(j)==VEC_1(k))
                C_1(j,k) = C_1(j,k)+1;
            end
            if (VEC_2(j)==VEC_2(k))
                C_2(j,k) = C_2(j,k)+1;
            end
        end
    end
end

C_1 = C_1/steps;
C_2 = C_2/steps;

figure(2)
clf
subplot(1,2,1)
imagesc(C_1,[0,1]); colormap(gray)
subplot(1,2,2)
imagesc(C_2,[0,1]); colormap(gray)

% In the second figure there are two panels with heatmaps
% for data_1 (left panel) and data_2 (right panel).
% A grey shading is used to indicate
% the connectivity probabilities between nodes
% (black = 0 and white = 1).

% From the left panel (for data_1) it can be seen that the three
% nodes have been allocated to the same cluster. The heatmap is
% white, indicating high connectivities between the three nodes
% (i.e. the number of clusters is c=1).

% From the right panel (for data_2) it can be seen that the third
% node is separated from the first two nodes X1 and X2. That is,
% the nodes X1 and X2 form the first cluster, and X3 is separated
% from them in a second cluster. It follows that the number of
% clusters is c=2.

% We note that these findings are consistent with the data
% generation mechanisms described above.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Finally, we want to find out which changepoints have been
% inferred for the nodes in the clusters:

[N,m] = size(data_1);
m = m-1; % the first time point t=1 is not allocated

C1_1 = zeros(m,m); C1_2 = zeros(m,m); C1_3 = zeros(m,m);
C2_1 = zeros(m,m); C2_2 = zeros(m,m); C2_3 = zeros(m,m);

for i=2:(steps+1)

    mat_1 = Run_1.matrix{i};
    vec_1 = Run_1.Node_Vec{i};
    mat_2 = Run_2.matrix{i};
    vec_2 = Run_2.Node_Vec{i};

    index_1_1 = vec_1(1); index_1_2 = vec_1(2); index_1_3 = vec_1(3);
    index_2_1 = vec_2(1); index_2_2 = vec_2(2); index_2_3 = vec_2(3);

    VEC_1_1 = mat_1(index_1_1,:);
    VEC_1_2 = mat_1(index_1_2,:);
    VEC_1_3 = mat_1(index_1_3,:);
    VEC_2_1 = mat_2(index_2_1,:);
    VEC_2_2 = mat_2(index_2_2,:);
    VEC_2_3 = mat_2(index_2_3,:);

    for x=1:m
        for y=1:m
            if (VEC_1_1(x)==VEC_1_1(y))
                C1_1(x,y) = C1_1(x,y)+1;
            end
            if (VEC_1_2(x)==VEC_1_2(y))
                C1_2(x,y) = C1_2(x,y)+1;
            end
            if (VEC_1_3(x)==VEC_1_3(y))
                C1_3(x,y) = C1_3(x,y)+1;
            end
            if (VEC_2_1(x)==VEC_2_1(y))
                C2_1(x,y) = C2_1(x,y)+1;
            end
            if (VEC_2_2(x)==VEC_2_2(y))
                C2_2(x,y) = C2_2(x,y)+1;
            end
            if (VEC_2_3(x)==VEC_2_3(y))
                C2_3(x,y) = C2_3(x,y)+1;
            end
        end
    end
end

C1_1 = C1_1/steps;
C1_2 = C1_2/steps;
C1_3 = C1_3/steps;
C2_1 = C2_1/steps;
C2_2 = C2_2/steps;
C2_3 = ...
C2_3/steps;

figure(3)
clf
subplot(3,2,1)
imagesc(C1_1,[0,1]); colormap(gray)
subplot(3,2,2)
imagesc(C2_1,[0,1]); colormap(gray)
subplot(3,2,3)
imagesc(C1_2,[0,1]); colormap(gray)
subplot(3,2,4)
imagesc(C2_2,[0,1]); colormap(gray)
subplot(3,2,5)
imagesc(C1_3,[0,1]); colormap(gray)
subplot(3,2,6)
imagesc(C2_3,[0,1]); colormap(gray)

% The plot is arranged as a 3-by-2 matrix.
% The first column corresponds to data_1, and
% the second column corresponds to data_2.
% The rows correspond to the three network nodes:
% first row  -> X1
% second row -> X2
% third row  -> X3

% In each of the six panels there is a heatmap,
% and a grey shading is used to indicate
% the connectivity probabilities between time points
% (black = 0 and white = 1).

% From the left column (for data_1) it can be seen that
% the three nodes X1, X2 and X3 share the same changepoint
% after t=10.

% From the right column (for data_2) it can be seen that
% the first two nodes X1 and X2 share the same changepoint
% after t=10. For node X3 two changepoints, after t=6 and t=15,
% have been inferred.

% We note that these findings are consistent with the data
% generation mechanisms described above:
% data_1 -> class 1 data
%   (all nodes share the same changepoint after t=10)
% data_2 -> class 2 data
%   (the nodes X1 and X2 share the same changepoint after t=10, and
%   node X3 has two changepoints, after t=6 and t=15).

% END OF WORKED EXAMPLE
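As a footnote to the worked example: the six co-allocation matrices above (C1_1, ..., C2_3) are all built by the same pattern, namely counting how often two time points fall into the same segment, averaged over the Gibbs samples. A compact, illustrative re-implementation of that pattern, in Python rather than Matlab and with hypothetical toy inputs:

```python
# Illustrative sketch (not part of the Matlab toolbox): generic
# co-allocation matrix over sampled allocation vectors.

def coallocation(alloc_vectors):
    """Given one allocation vector per Gibbs sample, return the matrix
    C with C[x][y] = fraction of samples in which time points x and y
    were allocated to the same segment."""
    n_samples = len(alloc_vectors)
    m = len(alloc_vectors[0])
    C = [[0.0] * m for _ in range(m)]
    for vec in alloc_vectors:
        for x in range(m):
            for y in range(m):
                if vec[x] == vec[y]:
                    C[x][y] += 1.0 / n_samples
    return C

# Two hypothetical sampled allocation vectors for 4 scorable time points:
samples = [[1, 1, 2, 2],
           [1, 1, 1, 2]]
C = coallocation(samples)
print(C[0][1])  # -> 1.0 (always in the same segment)
print(C[1][2])  # -> 0.5 (together in half of the samples)
```

Plotting such a matrix as a heatmap (white = 1, black = 0) gives exactly the block structure seen in figure 3, with block boundaries at the inferred changepoints.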