MIT Programming Systems Research Group

"Using HDA Data to Statistically Validate Models of Genetic Regulatory Networks in S. cerevisiae"

We aim to elucidate S. cerevisiae genetic regulatory networks through the analysis of high-density DNA array (HDA) data. Existing techniques for analyzing HDA data do not focus on statistically testing hypotheses about either the functioning of complex multivariate systems or the form of complex regulatory networks. Typically, single-factor analysis is performed by examining the fold increase or decrease in expression of a target gene or collection of genes, and graphical visualization is used to demonstrate coordinated patterns of expression. Moreover, noise in HDA data is typically not analyzed in detail, so the significance of alternative conclusions from HDA studies cannot be compared directly and quantitatively. Finally, no single framework currently exists that permits theories to describe latent variables, such as protein levels, and to make predictions that can later be verified as those data become available.

We address these problems by representing biological theories in computational form. Theories represented computationally have the advantage that statistical metrics can be used to compare the predictive power of different theories in the presence of observed data. Once a theory about a biological system is represented computationally, the theory can be automatically tested, refined, stored, used as a component in other models, and communicated to others. The need to represent theories in computational form raises many interesting questions about appropriate knowledge representation languages for molecular biology. We have chosen graphical models as the basis of our approach. Graphical models form a class of flexible and interpretable models for representing probabilistic relationships among variables of interest.

One key feature of a graphical model representation is that a single model can simultaneously contain knowledge at varying levels of refinement, from the qualitative dependence relations captured in a graph to the more quantitative description of dependencies among the variables of interest. The level of refinement need not be uniform within a model; a single model can represent different network relationships at different levels of refinement. Another key feature is that the specification and use of these models are not limited to cases where all the variables are observed; graphical models permit the inclusion of latent variables for representing components of regulatory networks that are currently unobservable or unobserved, such as protein levels.
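As a rough illustration of these two features, the sketch below encodes a hypothetical three-node chain in Python: a regulator mRNA level R, a latent protein level P, and a target mRNA level G. The graph, the binary states, and every probability value are assumptions made up for this example; none of them come from the work described here. The graph alone carries the qualitative knowledge, the tables carry the quantitative knowledge, and the latent protein level is summed out when only the mRNA levels are observed.

```python
# Hypothetical chain: regulator mRNA R -> protein P (latent) -> target mRNA G.
# All states are binary (0 = low, 1 = high). Every number is an illustrative
# assumption, not a measured or published value.

# Qualitative level: the graph, as a map from each node to its parents.
parents = {"R": [], "P": ["R"], "G": ["P"]}

# Quantitative level: P(node = 1 | parent values), keyed by parent states.
p_high = {
    "R": {(): 0.5},
    "P": {(0,): 0.1, (1,): 0.8},
    "G": {(0,): 0.2, (1,): 0.9},
}


def prob(node, value, assignment):
    """Probability that `node` takes `value` given its parents' states."""
    key = tuple(assignment[p] for p in parents[node])
    p1 = p_high[node][key]
    return p1 if value == 1 else 1.0 - p1


def joint(assignment):
    """Joint probability of a full assignment to R, P, and G."""
    result = 1.0
    for node, value in assignment.items():
        result *= prob(node, value, assignment)
    return result


def marginal_observed(r, g):
    """P(R = r, G = g), summing out the unobserved protein level P."""
    return sum(joint({"R": r, "P": p, "G": g}) for p in (0, 1))
```

Calling `marginal_observed(1, 1)` gives the probability of seeing both mRNA levels high without ever observing the protein; a model refined only to the graph level would keep `parents` and leave `p_high` unspecified.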

Most importantly, using graphical models enables us to statistically compare the validity of S. cerevisiae genetic regulatory network models in the context of experimental data. We have developed a simple illustration of how graphical models can be used to represent biological theories and to compare, with statistical rigor, alternative hypotheses about particular mechanisms in genetic regulatory networks. In our illustration, we examine two published hypotheses regarding the role of Gal80 protein in the yeast galactose system. Using 52 genomes' worth of yeast HDA data, we have shown how our statistical metrics discriminate between the two hypotheses, favoring the currently accepted one.
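The text above does not say which statistical metric was used. As a hedged sketch of the general idea, the Python below scores two candidate structures with one common choice for discrete graphical models: the log marginal likelihood under uniform Dirichlet priors (often called the K2 score). The variable names R and G, the two toy structures, and the 20 synthetic observations are all assumptions for illustration; they are not the Gal80 models or the yeast data.

```python
from collections import Counter
from math import lgamma


def node_score(data, node, parent_names, n_states=2):
    """Log marginal likelihood of one node given its parents (K2 score)."""
    counts = Counter()
    for row in data:
        config = tuple(row[p] for p in parent_names)
        counts[(config, row[node])] += 1
    configs = {config for config, _ in counts}
    score = 0.0
    for config in configs:
        n_jk = [counts[(config, k)] for k in range(n_states)]
        # Uniform Dirichlet prior: one pseudo-count per state.
        score += lgamma(n_states) - lgamma(sum(n_jk) + n_states)
        score += sum(lgamma(n + 1) for n in n_jk)
    return score


def structure_score(data, structure):
    """Sum of node scores; `structure` maps each node to its parent list."""
    return sum(node_score(data, node, pars) for node, pars in structure.items())


# Synthetic, discretized toy observations (0 = low, 1 = high); not real data.
data = (
    [{"R": 0, "G": 0}] * 8 + [{"R": 0, "G": 1}] * 2
    + [{"R": 1, "G": 0}] * 2 + [{"R": 1, "G": 1}] * 8
)

# Two hypothetical competing structures.
h_dependent = {"R": [], "G": ["R"]}   # G is regulated by R
h_independent = {"R": [], "G": []}    # G is unrelated to R

# Positive values favor the dependent structure.
log_bayes_factor = (
    structure_score(data, h_dependent) - structure_score(data, h_independent)
)
```

Because both structures are scored against the same data on the same scale, their difference acts as a log Bayes factor, which is the kind of direct quantitative comparison that fold-change analysis alone does not provide.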

