Statistics: more than bean counting

Bioinformatics research in the post-genomic era has to cope with a flood of new, typically high-dimensional data sets. The ultimate goal is personalized medicine that uses genetic measurements from individual patients to improve the diagnosis and therapy of diseases. The high complexity and noise levels in the data require the development and application of suitable statistical models and algorithmic procedures. However, to answer biologically relevant questions, expertise in statistics and computer science has to be combined with meaningful biological modelling. We pursue three major research interests.


1.   Statistical analysis of genomic data sets for diagnosis and therapy in cancer and HIV

Human tumors are often associated with typical genetic events such as tumor-specific chromosomal alterations. Our research goals in this field are the identification of characteristic pathogenic routes in such tumors and the prediction of survival or other medically relevant event times from the genetic patterns. To estimate the most likely pathways of chromosomal alterations from cross-sectional data, we have developed a suitable model class within a probabilistic framework, namely mixtures of oncogenetic trees. We have introduced a method to determine the optimal number of tree components based on a new, suitably adapted BIC criterion. A current research goal is the extension of the method to high-dimensional data using appropriate feature selection techniques. We have introduced a new genetic prognostic marker, the so-called genetic progression score (GPS), which estimates the progression of a tumor based on oncogenetic tree models. Using Cox regression models, we have demonstrated that the GPS reflects tumor biology better than traditional markers and can be considered a medically relevant prognostic factor. We collaborate with several medical and biological partners, with a focus on prostate cancer and different types of brain tumors. We have also applied our tree mixture models to estimate evolutionary pathways to drug resistance in HIV. Estimating the typical order of drug-specific viral mutations is important for the design of effective therapeutic strategies.
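To make the model class more concrete, the following minimal sketch shows how the probability of an observed pattern of genetic events can be computed under a single oncogenetic tree, how several trees can be combined into a mixture, and how a textbook BIC value is obtained. The tree, its event names and edge probabilities are invented for illustration, and the `bic` function is the standard criterion, not the adapted criterion mentioned above.

```python
import math

# Hypothetical tree: child event -> (parent event, P(child occurs | parent occurred)).
# Event names and probabilities are made up for illustration only.
TREE = {
    "A": ("root", 0.6),
    "B": ("A", 0.5),
    "C": ("root", 0.3),
}

def pattern_probability(tree, events):
    """Probability of observing exactly `events` under one oncogenetic tree."""
    present = set(events) | {"root"}  # the root (no alteration yet) is always present
    prob = 1.0
    for child, (parent, p) in tree.items():
        if parent in present:
            # The edge is "reachable": the child either occurred or did not.
            prob *= p if child in present else (1.0 - p)
        elif child in present:
            # A child without its parent cannot arise in a single tree.
            return 0.0
    return prob

def mixture_probability(trees, weights, events):
    """Weighted sum of the per-tree pattern probabilities."""
    return sum(w * pattern_probability(t, events) for t, w in zip(trees, weights))

def bic(log_likelihood, n_params, n_samples):
    """Standard BIC; lower values indicate a better trade-off of fit and complexity."""
    return -2.0 * log_likelihood + n_params * math.log(n_samples)
```

In such a sketch, the number of tree components would be chosen by fitting mixtures of increasing size and comparing their BIC values. How the parameters of a tree mixture are counted is an assumption left open here.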


2.   From expression data to biological insight

The statistical analysis of high-dimensional gene expression data has enjoyed increasing popularity in recent years. Particularly because of the inherent noise in expression data, an effective analysis must also take biological knowledge into account in order to gain new insights into proteins and biochemical networks. We have developed ScorePAGE, a statistical approach to scoring changes in the activity of metabolic pathways from gene expression data. The method identifies the biologically relevant pathways together with their statistical significance. Including information about pathway topology in the score further improves the sensitivity of the method. Modern methods identify important biological processes or functions from gene expression data by scoring the relevance of predefined functional gene groups, for example based on the Gene Ontology (GO). We have developed algorithmic and statistical methods that improve the explanatory power of this approach by integrating knowledge about the dependencies between the gene groups into the calculation of the statistical significance. In comparison with state-of-the-art methods for scoring functional terms, the new algorithms point to additional areas in the GO graph with significant biological processes or functions. The algorithms have been applied to several real expression data sets from prostate cancer patients.
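As a point of reference for scoring predefined gene groups, the sketch below implements a common baseline: a one-sided hypergeometric over-representation test for a single group. It treats each group in isolation and therefore ignores the dependencies between GO terms that the methods described above take into account. It is neither ScorePAGE nor one of those methods, and the function and parameter names are chosen here for illustration.

```python
from math import comb

def enrichment_pvalue(n_genes, n_in_group, n_selected, n_overlap):
    """Upper-tail hypergeometric p-value P(X >= n_overlap).

    n_genes    : number of genes measured in total
    n_in_group : genes annotated to the functional group (e.g. one GO term)
    n_selected : genes flagged as interesting (e.g. differentially expressed)
    n_overlap  : flagged genes that also belong to the group
    """
    max_overlap = min(n_in_group, n_selected)
    total = comb(n_genes, n_selected)
    tail = sum(
        comb(n_in_group, k) * comb(n_genes - n_in_group, n_selected - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total
```

For example, with 10 genes in total, 4 of them in the group and 3 flagged genes of which 2 fall into the group, `enrichment_pvalue(10, 4, 3, 2)` evaluates to 40/120, i.e. one third. In practice, such p-values would additionally need correction for testing many groups.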


3.   Classification of protein structures

As proteins are the key macromolecules within a cell, they play a central role in understanding biological processes. Supervised and unsupervised statistical learning techniques can help in classifying proteins according to molecular function, a key step in understanding the molecular biology of disease processes. Proteins typically fold into stable tertiary structures, but these structures exhibit some degree of flexibility. This flexibility is important for the function of the protein. We have developed the unsupervised method STRuster, which clusters alternative models for proteins according to backbone structural similarity. The similarity measure is based on the local shape of the protein rather than on a global measure. Proteins in the same cluster are expected to correspond to similar functional states or to similar experimental conditions. A current research goal is the restriction of the clustering method to functional sites.
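The idea of comparing alternative models by local rather than global shape can be illustrated with a small sketch. It is not the STRuster algorithm: the descriptor (distances between backbone positions a few residues apart), the dissimilarity (largest descriptor difference) and the single-linkage threshold clustering are all assumptions made here for illustration. Models are assumed to be lists of 3D backbone coordinates of equal length.

```python
import math

def local_shape(coords, window=3):
    """Distances from each residue to the residues 2..window positions ahead."""
    desc = []
    for i in range(len(coords) - window):
        for k in range(2, window + 1):
            desc.append(math.dist(coords[i], coords[i + k]))
    return desc

def dissimilarity(model_a, model_b, window=3):
    """Largest difference between corresponding local distances of two models."""
    a, b = local_shape(model_a, window), local_shape(model_b, window)
    return max(abs(x - y) for x, y in zip(a, b))

def cluster(models, threshold, window=3):
    """Single-linkage clustering: models closer than `threshold` share a label."""
    parent = list(range(len(models)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(models)):
        for j in range(i + 1, len(models)):
            if dissimilarity(models[i], models[j], window) <= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(models))]

# Toy data: a straight chain, the same chain shifted in space, and a bent chain.
straight = [(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)]
shifted = [(x + 10, y + 5, z) for x, y, z in straight]
bent = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 2, 0), (2, 2, 0)]
labels = cluster([straight, shifted, bent], threshold=0.1)
```

Because the descriptor uses only internal distances, the shifted copy ends up in the same cluster as the original without any prior superposition, while the bent chain is separated. Restricting the comparison to functional sites would, in this sketch, amount to computing the descriptor only over a chosen subset of residues.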