Statistics: more than counting beans (Statistik: mehr als Erbsen zählen)

Project R3: Statistical assessment of gene-exposure and gene-gene interactions

In project R3, genetic risk scores will be further developed for the statistical assessment of gene-exposure and gene-gene interactions.

Genetic risk scores are weighted sums over genetic factors, in particular SNPs, that belong to, for example, the same gene or pathway. The weights are usually estimated using regularised regression procedures such as the lasso or the elastic net (Hüls et al., 2017a). Such genetic risk scores are particularly effective if, as in the analysis of SNPs, the explanatory variables are highly correlated and many of the considered factors have little or no influence on the outcome of interest, such as disease status or age. Using genetic risk scores to jointly consider a set of SNPs can therefore improve the detection of associations between genes or pathways and the outcome of interest (Hüls et al., 2017a).
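The weighted-sum definition can be illustrated with a minimal sketch. The weights below are hypothetical values chosen by hand, not estimates from any study; the zero entries only mimic the way lasso or elastic net can remove SNPs from the score. Genotypes are assumed to be coded as minor-allele counts (0, 1 or 2).

```python
def genetic_risk_score(genotype, weights):
    """Return the weighted sum of SNP allele counts for one individual."""
    if len(genotype) != len(weights):
        raise ValueError("genotype and weights must have the same length")
    return sum(w * g for w, g in zip(weights, genotype))


# Hypothetical weights for five SNPs of one gene (illustration only).
# Zeros stand for SNPs that a sparse regression would have dropped.
weights = [0.5, 0.0, -0.25, 0.0, 1.0]

# Assumed coding: number of minor alleles (0, 1 or 2) per SNP and individual.
cohort = [
    [0, 1, 2, 0, 1],
    [2, 2, 0, 1, 0],
    [1, 0, 1, 2, 2],
]

# One score per individual.
scores = [genetic_risk_score(genotype, weights) for genotype in cohort]
print(scores)
```

Each individual is reduced to a single number per gene or pathway, which is what makes the subsequent regression analysis low-dimensional.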

After the genetic risk scores have been constructed, their values are computed for the individuals participating in a study. A regression model is then applied to these values to examine the influence of the SNPs belonging to the respective gene or pathway on the outcome. Furthermore, it can be tested whether interactions between genetic risk scores and environmental factors are associated with this outcome. Such an analysis of gene-environment interactions based on genetic risk scores has already been applied, for example, in the SALIA study (Hüls et al., 2017c).
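One way to write such a model down is sketched here. The notation is an assumption introduced for illustration and is not taken from the cited papers: $x_{ij}$ is the genotype of individual $i$ at SNP $j$, $\hat{w}_j$ the estimated weight, $E_i$ the environmental exposure, $\mu_i$ the expected outcome, and $g$ a link function (identity for a quantitative outcome, logit for disease status).

```latex
\mathrm{GRS}_i = \sum_{j=1}^{p} \hat{w}_j \, x_{ij},
\qquad
g(\mu_i) = \beta_0 + \beta_G \, \mathrm{GRS}_i + \beta_E \, E_i
         + \beta_{G \times E} \, \mathrm{GRS}_i \, E_i .
```

Under this formulation, the test for a gene-environment interaction is a test of $H_0\colon \beta_{G \times E} = 0$, while $\beta_G$ captures the main effect of the gene or pathway.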

The weights of the genetic risk scores should be estimated on a different data set than the one on which the association analysis is performed. In a comprehensive simulation study, we investigated the best way to generate and employ genetic risk scores in the analysis of gene-environment interactions (Hüls et al., 2017c). Among other things, we analysed whether the environmental factors should already be taken into account when constructing the genetic risk scores. The simulation study showed that the environmental factor should only be considered in this construction when the gene-environment interaction shows a stronger association with the outcome than the corresponding genetic main effects. The study also covered the more realistic situation in which no external, independent data set is available for generating the genetic risk scores. In this situation, the data set has to be divided into one part on which the weights of the genetic risk scores are estimated and another part on which the association analyses are performed. The analysis showed that such a training and test set strategy reduces the statistical power only slightly in comparison to using external weights, i.e. weights estimated on an external, independent data set.
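The training and test set strategy can be sketched as follows. This is a minimal illustration on simulated toy data, not the procedure of the cited study: the weights are estimated here by simple per-SNP least-squares slopes, which is a stand-in for the regularised regression (lasso or elastic net) named in the text, and all function names, sample sizes and effect sizes are hypothetical.

```python
import random


def split_indices(n, train_fraction=0.5, seed=0):
    """Randomly partition range(n) into disjoint training and test index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    cut = int(round(n * train_fraction))
    return sorted(indices[:cut]), sorted(indices[cut:])


def marginal_slope(x, y):
    """Least-squares slope of y on a single predictor x (0.0 if x is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    if sxx == 0:
        return 0.0
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx


def estimate_weights(genotypes, outcome):
    """Stand-in weight estimation: one marginal slope per SNP (not elastic net)."""
    n_snps = len(genotypes[0])
    return [marginal_slope([row[j] for row in genotypes], outcome)
            for j in range(n_snps)]


def risk_scores(genotypes, weights):
    """Weighted sum of allele counts for every individual."""
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


# Simulated toy data: 200 individuals, 5 SNPs, only the first SNP has an effect.
rng = random.Random(1)
n, p = 200, 5
genotypes = [[rng.choice([0, 1, 2]) for _ in range(p)] for _ in range(n)]
outcome = [0.8 * row[0] + rng.gauss(0, 1) for row in genotypes]

# Weights come from the training part only ...
train, test = split_indices(n, train_fraction=0.5, seed=42)
weights = estimate_weights([genotypes[i] for i in train],
                           [outcome[i] for i in train])

# ... and the scores used for the association analysis from the test part only.
test_scores = risk_scores([genotypes[i] for i in test], weights)
```

Keeping the two index sets disjoint is the essential point: no individual contributes both to the weights and to the association test.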

In this project, we will investigate in a similar fashion which procedure is most suitable for generating genetic risk scores in the analysis of gene-exposure and gene-gene interactions, as they are of interest in, for example, the SALIA study. For this purpose, the genetic risk scores will be adapted to these situations. With certain age-associated diseases as the outcome, genetic risk scores will be estimated for the association analysis of the interactions of each of several exposures with all SNPs belonging to a gene or pathway. Similarly, for the assessment of interactions between different genes or pathways, SNP-based genetic risk scores will be constructed for each gene and pathway separately. These estimated genetic risk scores will then be employed to test the genes and pathways for main and interaction effects, using regression models that contain terms for both genetic risk scores and for their interaction.
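For the gene-gene case, a regression model of the kind described could take the following form. The notation is an assumption made for illustration: $\mathrm{GRS}_{A,i}$ and $\mathrm{GRS}_{B,i}$ are the risk scores of individual $i$ for two genes or pathways $A$ and $B$, $\mu_i$ is the expected outcome, and $g$ is a link function (for example the logit for a binary disease status).

```latex
g(\mu_i) = \beta_0 + \beta_A \, \mathrm{GRS}_{A,i} + \beta_B \, \mathrm{GRS}_{B,i}
         + \beta_{A \times B} \, \mathrm{GRS}_{A,i} \, \mathrm{GRS}_{B,i} .
```

In this sketch, $\beta_A$ and $\beta_B$ correspond to the main effects of the two genes or pathways, and a gene-gene interaction is tested via $H_0\colon \beta_{A \times B} = 0$.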

The genetic risk scores will not only be constructed with the regularised regression procedures that are already established for generating genetic risk scores in the association analysis of main effects and gene-environment interactions. Other regression methods such as logic regression (Ruczinski et al., 2003) and random forests (Breiman, 2001) will also be considered. These procedures take interactions between SNPs into account already at the SNP level, i.e. in the construction of the genetic risk score. The predictive power of these methods will be compared with that of the established methods.
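To make "interactions at the SNP level" concrete, the following sketch evaluates a single Boolean combination of SNPs of the kind that logic regression searches for. It assumes that each SNP is recoded into binary dominant and recessive indicators; the particular expression is hypothetical and written by hand here, whereas logic regression would select such expressions from the data.

```python
def dominant(allele_count):
    """True if the individual carries at least one minor allele."""
    return allele_count >= 1


def recessive(allele_count):
    """True if the individual carries two minor alleles."""
    return allele_count == 2


def logic_term(genotype):
    """Hypothetical logic expression: SNP1 dominant AND SNP2 recessive."""
    return dominant(genotype[0]) and recessive(genotype[1])


# Assumed coding: minor-allele counts (0, 1 or 2) for three SNPs per individual.
cohort = [
    [1, 2, 0],
    [2, 1, 0],
    [0, 2, 1],
    [2, 2, 2],
]

# The term is only true when both conditions hold simultaneously.
terms = [logic_term(genotype) for genotype in cohort]
print(terms)
```

A purely additive weighted sum of the individual SNPs cannot reproduce such an AND pattern exactly, which is the motivation for considering methods that model SNP-SNP interactions directly.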


  • Breiman L (2001). Random Forests. Machine Learning 45, 5-32, doi: 10.1023/A:1010933404324.
  • Hüls A, Ickstadt K, Schikowski T, Krämer U (2017a). Detection of gene-environment interactions in the presence of linkage disequilibrium and noise by using genetic risk scores with internal weights from elastic net regression. BMC Genet 18, 55, doi: 10.1186/s12863-017-0519-1.
  • Hüls A, Krämer U, Carlsten C, Schikowski T, Ickstadt K, Schwender H (2017b). Comparison of weighting approaches for genetic risk scores in gene-environment interaction studies. BMC Genet 18, 115, doi: 10.1186/s12863-017-0586-3.
  • Hüls A, Krämer U, Herder C, Fehsel K, Luckhaus C, Stolz S, Vierkötter A, Schikowski T (2017c). Genetic susceptibility for air pollution-induced airway inflammation in the SALIA study. Environ Res 152, 43-50, doi: 10.1016/j.envres.2016.09.028.
  • Ruczinski I, Kooperberg C, LeBlanc M (2003). Logic regression. J Computational and Graphical Statistics 12, 475-511, doi: 10.1198/1061860032238.