Tutorial: Analysis of Complex Traits using R: Statistical Applications

Andrea S. Foulkes, School of Public Health and Health Sciences, University of Massachusetts, Amherst, MA, USA.


This tutorial introduces fundamental statistical concepts and analysis tools for characterizing genotype-trait associations in population-based studies. Topics include testing for linkage disequilibrium (LD) and Hardy-Weinberg equilibrium (HWE), adjusting for multiplicity, accounting for phase ambiguity, and addressing high dimensionality through machine learning algorithms. The emphasis is on mathematical representations of the concepts, paired with practical R tools for implementing the proposed strategies.


LD and HWE. A general introduction to genetic association studies is provided, including typical data components and the overarching analytical challenges inherent in these investigations. Basic genetic vocabulary and concepts, including linkage disequilibrium (LD) and Hardy-Weinberg equilibrium (HWE), are described, as well as the R package genetics and its functions for formally testing for the presence of LD and HWE.
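
As a rough illustration of the two tests named above, the following base-R sketch computes a one-degree-of-freedom chi-square test of HWE at a single biallelic SNP and the usual LD summaries (D, D' and r^2) for a pair of SNPs. All counts are hypothetical and chosen only for illustration, and the LD part assumes that haplotype counts are known, which is generally not the case for population-level data. The genetics package offers its own functions for these tasks (for example HWE.chisq() and LD()); this sketch deliberately avoids them so that it runs without extra packages.

```r
# Hypothetical genotype counts at one biallelic SNP (illustrative, not real data)
obs <- c(AA = 298, Aa = 489, aa = 213)
n <- sum(obs)

# Allele frequency of A, estimated by allele counting
p <- (2 * obs[["AA"]] + obs[["Aa"]]) / (2 * n)
q <- 1 - p

# Genotype counts expected under Hardy-Weinberg proportions
expected <- n * c(AA = p^2, Aa = 2 * p * q, aa = q^2)

# Pearson chi-square statistic; 1 df because one allele frequency is estimated
hwe_chisq <- sum((obs - expected)^2 / expected)
hwe_p <- pchisq(hwe_chisq, df = 1, lower.tail = FALSE)

# Hypothetical haplotype counts for two biallelic SNPs (phase assumed known)
hap <- c(AB = 474, Ab = 611, aB = 142, ab = 773)
hap_freq <- hap / sum(hap)
pA <- hap_freq[["AB"]] + hap_freq[["Ab"]]
pB <- hap_freq[["AB"]] + hap_freq[["aB"]]

# D: deviation of the AB haplotype frequency from independence
D <- hap_freq[["AB"]] - pA * pB

# D': D scaled by its maximum attainable magnitude given the allele frequencies
D_max <- if (D >= 0) min(pA * (1 - pB), (1 - pA) * pB) else min(pA * pB, (1 - pA) * (1 - pB))
D_prime <- abs(D) / D_max

# r^2: squared correlation between the alleles at the two loci
r2 <- D^2 / (pA * (1 - pA) * pB * (1 - pB))

print(c(chisq = hwe_chisq, p_value = hwe_p))
print(c(D = D, D_prime = D_prime, r2 = r2))
```

A small HWE p-value would suggest a departure from Hardy-Weinberg proportions; D' and r^2 both lie between 0 and 1, with larger values indicating stronger LD.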

Multiple testing adjustments. The second portion of the tutorial focuses on applications of several multiple comparison procedures that address the multiplicity problem inherent in most genotype-trait association studies. These include both single-step and step-down adjustments (e.g., Bonferroni and false discovery rate control) as well as resampling-based methods (e.g., the approaches of Westfall and Young, 1993, and Pollard and van der Laan, 2004). This portion describes applications of existing functions and packages, including p.adjust() and qvalue, as well as alternative, simple coding examples for making appropriate adjustments.
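
The adjustments described above can be sketched with base R alone. The first part applies p.adjust() to a hypothetical vector of unadjusted p-values using the Bonferroni, Holm and Benjamini-Hochberg methods. The second part is a minimal single-step permutation adjustment in the spirit of the max-statistic approach of Westfall and Young, run on simulated genotypes and a simulated trait. The data, the choice of test statistic (absolute correlation) and the number of permutations are all assumptions made for illustration, not the tutorial's own examples.

```r
# Hypothetical unadjusted p-values from m = 8 single-SNP tests
p_raw <- c(0.0002, 0.004, 0.012, 0.03, 0.04, 0.20, 0.55, 0.81)

adj <- data.frame(
  raw        = p_raw,
  bonferroni = p.adjust(p_raw, method = "bonferroni"),  # single-step, controls FWER
  holm       = p.adjust(p_raw, method = "holm"),        # step-down, controls FWER
  BH         = p.adjust(p_raw, method = "BH")           # Benjamini-Hochberg, controls FDR
)
print(adj)

# Simulated data: n individuals, m SNPs coded as 0/1/2 copies of the minor allele
set.seed(1)
n <- 100
m <- 10
geno <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n)
trait <- rnorm(n) + 0.8 * geno[, 1]   # only the first SNP affects the trait

# Test statistic per SNP: absolute correlation between genotype and trait
snp_stat <- function(y) abs(cor(geno, y))[, 1]
obs_stat <- snp_stat(trait)

# Permuting the trait breaks every genotype-trait association at once,
# while preserving the correlation among SNPs
B <- 500
max_null <- replicate(B, max(snp_stat(sample(trait))))

# Single-step max-statistic adjusted p-values
p_maxT <- sapply(obs_stat, function(s) mean(max_null >= s))
print(round(p_maxT, 3))
```

Because the permutation step keeps the SNP columns intact, the resampling-based adjustment reflects the dependence among tests, which the Bonferroni correction ignores.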

Accounting for ambiguity in phase. Haplotype reconstruction techniques are typically applied to population-level association data, in which allelic phase is generally unobservable. In the third section of this tutorial, focus is placed on one such approach that uses an expectation-maximization type algorithm. This approach can be based solely on observed genotype data or additionally incorporate information on a quantitative trait. Here emphasis is placed on the application of functions within the haplo.stats package.
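
To make the idea of an expectation-maximization algorithm for phase concrete, here is a minimal base-R sketch for the simplest case of two biallelic SNPs, where only double heterozygotes are phase-ambiguous. The genotype counts are hypothetical, and the code is a hand-written illustration rather than the haplo.stats implementation; in practice a function such as haplo.em() from that package handles more loci and missing data.

```r
# Hypothetical joint genotype counts for two SNPs (illustrative only).
# Rows: copies of allele A at SNP 1; columns: copies of allele B at SNP 2.
g <- matrix(c(30, 12,  2,
              14, 40, 10,
               1, 11, 25),
            nrow = 3, byrow = TRUE,
            dimnames = list(A = c("0", "1", "2"), B = c("0", "1", "2")))
N <- sum(g)

# Haplotype counts contributed by genotypes whose phase is unambiguous
base <- c(
  AB = 2 * g["2", "2"] + g["2", "1"] + g["1", "2"],
  Ab = 2 * g["2", "0"] + g["2", "1"] + g["1", "0"],
  aB = 2 * g["0", "2"] + g["0", "1"] + g["1", "2"],
  ab = 2 * g["0", "0"] + g["0", "1"] + g["1", "0"]
)
n_dh <- g["1", "1"]   # double heterozygotes: either AB/ab or Ab/aB

freq <- c(AB = 0.25, Ab = 0.25, aB = 0.25, ab = 0.25)   # starting values
for (iter in 1:500) {
  # E-step: probability that a double heterozygote carries the AB/ab pair
  cis   <- freq[["AB"]] * freq[["ab"]]
  trans <- freq[["Ab"]] * freq[["aB"]]
  w <- cis / (cis + trans)

  # M-step: update frequencies from the expected haplotype counts
  counts <- base + n_dh * c(w, 1 - w, 1 - w, w)
  new_freq <- counts / (2 * N)

  converged <- max(abs(new_freq - freq)) < 1e-10
  freq <- new_freq
  if (converged) break
}
print(round(freq, 4))
```

The estimated frequencies sum to one, and the implied allele frequencies match those obtained by direct allele counting, since the ambiguity affects only how alleles are paired into haplotypes.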

High-dimensional data methods. Finally, in the fourth part of the tutorial, two approaches and associated tools for handling the high-dimensional aspect of the data, namely random forests (RFs) and multivariate adaptive regression splines (MARS), are described for the genetics setting. Both RFs and MARS are machine learning approaches that represent extensions of the classification and regression tree methodology. Here, applications of functions within the R packages randomForest and earth are provided.
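
A minimal sketch of how the two packages might be applied to genotype data follows. The data are simulated, and the SNP names, effect sizes and tuning choices (ntree = 500, degree = 2) are assumptions made for illustration. Each fit is wrapped in a requireNamespace() check so that the script still runs if a package is not installed.

```r
# Simulated data: 200 individuals, 20 SNPs coded 0/1/2 (illustrative only)
set.seed(42)
n <- 200
m <- 20
geno <- matrix(rbinom(n * m, size = 2, prob = 0.3), nrow = n,
               dimnames = list(NULL, paste0("snp", 1:m)))
# Trait depends on snp1 and on an interaction between snp2 and snp3
trait <- 1.0 * geno[, "snp1"] + 0.8 * geno[, "snp2"] * geno[, "snp3"] + rnorm(n)

rf_imp <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  # Random forest regression with permutation-based variable importance
  rf <- randomForest::randomForest(x = geno, y = trait,
                                   ntree = 500, importance = TRUE)
  rf_imp <- randomForest::importance(rf, type = 1)   # % increase in MSE
  print(head(rf_imp[order(rf_imp[, 1], decreasing = TRUE), , drop = FALSE]))
}

mars_fit <- NULL
if (requireNamespace("earth", quietly = TRUE)) {
  # MARS fit allowing two-way interactions between predictors
  mars_fit <- earth::earth(x = geno, y = trait, degree = 2)
  print(earth::evimp(mars_fit))   # estimated variable importance
}
```

In both cases the output of interest is a ranking of SNPs by importance rather than a set of per-SNP p-values, which is one reason these methods suit settings with many candidate predictors.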

Several publicly available data sets are used in this tutorial to aid in the illustration of analytic tools. Applications focus on the rapidly expanding field of public health and medical research investigations of complex disease genotype-trait associations in unrelated individuals. Particular attention is given to appropriate handling of population-level environmental factors that confound or modify associations of interest. This tutorial will offer participants a basic introduction to genetic association studies and fundamental knowledge of the broad and powerful spectrum of tools R offers for addressing an array of analytical challenges inherent in these investigations.


Elementary knowledge of statistical concepts at the level of a first course in biostatistics is assumed. This tutorial is intended to appeal to public health and medical researchers involved in genetic investigations, as well as biologists, statisticians and computer scientists with interests in bioinformatic tools. Finally, the content of this tutorial is based on a textbook entitled Applied Statistical Genomics in R (Springer UseR! Series, September 2008).