Tutorial: Distributed Data Analysis using R

Stefan Rüping, Fraunhofer IAIS, Sankt Augustin, Germany.
Michael Mock, Fraunhofer IAIS, Sankt Augustin, Germany.
Dennis Wegener, Fraunhofer IAIS, Sankt Augustin, Germany.

Slides available



In the last couple of years, the amount of data to be analysed in many areas has grown rapidly. Examples range from the natural sciences (e.g. astronomy or particle physics) and business data (e.g. a large increase in data volume is expected from the use of RFID technology) to the life sciences (such as high-throughput genomics and post-genomics technologies) and data generated by ordinary users on the internet (e.g. Google, YouTube). This enormous growth in data is complemented by advances in distributed computing technology that enable the data analyst to handle such volumes in reasonable time. Two main streams of current distributed technology development and research are particularly useful in this respect: GRID technology, which aims to make geographically dispersed data stores and computing facilities available for common, global data analysis, and cluster-based computing, which turns large numbers of standard computers into high-performance computing platforms.

However, even though the advances in distributed computing technology mentioned above provide the computing and storage resources needed to handle large amounts of data, they introduce another level of complexity into the system, so that the traditional data analyst, with a strong background in statistics and application domain knowledge, may be overwhelmed by the complexity of the underlying distributed technology. For instance, an application developer using R may not be interested in any details of how web services are built. Therefore, ongoing research aims at bridging the gap between advanced distributed computing technology and traditional statistical software.

The goal of this tutorial is to inform statisticians, especially those using the R language, about current trends in distributed computing technology and to show how to use and integrate R programs in distributed environments, covering both GRID and cluster-based computing. As a particularly challenging example, we will, among other things, report on the Advancing Clinico-Genomics Trials on Cancer (ACGT) project, which aims at providing a data analysis environment that allows the exploitation of an enormous pool of data collected in European cancer treatments.

In the context of this project, the GridR package was developed, one of the first attempts to connect R to a grid environment - that is, to grid-enable R. We will give an introduction to distributed data analysis and data exchange in the context of R and a detailed description of the GridR package. We will then show a real-world example of distributed data analysis using R, based on a scenario from the clinico-genomic domain in the context of the ACGT project.

The goal of the tutorial is to familiarize attendees with the principles of distributed computing and to discuss relevant R packages (such as GridR) that provide access to distributed computing environments. Participants will learn how to make use of a distributed environment from their local programs.


Required Knowledge


The tutorial will include practical exercises to give participants the opportunity to gain firsthand experience of distributed data analysis, and even to try out distributing their own programs.