Introduction
I am less and less often mistaken for a pirate when I mention the R language. While I miss the excuse to wear an eyepatch, I'm glad more people are beginning to explore a statistical language I've been touting for years. When it comes to plotting or running complex statistics in a single line of code, R is a great tool to have. That said, there are plenty of pitfalls for the casual or new user: syntax, learning to write vectorized code, or even just knowing which "apply" function you really should choose.
I want to explore a slightly less-often considered aspect of R development: parallelism. Out of the box, R can seem very limited to someone used to working on compute clusters or even a multicore server. However, there are a few tricks we can leverage to get the most out of R on everything from a personal workstation to a Hadoop cluster.
R is Single-Threaded
The R interpreter is -- and likely always will be -- single-threaded. This means loading data frames happens in a single thread. So does building your linear model, or generating that pretty surface plot. Meanwhile, even my laptop has several cores sitting idle, which is a lot of potential threads not being used for modeling. No matter how much my web browser might covet those cycles, I'd like to use them for work.
Rather than a complex multithreaded re-implementation, the R interpreter offers a number of ways to let users selectively apply parallelism. Some of these approaches leverage MPI libraries and mirror that message-passing approach. Others allow a more implicit parallelism via "foreach" or "apply" constructs. We'll focus on a pair of strategies built on the parallelism that has shipped with R since version 2.14.1: the parallel library.
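As a quick taste of the "apply" flavor before we dig in, the parallel package's mclapply is a near drop-in replacement for lapply that forks local worker processes. Here's a minimal sketch (forking is not available on Windows, where mc.cores is limited to 1):
library(parallel)
# lapply, but fanned out across forked local workers
squares <- mclapply(1:8, function(x) x^2, mc.cores = 2)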
Setting The Stage for Parallel Execution
We're going to need to load a few libraries into our R session before we can execute anything outside of a single thread. We'll use the doParallel and foreach packages because they allow us to focus on what to parallelize rather than how to coordinate our workers.
# Load the example data set and the parallel-execution packages
data(iris)
library(parallel)
library(iterators)
library(doParallel)
library(foreach)
Knowing that calculations in R will be single-threaded, we want to use the parallel package to operate on logical subsets of the data simultaneously. For example, I loaded the iris data set, which contains measurements for a number of different species. One way I might want to parallelize is to fit the same model to each species simultaneously. For that, I'm going to have to split the data by species:
species.split <- split(iris, iris$Species)
This gives us a list we can iterate over -- or parallelize. From here on out, it's simply a question of deciding what resources we want to leverage: local CPUs or remote hosts.
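If you'd like to sanity-check the pieces before going parallel, the split is easy to inspect:
names(species.split)          # "setosa" "versicolor" "virginica"
sapply(species.split, nrow)   # 50 observations in each subset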
FORKs and SOCKs
We're going to use the makeCluster function to bind together a set of computational resources. But first we need to decide: do we want to use only local CPUs, or is it necessary to open up socket connections to other machines to distribute our workload? In the former case, we'll use makeCluster to create what's called a FORK cluster (in that it uses UNIX's fork call to create slaves). In the latter, we'll create a SOCK cluster by opening up sockets to a list of remote hosts and starting slave processes on them.
Here's a FORK cluster which uses all my cores:
# A FORK cluster with one worker per local core
cl <- makeCluster(detectCores(), type = "FORK")
registerDoParallel(cl)
And here's a SOCK cluster across three nodes (password-less SSH to each host is required):
# A SOCK cluster with one worker process on each remote host
hostlist <- c("10.0.0.1", "10.0.0.2", "10.0.0.3")
cl <- makeCluster(hostlist)
registerDoParallel(cl)
In each case, I call registerDoParallel to bind this cluster to the %dopar% operator. This is the operator which will let us easily iterate in parallel.
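If you want to double-check that the registration took, the foreach package can report on the backend %dopar% will use:
getDoParWorkers()   # how many workers %dopar% will farm work out to
getDoParName()      # which parallel backend is currently registered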
Running in Parallel
Once we've got something to iterate over and a cluster with which to do it, modeling in parallel becomes straightforward. Suppose I want to fit a model of sepal length as a linear combination of petal characteristics. In that case, the code is simply:
species.models <- foreach(i = species.split) %dopar% {
  # Fit sepal length against the petal measurements for this species
  lm(Sepal.Length ~ Petal.Width * Petal.Length, data = i)
}
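Because foreach returns an ordinary list, the fitted models can be inspected exactly as if they had been built serially, for example:
# Coefficients from each per-species model
lapply(species.models, coef)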
But I'm not just restricted to fitting linear models on my little cluster. I can run k-means clustering for several different k simultaneously using basically the same block:
species.clusters <- foreach(i = 2:5) %dopar% {
  # kmeans needs numeric data, so drop the Species factor column
  kmeans(iris[, -5], i)
}
When I'm done with my block, I can just call stopCluster(cl) to ensure my processes terminate and I'm not hogging resources.
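That cleanup is a one-liner, and one habit I'd suggest (not strictly required) is re-registering the sequential backend so any later %dopar% calls don't trip over a dead cluster:
stopCluster(cl)    # shut down the worker processes
registerDoSEQ()    # fall back to sequential execution for future %dopar% calls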
Using Hadoop
Finally, there will be situations in which I need to deploy in parallel against much larger datasets -- specifically, datasets stored in HDFS. Both Hive and Pig will let me run an R script as part of a streaming process. In Hive, the TRANSFORM operator will send data to an R script. In Pig, you can use the STREAM operator to send a whole bag to an R script. However, you can't stream from within Pig's FOREACH blocks, so I occasionally use a UDF which invokes R scripts for me.
Regardless of the method you choose to send HDFS data to an R process, it's important to make sure your R script can consume data streaming from standard input. I find the most expedient way of doing this is via the file function. A typical script might start:
#!/usr/bin/env Rscript
# Connection to STDIN for reading a data frame
con <- file(description = "stdin")
my.data.frame <- read.table(con, header = FALSE, sep = ",")
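From there, the script just needs to do its work and write its results back to standard output for Hive or Pig to collect. Here's a minimal sketch of how the rest of such a script might look, assuming the stream carries the four iris measurements plus a species label (the column names below are my own illustration, not anything the snippet above defines):
colnames(my.data.frame) <- c("sepal.length", "sepal.width",
                             "petal.length", "petal.width", "species")
# Fit the same per-species model as before and emit one comma-separated
# line of coefficients per species on standard output
models <- lapply(split(my.data.frame, my.data.frame$species),
                 function(d) lm(sepal.length ~ petal.width * petal.length, data = d))
for (s in names(models)) {
  cat(s, paste(round(coef(models[[s]]), 4), collapse = ","), sep = ",")
  cat("\n")
}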
Summary
We've covered several ways to push R beyond the bounds of its single-threaded core. There are forking and socket mechanisms for spreading our work around, not to mention tricks for leveraging the power of Hadoop Streaming. In each case, however, one thing stands out: we must be smart as modelers and understand what can and should be done in parallel.