Introduction
I am less and less often mistaken for a pirate when I mention the R language. While I miss the excuse to wear an eyepatch, I'm glad more people are beginning to explore a statistical language I've been touting for years. When it comes to plotting or running complex statistics in a single line of code, R is a great tool to have. That said, there are plenty of pitfalls for the casual or new user: syntax, learning to write vectorized code, or even just knowing which "apply" function you really should choose.
I want to explore a slightly less-often considered aspect of R development: parallelism. Out of the box, R can seem very limited to someone used to working on compute clusters or even a multicore server. However, there are a few tricks we can leverage to get the most out of R on everything from a personal workstation to a Hadoop cluster.
R is Single-Threaded
The R interpreter is -- and likely always will be -- single-threaded. This means loading data frames happens in a single thread. So does building your linear model, or generating that pretty surface plot. Meanwhile, even my laptop has several cores sitting idle, which is a lot of potential threads not being used for modeling. No matter how much my web browser might covet those cycles, I'd like to use them for work.
Rather than a complex multithreaded re-implementation, the R interpreter offers a number of ways to let users selectively apply parallelism. Some of these approaches leverage MPI libraries and mirror that message-passing approach. Others allow a more implicit parallelism via "foreach" or "apply" constructs. We'll focus on a pair of strategies built on the parallelism that has shipped with R since version 2.14.1: the parallel library.
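As a quick taste of the "apply" flavor before we dig in, the parallel package's mclapply is a near drop-in replacement for lapply that forks local worker processes. Here's a minimal sketch (forking is not available on Windows, where mc.cores is limited to 1):
library(parallel)
# lapply, but fanned out across forked local workers
squares <- mclapply(1:8, function(x) x^2, mc.cores = 2)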
Setting The Stage for Parallel Execution
We're going to need to load a few libraries into our R session before we can execute anything outside of a single thread. We'll use the doParallel and foreach packages because they allow us to focus on what to parallelize rather than how to coordinate our workers.
# Load the example data set and the parallel-execution packages
data(iris)
library(parallel)
library(iterators)
library(doParallel)
library(foreach)
Knowing that calculations in R will be single-threaded, we want to use the parallel package to operate on logical subsets of the data simultaneously. For example, I loaded the iris data set, which contains measurements for a number of different species. One way I might want to parallelize is to fit the same model to each species simultaneously. For that, I'm going to have to split the data by species:
species.split <- split(iris, iris$Species)
This gives us a list we can iterate over -- or parallelize. From here on out, it's simply a question of deciding what resources we want to leverage: local CPUs or remote hosts.
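If you'd like to sanity-check the pieces before going parallel, the split is easy to inspect:
names(species.split)          # "setosa" "versicolor" "virginica"
sapply(species.split, nrow)   # 50 observations in each subset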
FORKs and SOCKs
We're going to use the makeCluster function to bind together a set of computational resources. But first we need to decide: do we want to use only local CPUs, or is it necessary to open up socket connections to other machines to distribute our workload? In the former case, we'll use makeCluster to create what's called a FORK cluster (in that it uses UNIX's fork call to create slaves). In the latter, we'll create a SOCK cluster by opening up sockets to a list of remote hosts and starting slave processes on them.
Here's a FORK cluster which uses all my cores:
# A FORK cluster with one worker per local core
cl <- makeCluster(detectCores(), type = "FORK")
registerDoParallel(cl)
And here's a SOCK cluster across three nodes (password-less SSH to each host is required):
# A SOCK cluster with one worker process on each remote host
hostlist <- c("10.0.0.1", "10.0.0.2", "10.0.0.3")
cl <- makeCluster(hostlist)
registerDoParallel(cl)
In each case, I call registerDoParallel to bind this cluster to the %dopar% operator. This is the operator which will let us easily iterate in parallel.
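If you want to double-check that the registration took, the foreach package can report on the backend %dopar% will use:
getDoParWorkers()   # how many workers %dopar% will farm work out to
getDoParName()      # which parallel backend is currently registered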
Running in Parallel
Once we've got something to iterate over and a cluster with which to do it, modeling in parallel becomes straightforward. Suppose I want to fit a model of sepal length as a linear combination of petal characteristics. In that case, the code is simply:
species.models <- foreach(i = species.split) %dopar% {
  # Fit sepal length against the petal measurements for this species
  lm(Sepal.Length ~ Petal.Width * Petal.Length, data = i)
}
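Because foreach returns an ordinary list, the fitted models can be inspected exactly as if they had been built serially, for example:
# Coefficients from each per-species model
lapply(species.models, coef)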
But I'm not just restricted to fitting linear models on my little cluster. I can run k-means clustering for several different k simultaneously using basically the same block:
species.clusters <- foreach(i = 2:5) %dopar% {
  # kmeans needs numeric data, so drop the Species factor column
  kmeans(iris[, -5], i)
}
When I'm done with my block, I can just call stopCluster(cl) to ensure my processes terminate and I'm not hogging resources.
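That cleanup is a one-liner, and one habit I'd suggest (not strictly required) is re-registering the sequential backend so any later %dopar% calls don't trip over a dead cluster:
stopCluster(cl)    # shut down the worker processes
registerDoSEQ()    # fall back to sequential execution for future %dopar% calls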
Using Hadoop
Finally, there will be situations in which I need to deploy in parallel against much larger datasets -- specifically, datasets stored in HDFS. Both Hive and Pig will let me run an R script as part of a streaming process. In Hive, the TRANSFORM operator will send data to an R script. In Pig, you can use the STREAM operator to send a whole bag to an R script. However, you can't stream from within Pig's FOREACH blocks, so I occasionally use a UDF which invokes R scripts for me.
Regardless of the method you choose to send HDFS data to an R process, it's important to make sure your R script can consume data streaming from standard input. I find the most expedient way of doing this is via the file function. A typical script might start:
#!/usr/bin/env Rscript
# Connection to STDIN for reading a data frame
con <- file(description = "stdin")
my.data.frame <- read.table(con, header = FALSE, sep = ",")
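From there, the script just needs to do its work and write its results back to standard output for Hive or Pig to collect. Here's a minimal sketch of how the rest of such a script might look, assuming the stream carries the four iris measurements plus a species label (the column names below are my own illustration, not anything the snippet above defines):
colnames(my.data.frame) <- c("sepal.length", "sepal.width",
                             "petal.length", "petal.width", "species")
# Fit the same per-species model as before and emit one comma-separated
# line of coefficients per species on standard output
models <- lapply(split(my.data.frame, my.data.frame$species),
                 function(d) lm(sepal.length ~ petal.width * petal.length, data = d))
for (s in names(models)) {
  cat(s, paste(round(coef(models[[s]]), 4), collapse = ","), sep = ",")
  cat("\n")
}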
Summary
We've covered several ways to push R beyond the bounds of its single-threaded core. There are forking and socket mechanisms for spreading our work around, not to mention tricks for leveraging the power of Hadoop Streaming. In each case, however, one thing stands out: we must be smart as modelers and understand what can and should be done in parallel.