Getting Started with R and Hadoop
Last week's meeting of the Chicago area Hadoop User Group (a joint meeting with the Chicago R User Group, sponsored by Revolution Analytics) focused on crunching Big Data with R and Hadoop. Jeffrey Breen, president of Atmosphere Research Group, frequently deals with large data sets in his airline consulting work, and R is his "go-to tool for anything data-related". His presentation, "Getting Started with R and Hadoop", focuses on the RHadoop suite of packages, and especially on the rmr package, which interfaces R with Hadoop. He lists four advantages of using rmr for big-data analytics with R and Hadoop:
- Well-designed API: code only needs to deal with basic R objects
- Very flexible I/O subsystem: handles common formats like CSV, and also allows complex line-by-line parsing
- Map-Reduce jobs can easily be daisy-chained to build complex workflows
- Concise code compared to other ways of interfacing R and Hadoop (the chart below compares the number of lines of code required to implement a map-reduce analysis using different systems)
For newcomers to map-reduce programming with R and Hadoop, Jeffrey's presentation includes a step-by-step example of computing flight times from air traffic data. The last few slides cover some advanced features: how to work directly with files in HDFS from R with the rhdfs package, and how to simulate a Hadoop cluster on the local machine (useful for development, testing and learning RHadoop). Jeffrey also mentions that the RHadoop tutorial is a good resource for new users.
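To give a feel for the shape of such an analysis, here is a minimal sketch of a flight-time computation written as two daisy-chained map-reduce jobs and run entirely in memory, in the spirit of simulating a cluster on the local machine. It is plain Python, not rmr and not Jeffrey's code: the `run_job` helper, the mapper and reducer names, and the sample records are all hypothetical, chosen only to illustrate the pattern.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample records: (carrier, origin, destination, air time in minutes)
FLIGHTS = [
    ("AA", "ORD", "LGA", 110),
    ("AA", "ORD", "LGA", 120),
    ("UA", "ORD", "SFO", 250),
    ("UA", "ORD", "SFO", 270),
    ("UA", "ORD", "LGA", 115),
]

def run_job(records, mapper, reducer):
    """Run one map-reduce pass in memory: map, group by key, then reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce each key's values; keys are sorted so the output order is stable.
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]

# Job 1: mean air time per (carrier, origin, destination) route.
def map_route(record):
    carrier, origin, dest, air_time = record
    yield (carrier, origin, dest), air_time

def reduce_mean(key, values):
    yield key, mean(values)

# Job 2: consumes job 1's output and keeps each carrier's longest mean air time.
def map_carrier(record):
    (carrier, _origin, _dest), avg_time = record
    yield carrier, avg_time

def reduce_max(key, values):
    yield key, max(values)

route_means = run_job(FLIGHTS, map_route, reduce_mean)
longest_by_carrier = run_job(route_means, map_carrier, reduce_max)

for carrier, minutes in longest_by_carrier:
    print(carrier, minutes)
```

The output of the first job is fed directly into the second, which is the daisy-chaining idea in miniature. On a real cluster the grouping step would be handled by Hadoop's shuffle rather than by a dictionary in memory.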
You can find Jeffrey's slides embedded below, and a video of the presentation is also available. You might also want to check out Jeffrey's older presentation Big Data Step-by-Step for tips on setting up a compute environment with Hadoop and R.