Facebook-class social network analysis with R and Hadoop
In computing, social networks are traditionally represented as graphs: a collection of nodes (people), pairs of which may be connected by edges (friend relationships). Visually, a social network can then be represented like this:
Social network analysis often amounts to calculating statistics on such a graph: the number of edges (friends) connected to a particular node (person), and the distribution of the number of edges connected to nodes across the entire graph. When the graph consists of up to 10 billion elements (nodes and edges), such computations can be done on a single server with dedicated graph software like Neo4j. But bigger networks, like Facebook's social network, which is a graph with more than 60 billion elements, require a distributed solution.
Marko A. Rodriguez, a graph consultant with Aurelius, shows in a blog post how to use R and Hadoop (integrated via Revolution Analytics' RHadoop packages) to analyze Facebook-scale social networks. He first simulates a social network (shown at the top of this post) using R's igraph package, then distributes the network across the Hadoop cluster with the to.dfs function (from the rmr package). He then uses the mapreduce function (also from rmr) to write a simple map-reduce algorithm in R that counts the number of edges associated with each node.
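Marko's R code is not reproduced here. As a rough illustration of the idea only, the map and reduce steps for counting edges per node can be sketched in plain Python, with no Hadoop or RHadoop involved. The toy edge list, the function names, and the small in-memory `run_mapreduce` helper are all assumptions made for this sketch. It counts both endpoints of every edge (undirected degree), which may differ from what the original job does.

```python
from collections import defaultdict

# Hypothetical toy edge list: each pair is one friendship between two people.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

def map_edge(edge):
    """Map step: emit (node, 1) for each endpoint of an edge."""
    u, v = edge
    return [(u, 1), (v, 1)]

def reduce_counts(node, ones):
    """Reduce step: sum the 1s emitted for a node to get its edge count."""
    return node, sum(ones)

def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for a map-reduce runtime: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

degrees = run_mapreduce(edges, map_edge, reduce_counts)
print(degrees)
```

On a real cluster the grouping by key is done by the framework's shuffle phase, so only the two small functions would need to be written by hand.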
From there, it's another simple map-reduce job to calculate the connectivity statistics for the entire network. For more details on how Marko used RHadoop to perform this analysis, see the entire blog post linked below.
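To make that second step concrete, here is a minimal plain-Python sketch of one way to get a degree distribution with a map-reduce job: map each (node, degree) record to (degree, 1), then sum per degree. The sample degrees, the function names, and the in-memory `run_mapreduce` helper are hypothetical and are not taken from Marko's post.

```python
from collections import defaultdict

# Hypothetical output of a degree-counting job: node -> number of edges.
degrees = {"a": 2, "b": 2, "c": 3, "d": 1}

def map_degree(record):
    """Map step: drop the node name and emit (degree, 1)."""
    _node, degree = record
    return [(degree, 1)]

def reduce_counts(degree, ones):
    """Reduce step: count how many nodes share this degree."""
    return degree, sum(ones)

def run_mapreduce(records, mapper, reducer):
    """In-memory stand-in for a map-reduce runtime: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

distribution = run_mapreduce(degrees.items(), map_degree, reduce_counts)
for degree in sorted(distribution):
    print(degree, distribution[degree])
```

The result maps each degree to the number of nodes having it, which is the raw material for the connectivity statistics of the whole network.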
Aurelius blog: Graph Degree Distributions using R over Hadoop