Large scale data analysis made easier with SparkR


R is a widely used statistical programming language and supports a variety of data analysis tasks through extension packages. In fact, a recent survey of data scientists showed that R is the most frequently used tool other than SQL databases. However, data analysis in R is limited because the runtime is single-threaded and can only process data sets that fit in a single machine's memory.

In an effort to enable large scale data analysis from R, we have recently released SparkR. SparkR is an R package that provides a light-weight frontend to use Spark from R. SparkR allows users to create and transform RDDs in R and interactively run jobs from the R shell on a Spark cluster. You can try out SparkR today by installing it from our github repo.
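For example, here is a minimal sketch of what an interactive SparkR session can look like (the local master URL and the element-squaring job are purely illustrative):

library(SparkR)
# Connect to Spark; replace "local" with a spark:// URL for a real cluster
sc <- sparkR.init(master = "local")
# Distribute a local R vector as an RDD and transform it on the cluster
rdd <- parallelize(sc, 1:100)
squares <- lapply(rdd, function(x) { x * x })
# Fetch the results back into the local R session
collect(squares)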

Some of the key features of SparkR include:

RDDs as Distributed Lists: SparkR exposes the RDD API of Spark as distributed lists in R. For example, we can read an input file from HDFS and process every line using lapply on an RDD. In addition to lapply, SparkR also allows closures to be applied on every partition using lapplyPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey and collect.
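As an illustrative sketch of this pattern (the HDFS path is a placeholder), we can compute the length of every line of a file in parallel and sum the lengths on the cluster:

# Read an input file from HDFS as an RDD of lines
lines <- textFile(sc, "hdfs://namenode/path/to/input.txt")
# Apply a function to every line in parallel
lineLengths <- lapply(lines, function(line) { nchar(line) })
# Aggregate the per-line results on the cluster
totalLength <- reduce(lineLengths, "+")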

Serializing closures: SparkR automatically serializes the necessary variables to execute a function on the cluster. For example, if you use some global variables in a function passed to lapply, SparkR will automatically capture these variables and copy them to the cluster.
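For instance, in the small sketch below (scaleFactor is just an illustrative global variable), a variable defined in the local R session is referenced inside the closure and shipped to the workers automatically:

# A global variable defined in the driver's R session
scaleFactor <- 2.5
rdd <- parallelize(sc, 1:100)
# scaleFactor is serialized and copied to the workers along with the closure
scaled <- lapply(rdd, function(x) { x * scaleFactor })
collect(scaled)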

Using existing R packages: SparkR also allows easy use of existing R packages inside closures. The includePackage command can be used to indicate packages that should be loaded before every closure is executed on the cluster.
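As a sketch of how this might look (using the Matrix package purely as an example; it must already be installed on every worker node):

# Load the Matrix package on the workers before any closure runs
includePackage(sc, Matrix)
rdd <- parallelize(sc, 1:10)
counts <- lapply(rdd, function(n) {
    # sparseMatrix() and nnzero() come from the Matrix package loaded above
    m <- sparseMatrix(i = 1:n, j = 1:n, x = 1)
    nnzero(m)
})
collect(counts)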

Putting these features together in R can be very powerful. For example, the code to compute logistic regression using gradient descent is listed below. In this example, we read a file from HDFS in parallel using Spark and run a user-defined gradient function in parallel using lapplyPartition. Finally, the weights from different machines are accumulated using reduce.

# Read the data points from HDFS as an RDD of matrix partitions
pointsRDD <- readMatrix(sc, "hdfs://myfile")
# Initialize weights (D is the number of features)
weights <- runif(n = D, min = -1, max = 1)
# Logistic gradient over one partition; `weights` is captured from the
# driver and shipped to the workers automatically
gradient <- function(partition) {
    # First column holds the labels Y, remaining columns hold the features X
    Y <- partition[, 1]; X <- partition[, -1]
    t(X) %*% ((1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y)
}
# Run 10 iterations of gradient descent, summing per-partition gradients
for (i in 1:10) {
    weights <- weights - reduce(lapplyPartition(pointsRDD, gradient), "+")
}

Right now, SparkR works well for algorithms like gradient descent that are easy to parallelize, but it requires users to decide which parts of the algorithm can be run in parallel. In the future, we hope to provide direct access to large scale machine learning algorithms by integrating with Spark’s MLlib. More examples and details about SparkR can be found at http://amplab-extras.github.io/SparkR-pkg.