Large scale data analysis made easier with SparkR

R is a widely used statistical programming language and supports a variety of data analysis tasks through extension packages. In fact, a recent survey of data scientists showed that R is the most frequently used tool other than SQL databases. However, data analysis in R is limited: the runtime is single-threaded and can only process data sets that fit on a single machine.

In an effort to enable large scale data analysis from R, we have recently released SparkR. SparkR is an R package that provides a lightweight frontend to use Spark from R. SparkR allows users to create and transform RDDs in R and to interactively run jobs from the R shell on a Spark cluster. You can try out SparkR today by installing it from our GitHub repo.
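
A minimal sketch of getting started (assuming SparkR has been installed per the repo's instructions; the master URL and the example collection are placeholders):

library(SparkR)
# Connect to Spark; "local" runs Spark on the local machine
sc <- sparkR.init(master = "local")
# Distribute a local R list across the cluster as an RDD
rdd <- parallelize(sc, as.list(1:100))
# Bring the results back to the local R session
collect(rdd)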

Some of the key features of SparkR include:

RDDs as Distributed Lists: SparkR exposes the RDD API of Spark as distributed lists in R. For example, we can read an input file from HDFS and process every line using lapply on an RDD. In addition to lapply, SparkR also allows closures to be applied to every partition using lapplyPartition. Other supported RDD functions include operations like reduce, reduceByKey, groupByKey and collect.
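
For instance, computing the total length of the lines in an HDFS file might look like the following sketch (the file path is a placeholder):

# Read each line of the file as an element of a distributed list
lines <- textFile(sc, "hdfs://myfile.txt")
# Apply a closure to every element, just like lapply on a local list
lineLengths <- lapply(lines, function(line) { nchar(line) })
# Sum the per-line lengths into a single number
totalLength <- reduce(lineLengths, "+")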

Serializing closures: SparkR automatically serializes the necessary variables to execute a function on the cluster. For example, if you use global variables in a function passed to lapply, SparkR will automatically capture these variables and copy them to the cluster.
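
As an illustrative sketch, scaleFactor below is defined only in the local R session, but SparkR ships it to the workers along with the closure:

# A variable defined in the driver's R session
scaleFactor <- 10
nums <- parallelize(sc, as.list(1:5))
# scaleFactor is captured and serialized with the closure,
# so the function runs correctly on the cluster
scaled <- lapply(nums, function(x) { x * scaleFactor })
collect(scaled)   # list(10, 20, 30, 40, 50)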

Using existing R packages: SparkR also allows easy use of existing R packages inside closures. The includePackage command can be used to indicate packages that should be loaded before every closure is executed on the cluster.
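
For example, a closure can use functions from the Matrix package once it has been registered with includePackage (a sketch; Matrix stands in for whatever package the closure needs, and it must be installed on the worker nodes):

# Load the Matrix package before closures are executed on the cluster
includePackage(sc, Matrix)
generateSparse <- function(x) {
    # sparseMatrix comes from the Matrix package loaded above
    sparseMatrix(i = c(1, 2, 3), j = c(1, 2, 3), x = c(1, 2, 3))
}
sparseMats <- lapply(parallelize(sc, as.list(1:2)), generateSparse)
collect(sparseMats)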

Putting these features together in R can be very powerful. For example, the code to compute logistic regression using gradient descent is listed below. In this example, we read a file from HDFS in parallel using Spark and run a user-defined gradient function in parallel using lapplyPartition. Finally, the weights from the different machines are accumulated using reduce.

# Read the input points from HDFS as a distributed matrix
pointsRDD <- readMatrix(sc, "hdfs://myfile")
# Initialize weights (D is the number of features, assumed defined)
weights <- runif(n = D, min = -1, max = 1)
# Logistic gradient: labels in the first column, features in the rest;
# weights is captured from the driver and shipped with the closure
gradient <- function(partition) {
    Y <- partition[, 1]; X <- partition[, -1]
    t(X) %*% (1 / (1 + exp(-Y * (X %*% weights))) - 1) * Y
}
# Run 10 iterations, summing per-partition gradients with reduce
for (i in 1:10) {
    weights <- weights - reduce(lapplyPartition(pointsRDD, gradient), "+")
}

Right now, SparkR works well for algorithms like gradient descent that are parallelizable, but it requires users to decide which parts of the algorithm can be run in parallel. In the future, we hope to provide direct access to large scale machine learning algorithms by integrating with Spark’s MLlib. More examples and details about SparkR can be found at http://amplab-extras.github.io/SparkR-pkg.

Top 5 most influential works in data science?

As part of the Data-Driven Discovery Investigator Competition from the Gordon and Betty Moore Foundation, the Foundation asks for

five references to the most influential work in data science in the applicant’s view. This is distinct from the bio-sketch references and will not be a factor in the Foundation’s decision-making. This information will help the Foundation better understand the influential ideas related to data-driven discovery and data science.

After talking to others in the lab, I put together the list below, sorted by citation count according to Google Scholar. I’d love to hear comments on these and/or suggestions of others I missed.

    1. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., … & Grafham, D. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921. (16,000 citations)

The Human Genome Project turned the secret of life into digital information. On January 14, 2014, Illumina announced a new sequencing machine that can do the wet lab processing of a genome for $1,000. This price is widely believed to be a tipping point, and soon millions will have their genomes sequenced. At 25 to 250 gigabytes per genome, genetics is now Big Data.

    2. Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. (9,200 citations)

A simple, easy-to-use programming model to process Big Data. It led to the No-SQL movement, Hadoop, many startup companies, and awards for its authors.

    3. Blei, D., Ng, A., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022. (7,300 citations)

LDA allows sets of observations to be explained by unobserved groups. It spawned an entire industry of data-driven discovery for text and image corpora.

    4. Armbrust, M., Fox, A., Griffith, R., Joseph, A., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I., & Zaharia, M. (2010). A view of cloud computing. Communications of the ACM, 53(4), 50-58. (5,800 citations)

At a time when there was confusion as to what cloud computing was, this paper defined it, explained why it emerged when it did, and listed its challenges and opportunities.

    5. Stoughton, C., Lupton, R. H., Bernardi, M., Blanton, M. R., Burles, S., Castander, F. J., … & Carey, L. (2002). Sloan digital sky survey: Early data release. The Astronomical Journal, 123(1), 485. (2,100 citations)

Aided by computer scientist Jim Gray, astronomers made raw astronomical data available to a much wider community. It led to crowd-sourcing of astronomy through projects like Galaxy Zoo, so now anyone can help with astronomy research.