BerkeleyX Data Science on Apache Spark MOOC starts today

For the past several months, we have been working to produce two freely available Massive Open Online Courses (MOOCs). We are proud to announce that both MOOCs are launching this month on the BerkeleyX platform!

Today we launched the first course, CS100.1x, Introduction to Big Data with Apache Spark, a brand-new five-week course on Big Data, Data Science, and Apache Spark. Nearly 57,000 students have enrolled; for comparison, UC Berkeley’s 2014 enrollment was 37,581 students.

The first course teaches students about Apache Spark and how to perform data analysis. The course assignments include Log Mining, Textual Entity Recognition, and Collaborative Filtering exercises that use real-world data to teach students how to manipulate datasets using parallel processing with PySpark.

The second course, Scalable Machine Learning, will begin on June 29th. It will introduce the underlying statistical and algorithmic principles required to develop scalable machine learning pipelines and provide hands-on experience using Spark.

We would also like to thank the Spark community for their support. Several community members are serving as teaching assistants and beta testers, and multiple study groups have been organized by community members in anticipation of these courses.

Both courses are available for free on the edX website, and you can sign up for them today.

Students who complete a course can choose between a free Honor Code completion certificate and a paid edX IDVerified Certificate.

The courses are sponsored in part by the AMPLab and Databricks.