Spark 100TB and 1PB Sort on a Public Cloud Cluster

AMPLab alum Reynold Xin posted an article on the Databricks blog describing recent performance results for large-scale sorting using Spark on a public cloud (AWS) cluster. The post reports the following numbers:

  • A 100TB Sort in 23 minutes on 206 machines
  • A 1PB Sort in less than 4 hours on 190 machines

These numbers beat previously published results, including the current “Daytona Gray” 100 TB benchmark “world record” (held by a Yahoo! Hadoop cluster – 72 minutes on 2100 machines).
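
To put the comparison in perspective, here is a rough back-of-the-envelope sketch in Python that turns the quoted figures into average per-machine throughput. It uses only the numbers stated above. The function name is illustrative, a terabyte is assumed to be 10^12 bytes, and the Hadoop run is assumed to have sorted the nominal 100 TB. Hardware differences between the two clusters are ignored, so treat the result as a rough indication rather than a like-for-like measurement.

```python
# Back-of-the-envelope comparison using only the figures quoted in this post.
# Assumptions (not from the post): 1 TB = 10**12 bytes, and the Hadoop
# record is treated as a sort of the nominal 100 TB.

TB = 10**12  # bytes per terabyte (decimal; an assumption)
MB = 10**6   # bytes per megabyte (decimal; an assumption)


def per_machine_throughput_mb_s(data_tb, minutes, machines):
    """Average sort throughput per machine, in MB/s."""
    total_bytes = data_tb * TB
    seconds = minutes * 60
    return total_bytes / seconds / machines / MB


# Spark: 100 TB in 23 minutes on 206 machines.
spark = per_machine_throughput_mb_s(100, 23, 206)

# Hadoop record: 72 minutes on 2100 machines.
hadoop = per_machine_throughput_mb_s(100, 72, 2100)

speedup = 72 / 23           # ratio of wall-clock times
machine_ratio = 2100 / 206  # ratio of cluster sizes

print(f"Spark:  {spark:.0f} MB/s per machine")
print(f"Hadoop: {hadoop:.0f} MB/s per machine")
print(f"Wall-clock speedup: {speedup:.1f}x with {machine_ratio:.1f}x fewer machines")
```

Under these assumptions, the Spark run finishes about 3 times faster on roughly a tenth of the machines, which works out to around 30 times more data sorted per machine per second.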

The benchmark was run on data stored in HDFS (not cached in memory). See Reynold’s post for details on the results and implementation.