AMPLab Alum Reynold Xin posted an article on the Databricks blog describing recent performance results for large-scale sorting using Spark on a public cloud (AWS) cluster. The blog reports the following numbers:
- A 100TB Sort in 23 minutes on 206 machines
- A 1PB Sort in less than 4 hours on 190 machines
These numbers beat previous published results including the current “Daytona Gray” 100 TB benchmark “world record” (held by a Yahoo! Hadoop cluster – 72 minutes on 2100 machines).
The benchmark was done on data stored on HDFS (not cached in memory). See Reynold’s post for details on the results and implementation.