Disk-Locality in Datacenter Computing Considered Irrelevant

A popular approach to improve the efficiency of cluster computing frameworks has been to increase disk-locality – the fraction of tasks that run on nodes that have the task’s input data stored on local disk. This paper takes the position that disk-locality is going to be irrelevant in cluster computing, and considers the implications this will have on datacenter computing research. The quest for disk-locality is based on two assumptions: (a) disk bandwidths exceed network bandwidths, (b) disk I/O constitutes a considerable fraction of a task’s lifetime. We show that current trends undermine both these assumptions.