Top 5 most influential works in data science?

As part of its Data-Driven Discovery Investigator Competition, the Gordon and Betty Moore Foundation asks applicants for

five references to the most influential work in data science in the applicant’s view. This is distinct from the bio-sketch references and will not be a factor in the Foundation’s decision-making. This information will help the Foundation better understand the influential ideas related to data-driven discovery and data science.

After talking to others in the lab, I came up with the list below, sorted by citation count according to Google Scholar. I'd love to hear comments on these and suggestions of others I missed.

1. Lander, E. S., Linton, L. M., Birren, B., Nusbaum, C., Zody, M. C., Baldwin, J., … & Grafham, D. (2001). Initial sequencing and analysis of the human genome. Nature, 409(6822), 860-921. (16,000 citations)

The Human Genome Project turned the secret of life into digital information. On January 14, 2014, Illumina announced a new sequencing machine that can do the wet lab processing of a genome for $1000. This price is widely believed to be a tipping point, and soon millions will have their genomes sequenced. At 25 to 250 gigabytes per genome, genetics is now Big Data.

2. Dean, J., & Ghemawat, S. (2008). MapReduce: simplified data processing on large clusters. Communications of the ACM, 51(1), 107-113. (9,200 citations)

MapReduce is a simple, easy-to-use programming model for processing Big Data. It led to the NoSQL movement, Hadoop, many startup companies, and awards for its authors.
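For readers who have not seen the model, here is a minimal single-machine sketch of the idea, using word counting as the example. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the two toy documents are made up for illustration; a real system such as Hadoop runs the same three steps spread across many machines and handles failures, which this sketch does not attempt.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: combine all values for one key into a single result."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on a dict of documents."""
    pairs = (
        pair
        for doc_id, text in documents.items()
        for pair in map_phase(doc_id, text)
    )
    grouped = shuffle(pairs)
    return dict(reduce_phase(word, counts) for word, counts in grouped.items())


# Hypothetical toy input, just to exercise the three steps.
counts = word_count({"a.txt": "big data big ideas", "b.txt": "data science"})
print(counts)
```

The appeal is that the programmer only writes the map and reduce functions; because each map call and each reduce call is independent of the others, the framework is free to run them in parallel.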

3. Blei, D., Ng, A., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993-1022. (7,300 citations)

LDA allows sets of observations to be explained by unobserved groups. It spawned an entire industry of data-driven discovery for text and image corpora.
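To make "unobserved groups" concrete, here is a rough sketch of the story LDA assumes about how a document is produced: each document draws its own mix of topics, and each word is drawn from one of those topics. This shows only the generative side, not how the topics are learned from data. The two topics, their word probabilities, and the names `sample_dirichlet` and `generate_document` are invented for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, num_words, rng):
    """Generate one document under the LDA generative assumptions."""
    # Each document gets its own topic proportions (theta).
    theta = sample_dirichlet(alpha, rng)
    assignments, words = [], []
    for _ in range(num_words):
        # Pick the hidden topic for this word position.
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # Pick the observed word from that topic's word distribution.
        vocab, probs = zip(*topics[z].items())
        w = rng.choices(vocab, weights=probs)[0]
        assignments.append(z)
        words.append(w)
    return theta, assignments, words


# Hypothetical topics: one about genomics, one about computing.
TOPICS = [
    {"gene": 0.5, "genome": 0.3, "sequence": 0.2},
    {"cluster": 0.4, "server": 0.4, "query": 0.2},
]

rng = random.Random(0)
theta, assignments, words = generate_document(TOPICS, [0.5, 0.5], 8, rng)
print(theta)
print(list(zip(assignments, words)))
```

Fitting LDA runs this story in reverse: given only the words, it infers the topic proportions and the topics themselves.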