
In the mid-1940s, Claude Shannon (1948) invented probabilistic language models that predicted the next words in sentences using those that came before them.1 Around the same time, Warren McCulloch and Walter Pitts (1943) conceived a simple model of an artificial neuron for computation. Shortly thereafter, Herbert Robbins and Sutton Monro (1951) invented the stochastic approximation method for iteratively solving equations. Today’s impressive chatbots are neural-net implementations of language models that predict next words, trained by stochastic approximation methods (Brown et al., 2020). The flat circle of machine learning from 1940 to the present saw myriad other proposals for methods and theories. All these proposals were accepted or rejected through a near-infinite sequence of competitive tests on shared data.
As a researcher steeped in the theory, practice, and history of machine learning, I was struck by David Donoho’s (2024) articulation of frictionless reproducibility—evaluation through data, code, and competition—as the core force driving progress in data science. As he always does, Donoho pithily captures what I have been fumbling to articulate for years.
Donoho defines frictionless reproducibility by three aspirational pillars. Researchers should make data easily available and shareable. Researchers should provide easily re-executable code that processes this data to desired ends. Researchers should emphasize competitive testing as a means of evaluation. Through these pillars, frictionless reproducibility encapsulates the core contributions and aspirations of data science. The future of data science rests on encouraging the widespread adoption of frictionless reproducibility, educating researchers on its best practices while mitigating its many shortcomings.
Many of my machine learning colleagues are tempted to dismiss frictionless reproducibility by declaring it obvious. ‘It’s obvious that you should split your data into training and testing.’ ‘It’s obvious that shared data sets are the best way to evaluate machine learning methods.’ ‘It’s obvious that code should be widely shared and open.’
Though frictionless reproducibility has driven machine learning since its inception, the history of machine learning research demonstrates that none of this was ‘obvious,’ neither in the 1940s nor in 2012 when contemporary deep learning emerged as the dominant methodology in machine learning.2 Many fads, techniques, and theories have come and gone in machine learning. Frictionless reproducibility is the core that has proved fundamental.
One of my favorite examples highlighted in Donoho’s essay is the data set of 50 handwritten alphabets collected by Bill Highleyman at Bell Labs in 1959 to test primitive character recognition systems.3 People love to talk about the perceptron, but what initially drove machine learning was a race for scanning text into computers.
Highleyman’s data were letters avant la lettre. Indeed, despite the formation of journals and conferences, there was a great stagnation in pattern recognition methods after 1970. Papers from the 1970s in pattern recognition operated on tiny data sets that were seldom shared. By the time people started embracing the term ‘machine learning’ in the 1980s, pattern recognition wasn’t even part of the story. At the first International Conference on Machine Learning (ICML) in 1981, pattern recognition and data sets were nowhere to be found in the proceedings.4
The first proceedings of the Neural Information Processing Systems conference (NeurIPS, originally abbreviated NIPS) from 1987 sing in the same lofty, romantic language as today’s. The ideas and aspirations at NeurIPS have not changed much in the three and a half decades of the conference. The first official proceedings featured titles that are timeless.5
“MURPHY: A Robot that Learns by Doing”
“How Neural Nets Work”
“Encoding Geometric Invariances in Higher-Order Neural Networks”
“Performance Measures for Associative Memories that Learn and Forget”
“An Optimization Network for Matrix Inversion”
“Constrained Differential Optimization”
“Introduction to a System for Implementing Neural Net Connections on SIMD Architectures”
If you had told me these were from 30 years later in 2017, I would have believed you.
But it also wasn’t clear at the 1987 NeurIPS that pattern recognition would end up consuming this conference. John Platt, author of the aforementioned “Constrained Differential Optimization” paper, and I discussed this a few years ago (personal communications, October 16–18, 2020). He recalled confusion and excitement:
Remember that in that era, we were deeply confused about what ML was about. We didn’t even realize it was a branch of statistics until Baum and Wilczek published their paper in N[eur]IPS 0.
Pre-1987, the neural network field was deeply confused, but in a very hopeful way (it wasn’t called ML). There was hope that neural nets would displace all computation. That it would be a new way to program, with Brain-like software and hardware. There was tremendous excitement in the air, even if we were all deeply confused. Everyone was defining their own problem and pulled in different directions.
Data set benchmarking and competitive testing took over machine learning in the late 1980s. Email and file transfer were becoming more accessible. The current specification of FTP was finalized in 1985. In 1987, a PhD student at UC Irvine named David Aha put up an FTP server to host data sets for empirically testing machine learning methods. Aha was motivated by service to the community, but he also wanted to show his nearest-neighbor methods would outperform Ross Quinlan’s decision tree induction algorithms. He formatted his data sets using the ‘attribute-value’ representation that Quinlan (1986) had used. And so, the UC Irvine Machine Learning Repository was born.6
Improvements in computing greased the wheels, giving us faster computers, faster data transfer, and smaller storage footprints. But computing technology alone was not sufficient to drive progress. Friendly competition with Quinlan inspired Aha to build the UCI repository. And more explicit competitions were also crucial components of the success.
As Donoho observes, the other notable shift in machine learning was a demand from funding agencies for more quantitative metrics, which forced artificial intelligence (AI) researchers to find consistent, reliable quantities for comparison. AI had found itself in one of its perennial funding winters, and program managers demanded more ‘results’ before they would be willing to write grant checks. In 1986, DARPA Program Manager Charles Wayne proposed a speech recognition challenge in which teams would receive a training set of spoken sentences and be evaluated by the word error rate their methods achieved on a hidden test set.7 Wayne’s revolutionary reframing of evaluation around metrics drove rapid progress in speech recognition for decades (Liberman & Wayne, 2020).
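For readers unfamiliar with the metric, here is a minimal sketch of word error rate: the edit distance (substitutions, insertions, and deletions) between a reference transcript and a system’s hypothesis, divided by the length of the reference. The function below is my own illustrative implementation, not code from Wayne’s evaluations.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance between word sequences, normalized by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(substitution, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution and one deletion against a six-word reference (WER = 2/6).
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because the test transcripts stayed hidden, teams could only lower this number by genuinely improving their recognizers, which is exactly the competitive pressure Wayne wanted.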
These examples from the 1980s show how machine learning and pattern recognition technologies have only advanced through disputes over whose method fared better on different data sets. Researchers were not only driving down test error because they had an application they cared about engineering; they also wanted to gloat to their friends and crush their enemies.
Finally, the most significant change in machine learning practice, which has only become broadly standard over the last decade, has been re-executable code. Getting people’s code before 2010 was incredibly difficult. Those who spent time writing good software packages (like SVMLight8 or Torch9) saw their methods receive more citations. But it took a while for the field to catch on that good software was also a faster path to research gold than almost any other. It’s easier to beat someone in competition if you can take exactly what they did and change only a few parts. When Aha put the UCI repository together in 1987, we had barely invented FTP. In 2008, GitHub launched, and the world was never the same.
Inspired by Thomas Kuhn (1962), we can think of the scientific and engineering process as a massively parallel genetic algorithm. If we want to improve upon the systems we currently have, we might try a small perturbation to see if we get an improvement. If we can find a small change that improves some desired outcome, we could change our systems to reflect this improvement. If we continually search for these improvements and work hard to demonstrate their value, we may head in a better direction over time.
Think about your annual visit to the optometrist. To fit your prescription, the optometrist tries a random direction of change to the current lenses. If the letters look better, you keep the new setting. If worse, you go back to where you started. Within a few minutes, you arrive at a prescription for new glasses.
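As a concrete illustration, here is a minimal sketch of this accept-if-better random search in Python. The function names and the quadratic test loss are mine, chosen only for illustration; they do not come from any particular paper or library.

```python
import random

def random_search(loss, x0, step=0.1, iters=1000, seed=0):
    """Propose a random perturbation; keep it only if the loss improves."""
    rng = random.Random(seed)
    x = list(x0)
    best = loss(x)
    for _ in range(iters):
        candidate = [xi + rng.gauss(0.0, step) for xi in x]
        value = loss(candidate)
        if value < best:  # accept only improvements, otherwise stay put
            x, best = candidate, value
    return x, best

# Example: crawl toward the minimum of a simple quadratic 'prescription.'
solution, value = random_search(lambda v: sum(vi ** 2 for vi in v), [5.0, -3.0])
print(solution, value)
```

The scientific analogue replaces the quadratic with whatever outcome a community cares about and replaces the single loop with many researchers perturbing in parallel.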
For scientific endeavors, we could perhaps gauge ‘better’ or ‘worse’ by performing random experiments—not randomized experiments per se, but random experiments in the sense of trying potentially surprising improvements. If our small tweak results in better outcomes, we can attempt to convince a journal editor or conference program committee to publish it. And this communication gives everyone else a new starting point for their own random experimentation.
This caricature isn’t that far off from how we do scientific research. Moreover, it’s a conceptually sound, though inefficient, methodology. In mathematical optimization theory, iteratively testing random improvements will crawl toward optimality. Researchers in online learning have even computed convergence rates for this method (Flaxman et al., 2005). Even if you make many mistakes (by, say, accepting false positives because your rejection threshold is too low), the algorithm eventually compensates for the bad decisions (Jamieson et al., 2012).
A single investigator can only make so much progress by random searching alone, but random search is pleasantly parallelizable. Competing scientists can independently try their own random ideas and publish their results. Sometimes an individual result is so promising that the herd of experimenters all flock around the good idea, hoping to strike gold on a nearby improvement and bring home bragging rights. To some, this looks like an inefficient mess. To others, it looks like science.
There are downsides to reenvisioning science and engineering as swarming randomized experimentation. This process can and certainly does get stuck at local optima. The unfortunate experience shared by every bench scientist and every policy wonk is that most interventions simply do not work. But people are stubborn, prideful, and reluctant to admit defeat. Hence, scientific communities can chase illusory advantages for far longer than might seem reasonable.
Donoho highlights some potential structural problems as well, asking whether this paradigm entrenches power in big companies. He points to trends in machine learning suggesting that data sets and computations need to be ever larger to answer the questions spurred by frictionless reproducibility. Does that mean academic research is now impossible? In the case of machine learning, training hundreds of perturbations of large generative AI models is probably only doable in well-funded industrial labs. But on the positive side, some of the most clever insights about such models have recently come from academia, for example, the explanation of emergent behavior by Schaeffer et al. (2024) and the proposal of streamlined training methods by Rafailov et al. (2024). But the spectre of oligarchic tech companies stealing everyone’s thunder still looms large, and researchers in this new ecosystem must think creatively about how to be a small player in a world of lurching giants.
And what does frictionless reproducibility bring to the human-facing sciences? There, the random random experiment paradigm—running randomly chosen randomized experiments—also runs into ethical roadblocks. Endless, mindless experiments on human populations are not feasible or ethical. To run a medical trial, there must be significant disagreement about whether a treatment has a beneficial effect. And as we move into fuzzier spaces like development economics, with incredibly weak interventions, outcomes that defy quantification, and power calculations calling for millions of subjects, perhaps we have moved outside the scope of the great genetic algorithm of science. For many societal problems, we should agree to settle for other means of sense-making beyond mindless datafication. Random experiments are powerful, but they are not the only means of understanding the world.
Finally, I would like to consider the human toll of the singularity of frictionless reproducibility. Overproduction is an unfortunate artifact of frictionless reproducibility. In this paradigm, a publishable result only requires downloading someone else’s code, making a few changes until reviewers might agree it is sufficiently different, and uploading a PDF. Ah, the joy of frictionless paper writing! Our research tooling has only lubricated the process. Writing LaTeX is so streamlined that every conversation immediately becomes an Overleaf project. Code can be git-pulled, modified, and effortlessly turned into a new repo. Everyone has a Google Scholar page highlighting citation counts, h-indices, and i10 numbers. These scores can be easily processed by the hiring managers at AI research labs. The conferences are all run by byzantine HR systems that accelerate form-filling and button-checking.
In 2023, the NeurIPS conference received nearly 13,000 submissions. Graduate students now regularly produce dozens of papers before finishing their PhDs. We all get lost in a sea of too much information where no one’s individual voice can be heard. We work harder and get less feedback on our work. Is frictionless reproducibility worth the human cost? Are we grinding young scholars too hard in the pursuit of generating stock photos, inaccurate advice, or answers to homework problems? Both things can be true: science can advance while scientists’ lives diminish.
This leads me to a piece of advice I really hope we all consider. To all the young scholars out there in data science: You should finish your PhD with three papers that you are decidedly passionate about. Three papers that you can tell a strong story about. And if your friend asks you to work on some other project distracting from those three, it’s OK to say, ‘Your project is amazing, but I don’t have time to give it my all for the deadline.’ These are simple things. If we all did them, would we be impeding scientific progress? I think that’s the wrong question to be asking.
Despite its flaws, I remain bullish on frictionless reproducibility. Sharing data, code, and benchmarks can drive unimaginable progress. Fields in the throes of reproducibility crises would benefit from worrying more about data and code sharing than about forced, arbitrary epistemic rigor.
Frictionless reproducibility leaves us with many challenging research questions. What exactly is it about benchmarks that measures progress? Why is it that certain benchmarks carry more weight than others? What are the proper standards for reproducibility in individual fields?
And in what capacity can frictionless reproducibility be applied to the human-facing sciences? How can we surmount issues of privacy and sensitivity? Do we want to ask questions that can only be answered by intensive random experimentation and datafication? Does this rule out truly hard and impactful problems? And how can the data science community engage with the negative human consequences of overproduction? By introducing frictionless reproducibility, Donoho has articulated what data science is. But now it is up to the rest of us to dictate what it should be.
This article was adapted from a series of blog posts by the author at argmin.net. The main sources include “The Department of Frictionless Reproducibility,”10 “The Data Winter,”11 “The National Academy of Spaghetti on The Wall,”12 and “Too Much Information.”13 BR would like to thank the many interlocutors in the blog comments and on Twitter for constructive feedback on these blog posts. He also thanks Jessica Dai, Damek Davis, and Sarah Dean for many helpful comments and suggestions on this commentary.
Benjamin Recht has no financial or non-financial disclosures to share for this article.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., & Amodei, D. (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in Neural Information Processing Systems, 33 (pp. 1877–1901). Curran Associates. https://proceedings.neurips.cc/paper_files/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html
Chow, C. K. (1962). A recognition method using neighbor dependence. IRE Transactions on Electronic Computers, EC-11(5), 683–690. https://doi.org/10.1109/TEC.1962.5219431
Donoho, D. (2024). Data science at the singularity. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.b91339ef
Flaxman, A. D., Kalai, A. T., & McMahan, H. B. (2005). Online convex optimization in the bandit setting: Gradient descent without a gradient. In SODA ’05: Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms (pp. 385–394). ACM. https://dl.acm.org/doi/10.5555/1070432.1070486
Hardt, M., & Recht, B. (2022). Patterns, predictions, and actions: Foundations of machine learning. Princeton University Press. https://mlstory.org/
Highleyman, W. H. (1963). Data for character recognition studies. IEEE Transactions on Electronic Computers, EC-12(2), 135–136. https://doi.org/10.1109/PGEC.1963.263427
Jamieson, K. G., Nowak, R., & Recht, B. (2012). Query complexity of derivative-free optimization. In F. Pereira, C. J. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems (Vol. 25, pp. 2672–2680). Curran Associates. https://papers.nips.cc/paper_files/paper/2012/hash/e6d8545daa42d5ced125a4bf747b3688-Abstract.html
Kuhn, T. S. (1962). The structure of scientific revolutions. University of Chicago Press.
Liberman, M., & Wayne, C. (2020). Human language technology. AI Magazine, 41(2), 22–35. https://doi.org/10.1609/aimag.v41i2.5297
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5, 115–133. https://doi.org/10.1007/BF02478259
Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.) (1983). Machine learning: An artificial intelligence approach. Springer-Verlag. https://doi.org/10.1007/978-3-662-12405-5
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1, 81–106. https://doi.org/10.1007/BF00116251
Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2024). Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36. https://openreview.net/forum?id=HPuSIXJaa9
Robbins, H., & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, 22(3), 400–407. https://doi.org/10.1214/aoms/1177729586
Schaeffer, R., Miranda, B., & Koyejo, S. (2024). Are emergent abilities of large language models a mirage? Advances in Neural Information Processing Systems, 36. https://openreview.net/forum?id=ITw9edRDlD
Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423. http://doi.org/10.1002/j.1538-7305.1948.tb01338.x
©2024 Benjamin Recht. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.