
People in 17th- to 20th-century Europe were not smarter or harder working than people in 10th-century Europe. How, then, were they able to figure out the structure of matter and the universe, go flying into space, double the average length of human life, and master the transfer of energy and information—in just 300 years? What was the mysterious mechanism driving science and innovation at such an incredible pace?
In my view, David Donoho’s (2024) article calls our attention to just such a mechanism, whose overwhelming effects are plain to see while the mechanism itself remains invisible. Elusive as it may be, I will claim in what follows that understanding the mechanism Donoho lays out is perhaps the most crucial insight for today’s statisticians and data scientists.
We human beings have an exceptional ability to cooperate in large groups. One can easily make the case that it is this capacity—and not our thumbs—that made us the undisputed masters of our planet. Cooperation in large groups is so natural to us that we hardly ever notice it for the collective superpower that it truly is. Scientists and engineers in 17th- to 20th-century Europe, and gradually everywhere else, could accomplish so much in 300 years because they were able to cooperate with other scientists and engineers in large groups, which transcended space and time. Isaac Newton did not have to personally know Johannes Kepler; Albert Einstein did not have to personally know Albert Michelson and Edward Morley. Every single discovery in science, engineering, and medicine is the result of an individual extending the work of many hundreds or thousands of peers whom they never could have met in person.
How did they cooperate? What makes cooperation in a scientific community possible? They had (i) a joint goal, (ii) a mode of large-scale participation in the collaborative effort, and (iii) shared critical standards for judging which contributions point ‘forward.’ Focusing on the scientific enterprise, the joint goal was to provide a human-graspable description of observed natural phenomena; the mode of participation was the contribution of peer-reviewed papers to scientific journals; and the shared critical standards were Francis Bacon’s scientific method. Other large-scale collaborative communities, such as engineering, medicine, and mathematics, used variations on this theme, with different joint goals and different shared critical standards.
Donoho mentioned fish in water, and I was reminded of someone else who spoke of fish being blind to water: Marshall McLuhan, the 20th-century Canadian scholar who defined what we know today as the field of communication studies. McLuhan famously said “the medium is the message,” and in doing so asked us to focus our attention on the medium—the apparatus that enables human communication and, therefore, human cooperation—and not on the information content it is used to deliver. Why the name “medium”? Like air or water, communication media tend to be invisible, yet have an all-encompassing effect on observable phenomena. One can become aware of the invisible medium by carefully observing the phenomena that it enables.
Similarly, the overwhelming importance of large-scale cooperation in science, by means of a joint goal, a mode of participation, and shared critical standards, is easily overlooked. And yet every time I open a tap, drive a car, turn on an electric switch, or use my glasses (not to mention the laptop I use as I print these words) I directly benefit from the fact that several hundred thousand people in the past 300 years agreed on a joint goal, a mode of participation, and shared critical standards, and were thus able to accomplish the incredible scientific, engineering, and medical feats we take for granted every day.
The time and place of the onset of the scientific revolution are no coincidence: The mode of participation for the scientific, engineering, and medical enterprises, as well as for mathematics and statistics, is the academic journal, which has been with us since 1665. The journal was of course enabled by the printing press, invented some 200 years earlier. Church scholars in 16th- to 17th-century Europe saw the earth shake beneath their feet: an awesome new force had arrived, one that did not care about their creed or their approach to scholarly work. The printed scientific paper shuttered the world that preceded it. It carries human-graspable ideas: differential equations, chemical compounds, atomic numbers, p values, and tiny amounts of data contained in printed tables. The printed scientific paper has created an invisible yet overwhelming tilt toward scientific critical thinking and ideas that can be expressed simply and succinctly in words or equations. Such was, and mostly still is, the tradition in statistics.
History does rhyme: all this recently happened again. The digital age, which Donoho describes at length, shuttered the world created by print culture. Just as the core ingredients of print culture are ideas, the core ingredients of digital culture are (i) data and (ii) instructions for Turing’s universal machine (which we now call code). We have covered the planet with fiber optics and broadband wireless data antennas; we are now able to communicate both data and instructions for the universal machine at near-infinite speeds between hand-held devices and mammoth compute centers. One would have to be naïve to expect that this would result in anything short of a revolution on a scale comparable to, or possibly even greater than, the revolution that came in the wake of the printing press.
This revolution is now all around us. In the digital realm, the information being exchanged is as different from print as print was different from manuscripts hand-copied in monasteries. Specifically, the information exchanged in the digital realm involves data in large amounts, code, and code execution events. It tilts human action in a very different direction from print—away from critical thinking and intellectual discussions, and toward ‘doing.’
The institutions of science—including mainstream academic communication, planning, academic hiring and tenure, funding, and teaching—have largely ignored this revolution. Science meagerly embraced the digital age by adopting a digital replica of print culture—the ‘digital’ scientific paper, the online scientific journal, and email. The ‘digital’ paper, journal, and email—which, to my mind, are not digital at all—are simple, even naïve, onscreen replicas of print culture. They are essentially identical to printed scientific papers, printed scientific journals, and Pony Express mail.
However, while the mode of participation in science resisted the digital age and remained in print culture, data-scientific activity in empirical scientific domains shifted wholesale to the digital realm. Research groups are collecting vast amounts of data, writing large bodies of code, and internally producing enormous numbers of code execution events. The mode of participation remains the academic journal, yet the print culture academic journal has no ability to communicate data, code, and code execution events. The bulk of research work in many disciplines, which now involves gathering data and running code, remains blocked by the academic journal mode of participation and is never shared with the community. This has led to a drastic decline in scientific cooperation, which, ironically, hides behind a massive bloating of the scientific literature. We are living a grand self-deception of sorts: the institutions of science have not changed since the 1900s, yet, in effect, each lab is doing software work and mostly sharing words. What we are seeing in the scientific literature is growing meaninglessness, as what is done rapidly diverges from what is communicated.
A small community of scientists noticed this as early as the 1990s. Jon Claerbout, Donoho, and other pioneers have emphasized the importance of sharing data and code alongside each journal publication. Paraphrasing Claerbout, Donoho and coauthors (2008, p. 9) wrote, “an article about computational science in a scientific publication is not the scholarship itself, it’s merely scholarship advertisement. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.” From the perspective presented here, it appears that the call for reproducible research is much, much more than a correct way of doing things. It is an invitation for the scientific community to realize that print culture can no longer allow scientists to communicate the work they actually perform. Large-scale collaboration in science, as well as in engineering and medicine, requires a digital mode of participation that fundamentally transcends the journal paper and allows stakeholders of the scientific enterprise to communicate their actual work to their peers across time and space.
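To make the contrast concrete, consider what “the complete set of instructions which generated the figures” could look like in practice. The sketch below is purely illustrative, in Python, with a hypothetical data file and figure of my own invention (not taken from Donoho or Claerbout); it is the kind of small, runnable artifact that, shared alongside a paper, lets a reader regenerate a figure rather than merely read a description of it.

```python
# regenerate_figure_1.py -- hypothetical companion script to a (fictitious) paper,
# shared alongside the raw data it reads, so that anyone can re-create Figure 1.
import numpy as np
import matplotlib.pyplot as plt

# "measurements.csv" is a stand-in name for data deposited with the paper:
# two columns (signal-to-noise ratio, estimation error), one row per experiment.
snr, error = np.loadtxt("measurements.csv", delimiter=",", skiprows=1, unpack=True)

# The entire analysis behind the figure is visible here, not merely summarized in prose.
order = np.argsort(snr)
plt.plot(snr[order], error[order], marker="o")
plt.xlabel("Signal-to-noise ratio")
plt.ylabel("Estimation error")
plt.title("Figure 1 (regenerated from the shared data)")
plt.savefig("figure_1.png", dpi=300)
```

The specific plotting library matters little; the point is that the shared artifact is the working computation itself, which any reader can rerun, inspect, and extend.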
The field of statistics has by and large resisted this call, to its great detriment. True, many journals have adopted some means of data and code sharing; however, mainstream communication, planning, academic hiring and tenure, funding, and teaching have largely remained marooned in print culture. One cannot hope to be offered a tenure-track position in a leading statistics department, for example, based on groundbreaking work in data and code alone, without journal publications. Progress in the academic field of statistics, and in empirical scientific fields, has been materially slowed by staying rooted in print culture. Research in industry, less encumbered by traditional institutions, has been much faster to adopt truly digital cooperation. To return to the cutting edge of data science, it is necessary to depart from academic journals and embrace a truly digital mode of collaboration. By resisting this change, a scientific community may find itself in a full-blown reproducibility crisis.
Things are a little different in some computer science research communities. Computer science was the first to pick up the digital gauntlet, so to speak, and gradually moved away from peer-reviewed journals—first into conference papers and then into tweets. Cutting-edge research in parts of computer science, such as machine learning, is now published first via tweets, GitHub, and Hugging Face—and achieves much of its impact with no journal publication at all. Those of us who grew up in print culture may be blinking in the headlights, just like the medieval-mindset church scholars who saw print culture engulf them. A member of the last generation to grow up in print culture, I am certainly one of those blinking.
The accelerating AI tsunami around us is a revolution that will define a generation of researchers. I used to explain this tsunami to myself by appeal to Moore’s law and its analogues for storage and bandwidth. Simply put, I thought, compute, data, and bandwidth had multiplied to the point where very massive computer systems became possible.
My take on Donoho’s article is that he would disagree. From the perspective offered above, his paper seems to make the bold claim that we are witnessing something comparable to the scientific revolution. A huge new community has emerged—which we may call empirical machine learning. As McLuhan taught us, while the acceleration in AI is plain to see, to see the underlying cause, one has to inspect the medium. How does the empirical machine learning community establish large-scale cooperation across space and time? Its shared goal is to make better and better AI models. Rather than academic journals, its mode of participation is X / Twitter (for announcing new results), GitHub (for sharing code), and Hugging Face (for sharing models). Rather than the scientific method, its critical standards are benchmark leaderboards. (Donoho explains at length the tremendous importance of benchmarks as a shared critical standard.) What Donoho called “frictionless reproducibility” has all the makings of large-scale cooperation, of the same kind that took humanity from medieval science to a human hopping on the moon in less than three short centuries.
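To make this mode of participation tangible, here is a minimal sketch, assuming Python with the `transformers` library installed and network access; the particular model name is simply a well-known public example on Hugging Face, not one discussed by Donoho. A model trained and published by one group can be pulled and run by anyone else in a few lines, with no journal mediating the exchange.

```python
# A minimal illustration of "frictionless" sharing in empirical machine learning:
# pull a model another group published on Hugging Face and run it locally.
from transformers import pipeline

# Downloads the shared model weights on first use (assumes network access).
classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("Frictionless reproducibility makes large-scale cooperation possible."))
# -> e.g., [{'label': 'POSITIVE', 'score': 0.99...}]
```

Again, the specific library is beside the point: what is exchanged is the working artifact itself, and the benchmark leaderboard, not a referee, decides whether it points ‘forward.’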
The goal, mode of participation, and critical standards in empirical machine learning are completely different from those of print culture scientific communities. They created a huge, growing, and vibrant research community, capable of large-scale cooperation. This community has generated scientific breakthroughs (e.g., in protein folding and linguistics) by means utterly different from anything seen in print culture. And this community is still advancing in quantum leaps, with new breakthroughs weekly, and a rate of acceleration that is to print culture what print culture was to the 17th-century church scholars. History rhymes.
Donoho’s message is crucial for data science. In the digital realm, completely new horizons for data science are possible. For example, in “50 Years of Data Science” Donoho (2017) suggested that, once science shifts to a new mode of participation that makes all data and code produced in research work available, the work products of the scientific community themselves can become the object of a new kind of scientific study.
A discussion point I feel is missing from Donoho’s (2024) article involves the lack of critical thinking in empirical machine learning. As mentioned, print culture is conducive to critical thinking, while digital culture is conducive to ‘doing,’ which often comes with a lack of critical thinking. Reflecting on the road ahead, Donoho’s article leads me to the conclusion that data science and data scientists must adopt a new culture that departs from print culture—one whose modes of participation revolve around code and data, and whose critical standards combine those of scientific print culture (with its focus on critical thinking) and digital culture (with its focus on doing).
“Data Science at the Singularity” is an invitation to a vital discussion in statistics and data science. Personally, I believe that within the data science community this is the most important discussion of our generation.
Matan Gavish has no financial or non-financial disclosures to share for this article.
Donoho, D. (2017). 50 years of data science. Journal of Computational and Graphical Statistics, 26(4), 745–766. https://doi.org/10.1080/10618600.2017.1384734
Donoho, D. (2024). Data science at the singularity. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.b91339ef
Donoho, D. L., Maleki, A., Ur Rahman, I., Shahram, M., & Stodden, V. (2008). Reproducible research in computational harmonic analysis. Computing in Science & Engineering, 11(1), 8–18. https://doi.org/10.1109/MCSE.2009.15
©2024 Matan Gavish. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.