I am impressed by the number, diversity, and seriousness of the discussions.
I sense general agreement about the data science reality that has been forming over the last decades, some of the larger forces driving it, and the permanent changes to research computing and scientific publishing that will ensue. I also sense concerns and important reservations, maybe not so much about what my article says as about what it does not begin to acknowledge and discuss.
Each discussant makes unique and valuable points about issues exposed by these rapid changes, across a broad range of fields and topics. I can only admire and celebrate these contributions. In this rejoinder I will refer to the original article under discussion using the label [DSatS].
Distinguished neuroscientist Terry Sejnowski (2024) makes topline reference to ‘Data Science as an ecosystem’ and helpfully summarizes the growth of its academic presence at the University of California, San Diego in recent years. It is simply perfect that Sejnowski’s data science institute is named after Taner Halicioğlu, one of the earliest employees of an internet hegemon—and that it has come together in the decade after that hegemon went public.
Sejnowski’s top billing of the ecosystem concept suggests a theme we will use often in this rejoinder: there is a different, and possibly better label than ‘singularity’ for what is happening today, something more biological and ecological. (As contributor Zhiwei Zhu [2024] points out, HDSR editor-in-chief Xiao-Li Meng [2019] has previously called data science an ecosystem, even arguing so at some length in the pages of HDSR. Still, Sejnowski [2024] takes it to the topline with inspiring confidence, and neuroscience-based credibility.)
I think we can all agree that today’s ‘data science ecosystem’ comprises a complex web of entities and relationships. We can also agree that today’s ‘data science ecosystem’ is evolving, changing meaningfully over the span of a year or two. For example, Zhu (2024) explicitly uses the ‘evolution’ concept in referring to what has been happening. Of course, as a celebrated neuroscientist, Sejnowski is the one among us who can use such terms most authoritatively.
I will try to use these ecological and biological concepts to organize parts of the rejoinder, although ultimately other concepts will be needed.
Globally, data science is not a unity. Instead, like our natural world, it is an interlocking collection of evolving webs of entities and relationships, each one involving its own ecological domain. Ecological domains range from small academic research communities, to large industries, to collections of government-regulated actors in an industry (e.g., Food and Drug Administration regulatees), to even a truly global audience.
As our discussion shows, data science as seen by academic statisticians involves CRAN, which establishes a frictionless research exchange for R code and data sets; data science as seen by academics in molecular physics and computational chemistry involves the essential database registries Protein Data Bank and Biological Magnetic Resonance Bank, the compendium of executable code resources at NMRbox, and community challenges for protein folding (CASP) and accelerated nuclear magnetic resonance (NMR) experimentation (NUScon). In the field of optimization, one finds shared code packages, problem instances, and community challenges at Benchopt. The data science ecosystem in today’s machine learning research community includes code resources such as PyTorch and scikit-learn, individual project code shared through GitHub, data resources shared through OpenML, models shared through Hugging Face, experimental settings shared through Weights & Biases, and community challenges such as those hosted by ChaLearn, NeurIPS, and Kaggle.
Such ecological webs continue in research field after research field and industry after industry. I am aware, personally or through the contributors, of data science webs in drug development, oil exploration, computational genomics, and epidemiological research. At the very largest scale, data science resources include open source code projects like Python, open standards–based projects like C++, global data resources like terrain maps and road networks, oceangoing vessel-tracking registries, and weather satellite results. Many individual research communities and industries have shaped themselves around, and depend upon, these global resources.
The ‘links’ in each figurative web are relationships induced by frictionless services. The ‘nodes’ can be service access points (e.g., github.com, openml.org), human users, or bots. Where such services do not exist, or disappear, the webs re-form, and participants adapt to employ other frictionless services; when new services are introduced, individual webs reshape themselves around the new opportunities. The webs are interconnected in a multiscale fashion; small communities may depend in essential ways on resources provided by larger communities or even global resources—that is, the webs are symbiotic.
Clearly, one can go much farther in demonstrating that biological and ecological concepts apply to the world of data science. In fact, they suggest an alternative to the ‘singularity’ narrative. The recent sudden emergence of data science has a clear analog in the history of life on earth. Quoting Wikipedia:
The Cambrian explosion […] is an interval of time approximately 538.8 million years ago in the Cambrian period of the early Paleozoic when there was a sudden radiation of complex life, and practically all major animal phyla started appearing in the fossil record. […] Before early Cambrian diversification, most organisms were relatively simple, composed of individual cells, or small multicellular organisms […] As the rate of diversification subsequently accelerated, the variety of life became much more complex, and began to resemble that of today. (Cambrian explosion, 2024)
This passage notes a relatively rapid transition, the ensuing diversification, and an acceleration in the rate of change. Continuing with the ecological viewpoint, Science Daily, covering the journal article of Eden et al. (2022), reports on an important precursor:
Early animals formed complex ecological communities more than 550 million years ago, setting the evolutionary stage for the Cambrian explosion […] (PLOS, 2022)
Combining the analogy ‘data science community ↔ complex ecological community’ with the scientific conclusion that the emergence of complex ecological communities set the stage for the Cambrian explosion again suggests the idea we discussed in [DSatS], namely that the emergence of data science leads to a singularity. We now recognize that this ‘singularity’ could instead be called ‘data science’s version of the Cambrian explosion.’
Thanks again to Terry Sejnowski for opening this mind-broadening avenue for discussion.
Ecosystems are highly variable. Traversing one parallel of latitude around our planet, we would encounter many different local ecosystems. Some of those systems may be hotspots of evolution while some have been largely static for millennia.
In the same way, traversing science and technology, we come across a wide variety of data science ecosystems. Some of these are still operating on pre-Cambrian principles, where individual investigators, comparable to unicellular organisms, operate independently: they do their computing on their own and do not publish code or data. On the other hand, some communities already have hundreds or thousands of researchers eagerly attacking problems jointly, sharing work and results fluidly; continuing the analogy, these latter communities could be compared to multicellular organisms, to megafauna, or even to societies of megafauna.
Some of our contributors make top-line reference to the unevenness, across data science ecosystems, in adoption of the [FR-1]-[FR-2]-[FR-3]-FRX paradigm, and also demonstrate, even in largely adhering communities, that there are important gaps in awareness, willingness, training, and technology.
In particular, Lorena Barba (2024), Juliana Freire (2024), and Peyman Milanfar (2024)—among the most computation-involved intellectuals on the planet—tell us ‘we are not uniformly there yet,’ even in their computation-soaked fields, and itemize some of these gaps.
Barba (2024) shows that compliance with the best practices idealized in the original article is, in places, shockingly incomplete. To fully accomplish the digital transformation of scholarship now underway will require interventions by funding agencies and by professional and scholarly communities. More training, more education, and more resources will be needed, possibly for years to come. Better technology may still be needed. Her contribution is a valuable and informative essay, itemizing at length much of the work and many of the considerations that others may not yet perceive.
Freire (2024) also points to shortfalls, and makes top-line reference to the possibility that empirical machine learning might turn out to be the only domain where the new paradigm is really scalable. She further inspires us with new (to me) examples of digital research communities, and new technical problems that will have to be solved. She describes a required enabling technology for full adoption of what [DSatS] called CORA (Computation on Research Artifacts). The possibility of CORA has tantalized or mesmerized generations of computational scientists. Possibly Freire, or her computer science colleagues, can seize on her suggestions and create the needed technology.
Milanfar (2024) wisely foresees a lengthy transition period for frictionless reproducibility (FR), during which we will live a hybrid existence, one foot in the ‘pre-FR’ world and one foot in the ‘post-FR’ world. Milanfar, now ensconced at internet hegemon Google, is already much more in the post-FR world than many academics, yet he remembers well the pre-FR world in which his career began. He tells us that the transformation from pre-to-post is both bitter and sweet, as the amazing technical progress evident today in some of the more-FR-adherent, more-nearly-post-FR digital communities upends long-held beliefs and approaches.
I agree with so much in these commentaries!
I just would not underestimate how fast the world is transforming digitally. After a career embedded in the ‘user illusion’ of a stable, almost timeless research ecosystem with gradually improving computing tools, I am trying now to tear myself away from such illusions.
In recent years, it seems that every time I have told myself that it would take too much personal time or energy to pursue some specific ambitious computation, I have soon enough learned that I was very wrong. For almost any ambition I could think of, somewhere, somehow, someone had already had a similar ambition, invented the process to satisfy it, and made it available to everyone. If only I had known! I kick myself to think what was not done, largely because I forgot to heed William Gibson’s observation: the future is already here.
The ecological viewpoint naturally leads to biological thinking. In an ecology with competing types of organisms, which ones will win out? Malthus and Darwin would tell us: those that produce offspring faster and that also adapt faster to the changing environment. They would add that other organisms might not actually disappear, but that a slower rate of increase would make them proportionally less visible among the surviving organisms over time. Research communities embracing the FR principles might be compared to organisms that reproduce faster or adapt faster. Over time, such communities may well become, proportionally, the most visible part of the exponentially growing cast of scientific and technological resources.
In the pre-Cambrian era, unicellular organisms dominated, and today’s fossil record of that period is sparse. With the transition to multicellular organisms, and eventually megafauna, the record displays stunning productivity, with fossils across an astonishing range of multicellular organisms. If we draw the parallel between unicellular organisms and single-investigator, single-computer, not-shared computational science, then we are living through the emergence of the data science analog to ‘multicellular organisms.’ In the 2000s, internet hegemons appeared (at a smaller scale than today), invented the term ‘data science’ as a job search category, and developed an internal social organization of researchers that, Bin Yu (2024) testifies, was far more effective than the academy’s disconnected single-principal-investigator model. She tells us that these organizations were able, thanks to a deep bench of managed talent and super-sized compute resources, to achieve world-historical data engineering and software development advances. In 2014, humans for the first time uploaded more than one trillion images to internet servers. The images were largely deposited into the servers of three internet hegemons.
Such data science advances eventually enabled research projects that academics could not touch. The internet hegemons marched steadily up the stock market leaderboard and today securely dominate the data science and data engineering worlds—like lions in the jungle, the prototypical multicellular hegemons.
Zhu (2024) adopts a historical perspective and points to earlier success stories of corporate organization, such as the credit scoring system developed by early data scientists in response to the advent of large consumer retail stores in the late 19th century. He thereby shows the effectiveness of corporate organization in the pre-history of data science.
Corporations are now-traditional social organizations with thousand-year-old roots. Matan Gavish (2024) and Victoria Stodden (2024) celebrate the Internet’s ability to create 21st-century forms of social organization, specifically ones that can deliver research advances. In particular, frictionless services over the Internet enabled benchmark challenges where (in some cases) thousands of unaffiliated researchers working individually rapidly built workflows that improved on the empirical performance of a leaderboard’s current top-performing workflow. The analogy here is to societies of organisms, like ants, baboons, or humans; the capabilities of the collective are vastly larger than the capabilities of the individual. In fact, the individual ant or baboon cannot even survive alone; and it is questionable whether survival by a human on the proverbial desert island is really survival.
Each of these organizational models has strengths and impressive capabilities. An often-cited example of hegemon industrial research productivity is the development at Google of the Transformer architecture that drives modern large language models. Another is DeepMind’s development of AlphaGo to ‘solve’ the game of Go, which led eventually to many other advances in gameplay. The OpenAI ChatGPT moment of November 2022 had worldwide media and political impact, confirming the powerful capabilities of corporate data science and engineering.
On the other hand, noncorporate approaches have also delivered. The academic ImageNet Challenge ILSVRC 2012 famously drove today’s deep learning revolution (Russakovsky et al., 2015). And challenges continue to deliver, in large and small ways too numerous to inventory. A nice example occurred while this was being written: the Vesuvius Challenge (2024), in which indistinct digital records of carbonized papyri were successfully made legible through a competition for a substantial prize (US$750,000). From a systems viewpoint, the credit goes to the social organization rather than to the individual winners. The winners themselves, attacking the same problem outside the contest setting, probably would never have performed to the level that they did within it.
Having related the ecological and biological imagery to several of our discussants, we now turn to the many specific comments about the FR world that cannot be mapped into that narrative.
Benjamin Recht is a profound mathematical talent, a founder of the beautiful matrix completion literature in applied mathematics and signal processing (Candès & Recht, 2012). Uncharacteristically for a mathphile, he became a pioneer in empirically probing the foundations of machine learning; his groundbreaking empirical work explores how deep learning’s reliance on data sets like ImageNet might, or might not, generalize to other data sets (Recht et al., 2019; Zhang et al., 2021).
Recht is a stylish and passionate writer, whose numerous blog posts include many sparkling gems. Drawing on several of his posts, as well as original historical research in his recent book, Patterns, Predictions, and Actions (with Moritz Hardt), Recht (2024) favors us with a lengthy essay packed with fascinating details about the history of machine learning, but also with up-to-the-minute dispatches from the front lines of today’s conference publications game. By doing so, he graces the pages of HDSR with an essential contribution that may attract many new readers to the HDSR orbit.
This is a particularly important essay for our theme because, as we know, empirical ML is the field where FR has been most completely adopted and where experience with the post-FR world has been accumulating the longest. Recht therefore gives us our clearest view of where much of the scholarly world will sooner or later be going.
Recht’s essay agrees that, at least in empirical ML, FR has indeed been a driving force in recent years. However, the advent of FR has, to Recht, clear downsides. Equipped with the tools of FR, promiscuous publication is becoming the norm. Computer science (CS) PhD students are now completing their studies with 12 papers—a historically unheard-of level of productivity. Recht thinks this is not a good state of affairs (as do I) and offers advice to PhD students that they instead try to have three strong papers. (Of course, his way of saying this is more compelling and stylish than my quick recounting).
Recht is right, and everyone should read his essay. My first reaction on reading his advice was to forward his words to my own CS-adjacent PhD student.
Explicitly or implicitly, Recht points to many themes that arise in the commentaries of our remaining contributors, to whom we now turn.
Brian Wandell (2024) makes a crucial observation: he does not spend much time on the computing activities where friction has been/could be engineered out of his life. The activities that most consume his time involve careful study of the scientific situation, such as the context of the measurements taken in an experiment or the details of the instrument taking the measurements, or how the instrument was operated. Milanfar (2024) also argues for the centrality of human engagement, insight, and understanding.
Wandell (2024) observes that his laboratory’s scientist-trainees expect instead to do a lot of FR-infused computing—that is, to follow the ‘download-baseline-modify-recompute’ pattern so characteristic of modern computing life. Apparently, he counsels them to think more, and think harder, about the meaning of the data and the interpretation of the results.
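To make that pattern concrete, here is a minimal sketch of my own (not Wandell’s workflow); the dataset, model, and hyperparameters are illustrative stand-ins, with a bundled scikit-learn dataset standing in for a shared public benchmark:

```python
# Minimal sketch of the 'download-baseline-modify-recompute' pattern.
# The dataset, model, and hyperparameters here are illustrative stand-ins.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 'Download': obtain shared data (a bundled dataset stands in for a public benchmark).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 'Baseline': fit a reference model and record its score.
baseline = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
print("baseline accuracy:", accuracy_score(y_test, baseline.predict(X_test)))

# 'Modify' and 'recompute': tweak one setting, then re-run the identical evaluation.
variant = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
print("variant accuracy:", accuracy_score(y_test, variant.predict(X_test)))
```

The point of the sketch is how little scientific reflection the loop itself demands: each pass through it produces a new number without requiring any new understanding of the data.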
Milanfar (2024) observes that challenges produce machine learning models with “startling levels of predictive accuracy,” but reminds us that they do not produce researchers capable of interpreting these models.
Wandell (2024) tells us that DeepNet procedures fitting data in this way are not actually explaining; they are netsplaining.
Remarkably, 15 years ago, such lack of understanding of what drives specific predictions would have provoked an outright veto on the use of such procedures, yet today ‘lack of understanding’ is, relatively speaking, almost ignored, especially among younger researchers—a sign of how the success of this paradigm has changed younger researchers’ attitudes.
Several other commentators worry about the temptations posed by the advent of frictionless reproducibility—specifically the likely tendency toward mindless computation and trivial publication. Peter Rousseeuw (2024) explicitly brings up this concern, and I think that many people are silently worried about this, or should be. In particular, Stephen Ruberg (2024) points to the worthlessness of many or most of the entries in one of the InnoCentive challenges (now Wazoku Crowd), despite the gains posted at the top of the leaderboard. Ruberg points to the tendency of many data science–driven searches for clinical biomarkers not to replicate, even when published in leading journals, and implies that part of the problem is the temptation toward “data sets of convenience” or “models of convenience.”
For the many researchers who do not want to be seduced into wasting their time on trivial, mindless computing, Wandell’s (2024) guidance to think more about the context and meaning of the data and instrumentation should be top of mind.
We might criticize the download-tweak-compete cycle promoted by today’s emerging FR paradigm as an overly narrow view of what data science really should involve. But what would be a fuller view?
Yu (2024) explores the need for trustworthy and explainable AI, and goes beyond it with a call for veridical data science (VDS). While some of the FR principles can be said to align with VDS, her VDS principles ask for much, much more out of the data science life cycle.
In a thoughtful essay, she reminds us that data science needs to include not just reproducibility and openness, but also PCS (predictability-computability-stability) principles to obtain interpretability and trustworthiness.
Yu’s book with Rebecca Barter—Veridical Data Science: The Practice of Responsible Data Analysis and Decision Making—will be out in October 2024 (MIT Press) but is already available in prerelease form online. I encourage readers to preorder it, as I have, and to incorporate its lessons in their data science teaching and practice.
We are fortunate to have three discussants who have implemented and administered benchmark challenges.
Ruberg, Joseph Salmon, and Adam Schuyler represent very different fields, illustrating the real breadth of applicability of challenges across the world of research.
They give us priceless testimony about the inside story, in detailed, information-rich essays that I found very rewarding to read and think about.
Ruberg (2024) discusses InnoCentive (now Wazoku), one of the earliest (2001!) challenge platforms to embrace a wide range of challenges, propelled by a research vision at Lilly, an industrial leader in drug discovery. Ruberg describes his experiences with a subgroup identification challenge inspired by proprietary concerns in drug discovery; exceptionally, he got Lilly to pay him to design synthetic data for this challenge and also to put up the reward money. He shows how corporate researchers can tap into the power of challenges engaging outside talent, thus combining the ‘multicellular’ corporate and ‘social’ challenge models in one research project.
Salmon (2024) describes Benchopt, a platform for challenge problems (originally for optimization algorithms) that hosts algorithms, leaderboards, and data sets, and tabulates performance comparisons. He details the inside story of grit, inspiration, and searches for funding, and points to the widespread underrecognition of the need for funding agencies to endorse and fund challenge platforms, and to fund platform maintenance. Clearly, Salmon and collaborators had great drive, talent, and vision to create this platform. I am personally in awe. This deserves real recognition!
As Salmon observes, my attitude is not widely enough shared. The essential role of benchmarking and challenges as the backbone of our new approach to developing data processing algorithms is not yet widely discussed, and the informal and formal mechanisms for professional credit for this sort of ‘platform work’ are not well enough established.
Schuyler (2024) describes the role of data in the larger context of computational physical chemistry and molecular physics, including the Protein Data Bank, which drove recent challenges in protein folding. He raises several intriguing questions about the data challenge era that are both very perceptive and break new ground in the discussion. Most apropos for us, he describes projects in which he is intimately involved, including the Biological Magnetic Resonance Data Bank; NMRbox, a platform for reference algorithms; and NUScon, a platform for challenges to develop better methods for advanced NMR spectroscopy.
Schuyler’s ‘platform work’ is essential; his context gives a clue to how future impact and recognition are likely to arrive. He is part of a team at the University of Connecticut Health Sciences Center acting as the fulcrum of the “Network for Advanced NMR,” a National Science Foundation–funded US$40-million distributed facility with networked, digitally controlled instruments sited at three leading research universities. This new type of software-enabled ‘networked instrument’ has prompted creative thinking about code and data infrastructure, and about research projects that may improve the community’s use of the instrument. It seems to me that such new types of instruments, and the new communities they foster, should lead to new support and professional credit for projects that live in this new digital world.
In the challenge paradigm there is a metric, a leaderboard, and a winner to be declared. We always learn who won, but usually get no information—not even a hint—explaining why the winning entry outperformed other proposed solutions. The cause could be a better architecture, a better set of training parameters, a better initialization, a longer training time, augmented data, noise, or any combination; or even nothing recognizable to us.
The adoption of the challenge paradigm generally gives no explicit weight to the meaning/structure of the winning proposal. In such an environment, we eventually transform psychologically. We drop the requirement that we understand why the winning entry does better than other procedures. Stodden (2024) labels the situation “No one knows why and nobody cares.”
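As a schematic of that mechanism (my illustration, with hypothetical team names, predictions, and metric; not drawn from any actual challenge), the entire public record of a challenge reduces to a metric, a leaderboard, and a declared winner:

```python
# Schematic of the challenge paradigm: a fixed metric, a leaderboard, a winner.
# Team names and predictions are hypothetical; only the mechanism is the point.
import numpy as np

def metric(y_true, y_pred):
    # The challenge metric: accuracy against the hidden test labels.
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

hidden_labels = np.array([0, 1, 1, 0, 1, 0, 1, 1])
submissions = {
    "team_alpha": np.array([0, 1, 1, 0, 1, 1, 1, 1]),
    "team_beta":  np.array([0, 1, 0, 0, 1, 0, 1, 1]),
    "team_gamma": np.array([1, 1, 1, 0, 1, 0, 1, 0]),
}

# Build the leaderboard: score each entry and sort from best to worst.
leaderboard = sorted(
    ((metric(hidden_labels, preds), name) for name, preds in submissions.items()),
    reverse=True,
)
for score, name in leaderboard:
    print(f"{name}: {score:.3f}")

# We always learn who won; the record says nothing about why.
print("winner:", leaderboard[0][1])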
Rousseeuw (2024) spotlights two unpleasant consequences of the challenge lifestyle. A young researcher hoping to do well at such a contest is advised to make a small modification to an existing system and hope for a small boost in performance, rather than to make a more fundamental change. The presumptive goal of small performance boosts leads to short projects requiring no real intellectual investment.
Young researchers simply will not try major innovations, which no doubt start out with spotty performance, but over time, with tweaking and adjustments, might eventually do far better.
A second point made by Rousseeuw is very subtle but crucially important. Today, technological gadgets—like deep learning and the Transformer architecture—get reputations based on breakthroughs in challenges. We may know that, with the help of a technology, a certain team has won a challenge; we then assume that this technology was the reason, and it gradually becomes social gospel that the technology has yielded a benefit. This attaches great prestige to the innovators of the technology.
But we never get to observe the counterfactual. Suppose this technology had not been implemented and adopted: would challenges have reached modern levels of performance without it, by another technological route? Jonathon Phillips of NIST (National Institute of Standards and Technology) presents evidence across decades that face recognition challenges obey a kind of Moore’s law, with error rates improving by a set fraction each year, universally across technologies. The advent of deep learning improved face recognition error rates by no more and no less than the other technology innovations before it had. In a related vein, Ludwig Schmidt has presented evidence that Transformers may not really have been necessary to obtain modern levels of performance with today’s GPUs (Schmidt, 2023), although they might have been important with circa-2017 GPUs; yet everyone celebrates the advent of Transformers in 2017 as if it were still responsible for the modern era.
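As a rough formalization of Phillips’s empirical regularity (my notation and illustration, not his): if the error rate falls by a fixed fraction $r$ each year, regardless of which technology is in fashion, then

$$ e(t) = e(t_0)\,(1 - r)^{\,t - t_0}, $$

a smooth curve in time. Under this description, no single innovation, deep learning included, produces a visible jump above the pre-existing trend.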
Such confounding may make us fundamentally misunderstand the events of the last decade, and therefore misunderstand our path to the future.
I encourage all readers to think carefully about Rousseeuw’s (2024) staggeringly important points.
Before data science, there was the discipline of statistics, which for 300-plus years used probabilistic generative models to study the effect on observed data caused by changes in the parameters of the models. Statisticians had a clear understanding of cause and effect, and it matched the physicist’s understanding. This understanding allowed statisticians to fit parameters that matched the observed data—assuming the generative model was true. If the generative model did not seem to acceptably fit, statisticians would develop better models. Statisticians believed their models; they were true. All was good.
In the new era of data science, there is only data, there is no truth. There is no belief in any generative model. But then, how can there ever be real understanding? Or, at least, an understanding comparable to hard sciences like physics and chemistry?
Keyon Vafa (2024) asks us whether the challenge problems paradigm has a way to determine cause and effect.
In typical instances, challenge problems do not have an underlying notion of a true model; they only have true data. This makes Vafa’s question both apt and essential. If we do not study what is truly driving the data generation process, how can we decide which features of an empirical approximation are true? To really appreciate this, it helps to have a deep understanding of theoretical causal inference. Comparing causal models traditionally means comparing probability densities in high-dimensional spaces with subtly different factorization properties. If we do not have access to a generative model, we cannot really discuss factorization properties.
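To illustrate (a toy example of my own, not Vafa’s): two causal models over variables $(X, Y, Z)$ can factor the very same joint density in different ways,

$$ p_1(x, y, z) = p(z)\, p(x \mid z)\, p(y \mid x, z), \qquad p_2(x, y, z) = p(z)\, p(y \mid z)\, p(x \mid y, z). $$

Both factorizations reproduce the observational data exactly, yet they imply different consequences of intervening on $X$; without access to a generative model, there is no factorization to compare and no purely data-level way to adjudicate between them.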
Vafa points to a way out—challenges based on synthetic data, where the generative model is known. He describes the American Causal Inference Conference (ACIC) challenge as one example. Yet, he sees an obstacle—the difficulty of quantifying performance in the task of causal modeling.
Vafa observes that empirical causal inference techniques do not propagate virally, and advances possible reasons. Possibly, Vafa is describing a field ripe for disruption. Remember that not so long ago, randomized experiments were not common in empirical economics; but at some point, the dam broke, and we now see a profusion of economics experiments, which can now win Nobel Prizes (in former decades, Economics Nobel Prizes emphasized theory).
Ruberg (2024) conducted challenges with synthetic data, although not for causal inference. One such challenge aimed to design subgroup identification procedures, a milestone task along the road to precision medicine. Traditionally, this task would have been considered a job for established statistical theory and methodology. His challenge opened up the task to a much broader public. In a detailed description of how his challenge was set up and conducted, he also describes the synthetic data set creation process and the task performance evaluation. Finally, he describes the actual conduct and outcome of the challenge, and the somewhat mixed results. We urgently need more such work.
As in the causal inference case, statistical theory and methodology challenges have been slow to arise and slow to spread. Possibly, theoretically trained researchers are simply slow to embrace the changes implied by data-first thinking, and so slow to recognize the idea that challenges could play a role in theory and methodology. Again, this situation seems ripe for disruption.
Andrew Gelman (2024) is one of our most perceptive and involved academic statisticians. His blog posts are mandatory reading for a large swath of behavioral scientists and, for me, some have been among the most urgently important posts I have ever read. Many of my favorites concern his efforts to spotlight reproducibility failures in prominent journals, or by prominently ensconced academics.
Gelman brings out the crucial point that computational reproducibility alone is not enough; like several other commentators, he points to the fact that poor thinking may be accompanied by adequately reproducible code and data and benchmark scores. This is aligned with Wandell’s (2024) hope that we will all think much harder about metadata and instrumentation.
I agree; in today’s viral age, the use of scientism to cover up bad thinking and then spread it globally is becoming a huge concern. I will merely point out that computational reproducibility may help us to expose poor thinking much more quickly than we ever could have done without code or data.
Gelman (2024) further offers up the question of whether I actually believe everything I say. He quotes the passage where I wrote, in part, “data science projects making full use of [FR 1], [FR 2] and [FR 3] are truly empirical science” [while those lacking one of these are not].
Gelman’s point is that empirical science has been going a long time, without any of these three items. This is an excellent point!
I agree that, at this moment, many scientists, including Gelman and myself, can recall a time before the FR principles were anywhere to be found. I agree that lots of good empirical science was done without these ingredients. I also want to point out that my article is description, not advocacy. The phrase he calls attention to is actually my description of the new mindset.
In fact, there are many things about the new mindset that trouble me. Nevertheless, to continue to live in this new world, I must be able to understand the mindset. I think my statement is correct, as long as we are clear that it describes where the mindset is headed.
We all get old and will be forgotten. This is also true of former scientific practices. As new empirical science is done using FR principles, it will soon enough eclipse older science done by previous rules—even science that had survived for a very long time in the pre-FR world. Here is an example. The Jeffreys-Bullen (JB) travel-time tables were developed in the 1930s to predict earthquake P-wave travel times as a function of distance (a virtuoso feat of applied Bayesian analysis, by the way). Fifty years later, the JB tables were still heavily used and their creators famous and admired.
Today, frictionlessly available alternatives offer a bit better accuracy and are implemented much more conveniently as runnable code. Who still remembers the JB team, or its masterful achievements? This shows the power of FR to erase the past. I am by no means happy with this. But I am describing it, right here.
An inevitable mainstay of our future research life is looming on the horizon. Someday, every computational research article will be published with an integrated digital record that includes everything needed to understand, reproduce, and build upon that research. In [DSatS] I called this CORA, for Computation on Research Artifacts. As Freire (2024) pointed out, we do not yet know concretely the formats and tools that will make this a reality; but it is coming.
One day, we will be able to query research articles to produce new visualizations or new predictions that the original authors never thought of, either to understand them better or to extend their reach to new audiences or settings. Some recent uses of large language models with scientific articles suggest the desire to do this is more widespread than we knew; but in my view the capability to do this is not yet available, despite tantalizing rumors. A mature capability of this kind will eventually need access to much more than the text of the original articles; in other words, it will need the full access implied by CORA. We do not really have this yet.
How are things done today?
As Gelman (2024) testifies, today’s young scientist-trainees share JupyterLab notebooks to document the work they are doing and to collaborate with colleagues, but older generations do not necessarily go along. In any event, notebooks are not part of scientific publication, which lives, for now, in a world of digitized ‘paper’ augmented with pointers to other digital resources.
As Salmon (2024) testifies, new concepts in scientific publishing have been tried in recent decades, and are being used today; Salmon in particular points to Image Processing On Line (IPOL), which for two decades has served as an outlet for publication of reproducible research developing new methods in image processing.
Experiments like IPOL aside, we still do not have a mature, comprehensive solution to the digital transformation of publications in computational research. But one day, we will.
Stodden (2024) calls the missing new publication model the computable scholarly record. We do not know how this comes together, but Stodden probably knows better than the rest of us, as it has loomed so large in her intellectual life. While [DSatS] describes the beginning and middle of the ongoing digital transformation of computational research, the endgame begins when the outlines of Stodden’s computable scholarly record come into sharp focus.
Operating at a cosmic, visionary level, one discussant stands magnificently above the fray. Gavish (2024) beholds the sweep of technological change from the Middle Ages until today, and relates it to our topic under discussion. For him, FR heralds the advent of computable documents, and this moment is as important to human history as the arrival of Gutenberg’s printing press.
Some readers may find this over the top, but I am pretty sure Gavish has thought this through and, given more time and space, would convince everyone with an elaborately thought-through, persuasive, bulletproof case.
Especially if you hold a critical view of the changes that FR has brought, and will bring, ask yourself: Is this not a true revolution? And if your answer is no, then please consider the contrast that Gavish explores. While manuscripts in the Middle Ages were available to a select few, the Gutenberg press completely changed the situation. Literacy went from being the skill of a few devoted priests to being commonplace, in mere generations.
Is the data science change we see today not comparable in nature? Does it not portend a massive democratization in the use of data science tools and concepts? Will not the spread of these tools and ideas be much faster than the spread of print and literacy? Are not some of our reactions to this ongoing revolution in data science availability akin to reactions of the priesthood to the spread of literacy post-Gutenberg? Will there not be global, human-historical consequences of this availability?
Gavish’s discussion of the (implied) ‘Gutenberg Phase Transition’ is even more cosmic than I make out. Please, let us all read and think about this!
In this rejoinder to the discussants of [DSatS], I spotlight a selected few of the fascinating, urgent, and important issues that they raised. The FR changes that have been driving the new world of data science are not yet complete, and they are not all for the better.
I particularly discussed two alternatives to the ‘singularity’ narrative used in [DSatS]. Both alternatives provide rich examples, analogies, and metaphors for ongoing happenings in data science. They may help us think clearly about the near- and long-term futures.
One, an ecological/biological analogy courtesy of Sejnowski (2024), Zhu (2024), and HDSR editor Meng (2019), allowed us to understand the evolution and spread of data science, and to replace ‘singularity’ in our terminology with ‘Cambrian Explosion,’ the advent of multicellular organisms and social behavior.
Two, a historical analogy courtesy of Gavish (2024), allowed us to replace ‘singularity’ in our terminology with ‘Gutenberg Revolution,’ the advent of widely reproducible information (books).
Each of these analogies helps us to hold in mind the profound nature of the changes happening in the data science world.
I have been profoundly stimulated by engaging the many perceptive comments of the discussants.
One item remains. As Zhu (2024) reminds us, in common parlance the ‘singularity’ refers to the ‘AI singularity.’ While data science is indeed going through a frictionless reproducibility singularity, we should not confuse this with the AI singularity, as the concerns and effects are so different.
David Donoho has no financial or nonfinancial disclosures to share for this article.
Barba, L. (2024). The path to frictionless reproducibility is still under construction. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.d73c0559
Cambrian explosion. (2024, May 31). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Cambrian_explosion&oldid=1226585498
Candès, E., & Recht, B. (2012). Exact matrix completion via convex optimization. Communications of the ACM, 55(6), 111–119. https://doi.org/10.1145/2184319.2184343
Eden, R., Manica, A., & Mitchell, E. G. (2022). Metacommunity analyses show an increase in ecological specialisation throughout the Ediacaran period. PLOS Biology, 20(5), Article e3001289. https://doi.org/10.1371/journal.pbio.3001289
Freire, J. (2024). The singularity in data and computation-driven science: Can it scale beyond machine learning? Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.28953700
Gavish, M. (2024). A familiar, invisible engine is driving the AI revolution. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.f86474d0
Gelman, A. (2024). Hopes and limitations of reproducible statistics and machine learning. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.040d5523
Hardt, M., & Recht, B. (2022). Patterns, predictions, and actions: Foundations of machine learning. Princeton University Press.
Meng, X.-L. (2019). Data science: An artificial ecosystem. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.ba20f892
Milanfar, P. (2024). Data science at the precipice. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.a8d932cc
PLOS. (2022, May 17). First animals developed complex ecosystems before the Cambrian explosion [Press release]. Science Daily. https://www.sciencedaily.com/releases/2022/05/220517151829.htm
Recht, B. (2024). The mechanics of frictionless reproducibility. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.f0f013d4
Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do ImageNet classifiers generalize to ImageNet? In K. Chaudhuri & R. Salakhutdinov (Eds.), Proceedings of the 36th International Conference on Machine Learning (pp. 5389–5400). Proceedings of Machine Learning Research. https://proceedings.mlr.press/v97/recht19a.html
Rousseeuw, P. (2024). An alternate history for machine learning? Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.fe8a9269
Ruberg, S. (2024). Analytics challenges and their challenges. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.c03fd9da
Russakovsky, O., Deng, J., Su, H., et al. (2015). ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115, 211–252. https://doi.org/10.1007/s11263-015-0816-y
Salmon, J. (2024). Collective intelligence and collaborative data science. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.bf3d6b1d
Schmidt, L. (2023). Are transformers necessary? A data centric view on generalization [Talk]. Simons Institute for the Theory of Computing. https://simons.berkeley.edu/talks/ludwig-schmidt-university-washington-2023-08-18
Schuyler, A. D. (2024). Overcoming potential obstacles as we strive for frictionless reproducibility. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.0bd1cfba
Sejnowski, T. J. (2024). Data science is an ecosystem. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.b7e4d3fa
Stodden, V. (2024). On emergent limits to knowledge—Or, how to trust the robot researchers: A pocket guide. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.dcaa63bc
Vafa, K. (2024). Is causal inference compatible with frictionless reproducibility? Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.5782e6b6
Vesuvius Challenge. (2024, February 5). Vesuvius Challenge 2023 Grand Prize awarded: We can read the first scroll! https://scrollprize.org/grandprize
Wandell, B. (2024). Metadata, instrumentation, and netsplaining. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.07d15e5e
Yu, B. (2024). After computational reproducibility: Scientific reproducibility and trustworthy AI. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.ea5e6f9a
Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115. https://doi.org/10.1145/3446776
Zhu, Z. (2024). Rethinking the data science singularity. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.4d0c37ad
©2024 David Donoho. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.