Many of the data scientists I know and admire live in a research world that starts with computer files. Their work launches from these files into interesting and compelling data analyses. David Donoho (2024) takes us on a tour of this world and how it has changed for the better in recent years. Like many of us, Donoho worked for years to create tools for data sharing and reproducible research, and happily the world has changed. Statisticians and computer scientists no longer plead for data: Many publications in premier venues require it! They no longer need to send message after message asking to see the analytical code, only to be turned away: The code is on GitHub! These changes in culture have enabled many people to build on prior work and demonstrate their advances in competitive challenges. As Donoho describes, progress in fields that adopted this model is faster than before. Frictionless reproducibility and the data singularity are here!
I live in a different research world that spans image systems and neuroscience. That world does not start with computer files, but rather with expensive instruments, unique biological tissue, and complex experimental protocols. The work depends on delicate experimental methods that must be followed to acquire quantitative MRI data, or special lenses and sensors that measure light fields. In this world, even instruments from the same vendor with identical model numbers can differ and require extensive calibration. Frictionless reproducibility has not arrived in this world (Gibson, 2003).
I have my world in mind as I respond to Donoho’s enthusiastic post. He is surely right that some fields have benefited and there is a “dramatic acceleration in research progress that happens when a research discipline adopts the new practices and crosses this singularity.” I want to take a deep breath to express some concern for my world, even as I endorse his view that frictionless data and computation sharing has been a very positive advance.
Much of my day is spent contending with a great deal of data friction. It is not because I have not awakened to frictionless reproducibility, but rather because in my world the metadata is often at least as important to understand and analyze as the data itself, and more difficult to capture. What were the instructions given to the subjects in the experiment? How were their eye movements measured? What were the steps taken to calibrate the magnetic field gradient?
The need to check the metadata between laboratory experiments is not simplified just by acquiring more data. Rather, the work requires that the measurements be made with extreme care and analyzed with the metadata front of mind. Sharing the data is a good start, but it is also important to share knowledge of the instruments and protocols and to understand how these factors influence the data. It is not a matter of ‘waking up’ to the value of frictionless data (see his footnote 14). Some work requires a deeper understanding of the metadata, and this requirement slows a field’s entry into Donoho’s singularity.
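To make the metadata point concrete, here is a minimal, hypothetical sketch in Python. The record and field names (SessionMetadata, subject_instructions, eye_tracking_method, gradient_calibration_steps) are illustrative assumptions, not any standard; the sketch only suggests the kind of record that would need to travel with shared measurements, along with a simple completeness check, before reuse is meaningful.

# Hypothetical illustration (not a standard): the kind of metadata that
# must accompany shared measurements before reuse is meaningful.
from dataclasses import dataclass, field, asdict
from typing import List
import json

@dataclass
class SessionMetadata:
    subject_instructions: str = ""             # what the subjects were asked to do
    eye_tracking_method: str = ""              # how eye movements were measured
    gradient_calibration_steps: List[str] = field(default_factory=list)
    scanner_model: str = ""                    # identical model numbers can still differ
    scanner_serial: str = ""

def missing_fields(meta: SessionMetadata) -> List[str]:
    """Return the names of empty metadata fields; an empty list means complete."""
    return [name for name, value in asdict(meta).items() if not value]

meta = SessionMetadata(
    subject_instructions="Fixate the central cross; report contrast changes.",
    eye_tracking_method="video-based tracker, calibrated before each run",
    gradient_calibration_steps=["phantom scan", "fit gradient nonlinearity model"],
)
print(json.dumps(asdict(meta), indent=2))   # metadata shared alongside the data files
print("missing:", missing_fields(meta))     # ['scanner_model', 'scanner_serial']

Even a record this simple makes the point: a downloaded file whose scanner and calibration fields are empty cannot be analyzed with the same confidence as one collected, documented, and checked by the experimenters themselves.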
Building new rigs to measure new things is hard. And often expensive. Sure, it can be hard to download, store, and organize a lot of data; but, my university (the same one as Donoho’s) is committed to making it easier and easier for people to access and analyze large, important data sets. Working with frictionless data will become even easier.
Consequently, I often find myself in conversations with students—or colleagues—who are eager to take the ‘download and analyze’ path for their research. They see the enthusiasm for frictionless data, and for career reasons they need to generate new publications. The frictionless route is an easier path. Relatedly, some researchers propose using conventional instruments to generate large data sets for others to download.
This frictionless data path is not trivial or wrong. But it is different from building new instruments or running highly controlled experiments. And please remember that the data from these new instruments—obtained by specialized, highly controlled experiments—are often what give the frictionless data their value! We must not shift the balance of rewards away from research based on data creation and toward research based on data reuse.
Researchers who are not deeply familiar with the experimental metadata often use the modern tools of deep learning (neural networks) for their research. Their achievements often take the form of a neural net that generalizes better than the current state-of-the-art algorithm. The neural net itself can be quite complex and difficult (impossible?) to understand. It does not have embedded within it the hard-fought and detailed knowledge of the physical or biological systems.
For this reason, many of us—probably most of us—feel that neural networks are not the same as scientific explanations. They are wonderful tools that can improve the performance of instruments and make useful predictions. But ‘netsplaining’ is not the same as a scientific explanation. I fear that the search for scientific explanations will lose material support.
The concerns I express here—metadata, instrumentation, netsplaining—remind me of the recent experience our society has had in a very different area: journalism and news sites. In that case it is the journalists who investigate the news and bring new facts to light. News sites share the news along with endless reanalysis and opinion pieces. Frictionless sharing of the journalists’ work has become an enormous business, powering some of our largest companies. The transformation has taken place in a way that is choking off the funding available to the organizations that provide the news. Could the same thing happen to original scientific inquiry? As we grow the size of frictionless research, can we still preserve resources for focused experiments and new instrumentation?
I do not think of myself as a fearful person, and despite these concerns I look forward to enjoying the dream of a frictionless data research environment—the data science singularity! I write this commentary to remind us of another important point: “In Dreams Begin Responsibilities” (Schwartz, 1937). I hope we keep these responsibilities in mind.
Brian Wandell has no financial or non-financial disclosures to share for this article.
Donoho, D. (2024). Data science at the singularity. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.b91339ef
Gibson, W. (2003, December 4). The future is already here – it’s just not evenly distributed. The Economist.
Schwartz, D. (1937, December). In dreams begin responsibilities. Partisan Review.
©2024 Brian Wandell. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.