The Singularity in Data and Computation-Driven Science: Can It Scale Beyond Machine Learning?

Published on May 24, 2024
This Pub is a Commentary on

Donoho (2024) highlights some of the advances over the past 10 to 15 years that have transformed computation-driven scientific discovery. These advances, he argues, have been driven by three principles introduced by data science that have found broad adoption in science through frictionless open services: data sharing, code sharing, and competitive challenges. Data and code sharing essentially embody the transparency prescribed by the scientific method, while competitive challenges force standardization across efforts (all approaches must fit into a unified structure), which facilitates both the assessment and the reuse of different methods.

Machine learning (ML) is a great example of the benefits brought by these principles. Increasingly, ML papers are accompanied by data and code. For example, Papers With Code, a free and open resource with ML papers, code, and datasets, has accumulated over 126,512 papers with code. The community has not only organized several challenges but also built infrastructure to simplify the creation and management of challenges, such as Codabench, a platform that allows the creation of competitions and benchmarks. Open source libraries such as scikit-learn are available and serve as a means to disseminate methods introduced in the literature, making it possible for these methods to be easily adopted and improved. Machine learning, however, is an area that has not only received substantial investment from both government and industry but has also garnered a large number of commercial users. Major industry players in artificial intelligence (AI) such as Google and Meta have released open source libraries, such as TensorFlow and PyTorch, that have enabled many empirical machine learning advances. The use of ML in commercial settings has also contributed to the development of infrastructure that supports the development and deployment of ML-based solutions. A number of machine learning operations (MLOps) systems ("MLOps: Continuous Delivery and Automation Pipelines in Machine Learning," 2023; Oladele, 2024) are available that implement many of the features required for collaboration, transparency, and reproducibility that the scientific community has been discussing over the years, including provenance capture and experiment management.

The question is whether and how we can attain the same success in less popular niche fields that have neither the commercial value of ML nor a comparable number of users.

From MLOps to ScienceOps

While in principle I agree that “there is no essential obstacle to turning them into everyday habits, except the interest and diligence of participating researchers,” in practice there are still barriers: we still lack the necessary infrastructure to support science operations (ScienceOps). There is no question that we have made huge strides in building software and infrastructure that support the open-science principles of transparency and reproducibility. With tools ranging from virtual machines and experiment packagers to notebooks, it is much easier today to work in a reproducible fashion. Unfortunately, there are still many gaps. Many tools used in scientific discovery lack provenance capture, and even tools that support reproducibility have limitations; for example, packagers such as ReproZip (Chirigati et al., 2016) that capture dependencies are often limited to specific operating systems and computational environments. As a result, it is difficult to systematically capture all the different steps executed across tools and computational environments, as well as their dependencies, and manually tracking these is both time-consuming and error-prone.
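
To make the idea of provenance capture concrete, the following is a minimal sketch in Python using only the standard library. It is not how ReproZip or any MLOps system works; the helper names (`fingerprint`, `run_with_provenance`), the toy step `normalize`, and the fields of the provenance record are assumptions chosen for illustration.

```python
import hashlib
import json
import platform
import sys
import time
from typing import Any, Callable


def fingerprint(value: Any) -> str:
    """Return a short SHA-256 digest of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True, default=repr).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def run_with_provenance(step: Callable[..., Any], log: list, **params: Any) -> Any:
    """Run one pipeline step and append a provenance record to `log`."""
    started = time.time()
    result = step(**params)
    log.append({
        "step": step.__name__,                 # which computation ran
        "params": params,                      # with which inputs
        "input_digest": fingerprint(params),   # compact identity of the inputs
        "output_digest": fingerprint(result),  # compact identity of the output
        "python": sys.version.split()[0],      # part of the environment
        "platform": platform.platform(),
        "seconds": round(time.time() - started, 6),
    })
    return result


def normalize(values: list, scale: float) -> list:
    """Toy analysis step (hypothetical): divide every value by `scale`."""
    return [v / scale for v in values]


provenance_log: list = []
scaled = run_with_provenance(normalize, provenance_log, values=[2.0, 4.0], scale=2.0)
print(scaled)
print(json.dumps(provenance_log[0], indent=2))
```

Even this toy version records only what happens inside a single Python process. Steps executed in other tools, on other machines, or through external services would escape it, which is exactly the gap described above.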

We need a concerted effort and investment to build general and interoperable tools and infrastructure that make transparency and reproducibility for computations frictionless: support for transparency and reproducibility should be built into computational environments. Only with such support will we be able to have reproducibility for all scientific disciplines. We note that one of the recommendations (Recommendation 6-3) in the NASEM consensus report on Reproducibility and Replicability in Science (National Academies of Sciences, Engineering, and Medicine, 2019) is that “Funding agencies and organizations should consider investing in research and development of open-source, usable tools and infrastructure that support reproducibility for a broad range of studies across different domains in a seamless fashion.”

Challenges, Replication Studies, and Abstraction

Challenges and challenge problem platforms such as Codabench, used by the ML community, provide an ideal environment to perform replication studies at scale: you fix a problem and a dataset, and then evaluate many different approaches to the problem. There are other models to incentivize replication. In the database community, the Proceedings of the VLDB Endowment invites the submission of Experiment, Analysis & Benchmark Papers, which include papers that present new benchmarks and experimental surveys that “compare multiple existing solutions (including open source solutions) to a problem and, through extensive experiments, provide a comprehensive perspective on their strengths and weaknesses” (Proceedings of the VLDB Endowment, n.d.).

An important component of challenges and experimental surveys is the creation of an abstraction of a problem and the definition of a common frame for different approaches to the problem. This is required to compare different methods and to provide building blocks that better support Donoho's (2024) vision for “computing on digital research artifacts created by previous research computing [...] CORA” (emphasis in original). The challenge that remains is how to broaden the adoption of these practices in different disciplines.
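
The notion of a common frame can be illustrated with a small Python sketch: every approach must implement the same interface, and all approaches are scored on the same fixed dataset with the same metric. This is not Codabench's API; the `Submission` type, the `run_challenge` function, and the toy numbers are assumptions made up for illustration.

```python
from statistics import mean, median
from typing import Callable, Dict, List, Tuple

# The common frame (hypothetical): a submission maps training targets
# to a single constant prediction.
Submission = Callable[[List[float]], float]

# Fixed toy dataset shared by every entry (made-up numbers).
TRAIN = [1.0, 2.0, 3.0, 10.0]
TEST = [2.0, 3.0, 4.0]


def mean_absolute_error(prediction: float, targets: List[float]) -> float:
    """Shared metric: average absolute difference to the held-out targets."""
    return mean([abs(prediction - t) for t in targets])


def run_challenge(submissions: Dict[str, Submission]) -> List[Tuple[str, float]]:
    """Score every submission on the same data and rank by error (lower is better)."""
    scores = [
        (name, mean_absolute_error(fit(TRAIN), TEST))
        for name, fit in submissions.items()
    ]
    return sorted(scores, key=lambda item: item[1])


leaderboard = run_challenge({"mean": mean, "median": median})
for name, error in leaderboard:
    print(f"{name}: {error:.3f}")
```

Because the interface, data, and metric are fixed, adding a new approach requires only one more dictionary entry, and every entry is automatically comparable with, and reusable alongside, the others.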

Reproducibility and Replicability as a Means to Build Trust

Being able to reproduce an experiment with a single click is an admirable goal, but can we trust the results? The ability to reproduce or replicate an experiment does not imply correctness; however, it makes it possible to explore and better understand the experiment and its results. While for some experiments it is difficult to vary parameters (e.g., patient cohorts in clinical trials), it is easy to re-run a computational process using many different parameter configurations. Coupled with the abundance of computing power, this opens the opportunity for new semi-automated approaches to explain and debug experiments represented as computational pipelines (e.g., the BugDoc system [Lourenço et al., 2023]). This is an incipient area of research in computer science with the potential to both streamline scientific exploration and proactively identify problems in computational experiments.
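
As a rough illustration of re-running a pipeline under many parameter configurations, the Python sketch below exhaustively sweeps a small parameter space and reports the parameter values shared by every failing run. It is a brute-force heuristic for illustration, not the BugDoc algorithm; the toy pipeline, its parameters, and its failure condition are all hypothetical.

```python
from itertools import product
from typing import Any, Callable, Dict, List


def toy_pipeline(solver: str, tolerance: float, normalize: bool) -> bool:
    """Hypothetical pipeline: returns True on success.

    It is rigged to fail whenever the 'fast' solver runs on unnormalized data.
    """
    return not (solver == "fast" and not normalize)


def sweep(pipeline: Callable[..., bool], space: Dict[str, List[Any]]) -> List[dict]:
    """Run the pipeline once for every combination of parameter values."""
    names = list(space)
    runs = []
    for combo in product(*(space[name] for name in names)):
        config = dict(zip(names, combo))
        runs.append({"config": config, "ok": pipeline(**config)})
    return runs


def shared_by_all_failures(runs: List[dict]) -> Dict[str, Any]:
    """Return the parameter values that every failing run has in common."""
    failures = [run["config"] for run in runs if not run["ok"]]
    if not failures:
        return {}
    first = failures[0]
    return {
        name: value
        for name, value in first.items()
        if all(config[name] == value for config in failures)
    }


space = {
    "solver": ["fast", "robust"],
    "tolerance": [1e-3, 1e-6],
    "normalize": [True, False],
}
runs = sweep(toy_pipeline, space)
suspects = shared_by_all_failures(runs)
print(suspects)
```

Here the sweep executes eight configurations, two of which fail, and the shared values point to the combination of `solver="fast"` and `normalize=False`. An exhaustive sweep grows exponentially with the number of parameters, and intersecting failures can mislead when there are several independent causes, which is part of why smarter, semi-automated strategies are an active research topic.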

We have many reasons to be optimistic. We have made substantial progress, from research data repositories and code execution environments to challenge problem platforms, and we have also witnessed disciplines such as machine learning that have crossed the singularity. But there is no room for complacency: we need a concerted effort to attack the remaining challenges so that we can achieve frictionless science. The vision articulated by Donoho and its three components (FR-1: Data, FR-2: Re-execution, FR-3: Challenges) provides a valuable framework for designing approaches tailored to different disciplines.


Acknowledgments

Freire’s research is partially supported by the DARPA D3M and ASKEM programs, and NSF awards ISS-2106888 and CMMI-2146306.

Disclosure Statement

Juliana Freire has no financial or non-financial disclosures to share for this article.


References

Chirigati, F., Rampin, R., Shasha, D., & Freire, J. (2016). ReproZip: Computational reproducibility with ease. In Proceedings of the 2016 International Conference on Management of Data (pp. 2085–2088). Association for Computing Machinery. https://doi.org/10.1145/2882903.2899401

Donoho, D. (2024). Data science at the singularity. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.b91339ef

Lourenço, R., Freire, J., Simon, E., Weber, G., & Shasha, D. E. (2023). BugDoc: Iterative debugging and explanation of pipeline executions. The VLDB Journal, 32(2), 473. https://doi.org/10.1007/S00778-022-00751-3

MLOps: Continuous delivery and automation pipelines in machine learning. (2023, May 18). Cloud Architecture Center. https://cloud.google.com/architecture/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning

National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and replicability in science. The National Academies Press. https://doi.org/10.17226/25303

Oladele, S. (2024, March 19). MLOps landscape in 2024: Top tools and platforms. Neptune. https://neptune.ai/blog/mlops-tools-platforms-landscape

Proceedings of the VLDB Endowment. (n.d.). PVLDB Volume 18 — Contributions. Retrieved April 24, 2024, from https://vldb.org/pvldb/volumes/18/contributions


©2024 Juliana Freire. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
