
Overcoming Potential Obstacles as We Strive for Frictionless Reproducibility

Published on May 24, 2024

Introduction

Donoho (2024) has written an exciting and powerful piece on the basic ingredients of scientific revolution in the modern world of data science. The concepts he identifies and the consequences of their application bring clarity to why we are enjoying revolutionary advances in so many domains. In addition, the principles Donoho codifies are instructive for identifying deficiencies and bringing stagnant or misguided efforts across the singularity. In this discussion, I call out several questions that I hope will advance our collective understanding and application of the "frictionless reproducibility" (FR) principles Donoho enumerates.

Discussion Topics

What Is Friction?

Donoho identifies three principles of data science that are necessary for a field to join the singularity. However, he is careful to state that the existence of these three components alone is not sufficient: data sharing (FR-1), code sharing (FR-2), and competitive challenges (FR-3) must be available in “mature form for daily practice, as frictionless open services.” The “frictionless” qualifier might be the secret sauce. Donoho offers “human labor” as an example of friction, but I wonder about less obvious and more devious forms. For example, consider a modeling task for a biological system. If fundamental knowledge about a process in the pathway is lacking, time-consuming computational simulations that sample a parameter space may generate a suitable approximation of the mechanism; but if the compute times are prohibitively long, then that modeling step may bog down an otherwise frictionless task. Rather than concentrating effort on acquiring more compute resources or optimizing an algorithm, it may be advantageous to directly pursue research that elucidates a closed-form representation of the mechanism in question. Perhaps this could be considered an example of hidden friction: a knowledge gap in the underlying domain baits an eager scientist into working harder (i.e., accessing more compute resources) rather than working smarter (i.e., filling the knowledge gap).

Is There an Opportunity Cost of Divergence?

If a field does not gather around solutions that support the three principles of FR, then not only does it fail to realize the benefits of frictionless reproducibility, it likely diverges actively away from the objective. For example, as a field takes shape, multiple data formats and incompatible data models are likely to emerge, making federation more complex, if not impossible, as the provenance of essential metadata may be lost if not captured at the outset. Contrast that with a field that quickly unites on a standard and continues to grow, with effort going toward new services rather than being wasted on retroactively patching important legacy services.

Is divergence recoverable? There is a compounding cost to diverging approaches, yet the organic diversity that comes with exploring a new problem space may be essential to finding an optimal strategy. Put another way: divergence may be a necessary cost of creativity and problem solving in the early stages of development, but an impediment in later stages. How can a field of independent workers determine when they have crossed this threshold? And once they have, a phase transition is likely needed, brought about by a single regulatory voice that defines ‘standards’ and ‘best practices.’

As an example, let us consider the emergence of AlphaFold (Jumper et al., 2021) as a widely hailed solution to the ‘protein folding problem.’ Donoho discusses many aspects of AlphaFold and the critical assessment of protein structure prediction (CASP, Kryshtafovych et al., 2021) community challenge contest, which facilitated the scientific advances. I want to consider a different component of this problem space: the underlying molecular structure data. The Protein Data Bank (PDB, Berman et al., 2000) was founded in the 1970s to host structure data for the scientific community. The PDB file format followed the ‘standard’ of 80 characters per line, which itself follows from the physical limits of an IBM punch card format in the 1920s. Each line in a PDB file contains the indexing, coordinates, and metadata of a single atom in a molecular structure. With the explosion of structural biology and the ability to study increasingly complex systems, the data for a single atom often no longer fits within 80 characters. A newer replacement format was developed (Westbrook et al., 2022) and the authors note that “community-driven development of PDBx/mmCIF spans three decades, involving contributions from researchers, software and methods developers in structural sciences, data repository providers, scientific publishers, and professional societies.” In the early 2000s, the PDB became an international organization, the wwPDB (Berman et al., 2003), with member sites around the world. In 2014, the wwPDB adopted PDBx/mmCIF and in 2019, PDBx/mmCIF became the only accepted format for new depositions.
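To make the 80-character constraint concrete, here is a minimal Python sketch of the legacy fixed-column layout. The column offsets follow the published PDB format specification for ATOM records; the atom record itself is fabricated for illustration.

```python
# Build one legacy PDB ATOM record from its fields (fabricated example data).
# Field widths follow the PDB format spec: serial in cols 7-11, name in 13-16,
# residue name in 18-20, chain in 22, residue number in 23-26, x/y/z in 31-54.
fields = ("ATOM", 1, " N  ", "MET", "A", 1, 38.428, 13.104, 6.364, 1.00, 11.99, "N")
line = ("{:<6s}{:>5d} {:<4s} {:>3s} {:1s}{:>4d}    "
        "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2s}").format(*fields)

assert len(line) <= 80  # the punch-card-era constraint

# Parsing is by byte offset, not by delimiter: every consumer must hard-code
# the same column ranges, and each field has a fixed maximum width.
serial  = int(line[6:11])    # 5 digits -> caps out at 99,999 atoms
res_seq = int(line[22:26])   # 4 digits -> caps out at 9,999 residues
x, y, z = (float(line[i:i + 8]) for i in (30, 38, 46))
```

The overflow problem is visible in the slice widths: large complexes exceed the five-digit atom serial and four-digit residue number fields, which is one reason a tag-value, table-oriented format like PDBx/mmCIF (with no fixed line width) was needed.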

Perhaps not coincidentally, AlphaFold emerged in 2018 as a first-time entrant and winner of CASP, contemporaneous with the wwPDB’s move toward the PDBx/mmCIF format, thus providing frictionless access to more than 200,000 structures and millions of computed models, an essential resource for training an AI model. AlphaFold2 was released in 2020, placed first in CASP, and was validated as a solution to the ‘protein folding problem’ (Herzberg & Moult, 2023). In the span of two consecutive CASP competitions, a novel method emerged, went through a single iteration cycle of refinements, and achieved the community standard of accuracy for the open problem.

The Biological Magnetic Resonance Data Bank (BMRB, Hoch et al., 2023) is the de facto community archive for nuclear magnetic resonance (NMR) data, and it faces many of the same data format challenges discussed above for the PDB. The BMRB was founded on the NMR-STAR format (Hall, 1991), which it continues to use (Ulrich et al., 2019). However, there are compelling reasons (Gryk, 2021) to look at alternative data encodings (Schober et al., 2018). It will be interesting to observe how NMR data management evolves and what types of frictionless reproducibility services are spawned.

Is the Singularity FAIR?

The data (FR-1) and re-execution (FR-2) components of Donoho’s data science principles seem to capture some of the same fundamentals as “The FAIR Guiding Principles for Scientific Data Management and Stewardship” (Wilkinson et al., 2016). To what extent are the FAIR (Findable, Accessible, Interoperable, Reusable) principles a necessary ingredient for achieving frictionless reproducibility? One possibility is that FAIR is a bottom-up definition of specific criteria that, taken together, elicit the properties Donoho’s FR principles describe.

I am a co-investigator for the NIH-funded National Center for Biomolecular NMR Data Processing and Analysis (aka NMRbox, Maciejewski et al., 2017), where we deliver a virtual machine as a shared computing platform preloaded with hundreds of software packages used by NMR spectroscopists and the structural biology community at large. The platform is backed by enterprise-class storage, computing, GPUs, and network infrastructure. We are keen on describing the platform as ‘the FAIR principles applied to software,’ in that we encapsulate the complete computing environment (OS, dependencies, and so on) needed to reproduce access to and execution of software. In this case, reproducibility is the objective, so a community challenge (FR-3) is irrelevant. Perhaps reproducibility is encapsulated in FR-1 and FR-2, and the definition of a challenge (FR-3) drives the tools to deliver a specific result.

Is the Power of a Simple Scoring Function Dangerous?

In the ‘leave-one-out’ discussion of a field lacking challenges (FR-3), Donoho rightly calls out the missing competitive element and focus embodied in a community challenge. He notes the significance of “boiling down an entire research contribution essentially to a single number, which can be reproduced” and how that “enables a community of researchers to care intensely about a single defined performance number, and in discussing how it can be improved.”

What if the metric is wrong? What if the subtleties of a complex problem are not amenable to representation by a single scalar? What happens when metrics for locally optimal solutions are apparent, but ones for globally optimal solutions are not? What happens when the community is not (yet) mature enough to rally around a consensus-scoring function? I think it is important to recognize that finding an appropriate scoring function, let alone an objectively best one, is an ongoing task and might evolve as FR-1 and FR-2 provide a deeper understanding of the problem space.

I offer the Nonuniform Sampling Contest (NUScon, Pustovalova et al., 2021) as a motivating example. As Donoho mentions in his article, NUScon operates on the NMRbox platform by providing a sandbox for contestants to evaluate new NMR data-processing workflows and algorithms, where standardized metrics are applied for evaluation and optimization. The NUScon competition has released eight scoring functions, with two now deprecated and the remaining six designed to provide insight into specific aspects of NMR data processing. In fact, different metrics are often paired with different classes of test data. And yes: we could define a weighted sum of all the metrics crossed with all the test data and rally around a ‘single number,’ but that is merely a semantic solution, right? The advantage of multiple metrics is that specific deficiencies can be identified and addressed rather than remaining hidden within an aggregate score.
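The trade-off can be sketched with a toy aggregation. All metric names, values, and weights below are hypothetical, chosen for illustration; they are not NUScon's actual metrics.

```python
# Hypothetical per-metric scores for two candidate workflows; higher is better.
# Metric names, values, and weights are illustrative only, not NUScon's.
scores = {
    "workflow_A": {"peak_recovery": 0.95, "linewidth_error": 0.40, "artifact_level": 0.90},
    "workflow_B": {"peak_recovery": 0.80, "linewidth_error": 0.85, "artifact_level": 0.78},
}
weights = {"peak_recovery": 0.50, "linewidth_error": 0.25, "artifact_level": 0.25}

def aggregate(metrics: dict, weights: dict) -> float:
    """Collapse an ensemble of metrics into a single weighted number."""
    return sum(weights[name] * value for name, value in metrics.items())

totals = {name: aggregate(metrics, weights) for name, metrics in scores.items()}
# The single number ranks B slightly ahead of A, but only the per-metric
# view reveals that workflow_A's specific weakness is linewidth_error.
```

The two workflows land within a hair of each other on the aggregate, yet the per-metric view immediately flags workflow_A's linewidth deficiency, which the weighted sum buries. Any choice of weights bakes in a judgment about which deficiencies matter, which is exactly the consensus a young community may not yet have.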

Conclusion

Frictionless reproducibility is a powerful concept. Donoho makes a compelling case for it being the conduit by which major scientific advances are flowing. The discussion I present here is intended to tease out areas where major research initiatives might stall and how the FR principles might define a path to recovery. I focused on the concept of friction and how it might be hidden (“What is Friction?”), the delayed cost of divergence that accrues while fields are exploring (“Is There an Opportunity Cost of Divergence?”), how the FAIR criteria might help define action items for fledgling projects to tackle with the expectation that it will aid them in achieving essential FR principles (“Is the Singularity FAIR?”), and how an ensemble of metrics may prove beneficial (“Is the Power of a Simple Scoring Function Dangerous?”).


Disclosure Statement

Adam D. Schuyler has received funding from the Miriam and David Donoho Foundation, in support of the NUScon Community Challenge.


References

Berman, H., Henrick, K., & Nakamura, H. (2003). Announcing the Worldwide Protein Data Bank. Nature Structural & Molecular Biology, 10(12), 980. https://doi.org/10.1038/nsb1203-980

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., & Bourne, P. E. (2000). The Protein Data Bank. Nucleic Acids Research, 28(1), 235–242. https://doi.org/10.1093/nar/28.1.235

Donoho, D. (2024). Data Science at the Singularity. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.b91339ef

Gryk, M. R. (2021). Deconstructing the STAR file format. In B. T. Usdin (Ed.), Volume 26: Proceedings of Balisage: The Markup Conference 2021. Mulberry Technologies. https://doi.org/10.4242/BalisageVol26.Gryk01

Hall, S. R. (1991). The STAR file: A new format for electronic data transfer and archiving. Journal of Chemical Information and Computer Sciences, 31(2), 326–333. https://doi.org/10.1021/ci00002a020

Herzberg, O., & Moult, J. (2023). More than just pattern recognition: Prediction of uncommon protein structure features by AI methods. Proceedings of the National Academy of Sciences, 120(28), Article e2221745120. https://doi.org/10.1073/pnas.2221745120

Hoch, J. C., Baskaran, K., Burr, H., Chin, J., Eghbalnia, H. R., Fujiwara, T., Gryk, M. R., Iwata, T., Kojima, C., Kurisu, G., Maziuk, D., Miyanoiri, Y., Wedell, J. R., Wilburn, C., Yao, H., & Yokochi, M. (2023). Biological magnetic resonance data bank. Nucleic Acids Research, 51(D1), D368–D376. https://doi.org/10.1093/nar/gkac1050

Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S. A. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., . . . Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589. https://doi.org/10.1038/s41586-021-03819-2

Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K., & Moult, J. (2021). Critical assessment of methods of protein structure prediction (CASP)—Round XIV. Proteins: Structure, Function, and Bioinformatics, 89(12), 1607–1617. https://doi.org/10.1002/prot.26237

Maciejewski, M. W., Schuyler, A. D., Gryk, M. R., Moraru, I. I., Romero, P. R., Ulrich, E. L., Eghbalnia, H. R., Livny, M., Delaglio, F., & Hoch, J. C. (2017). NMRbox: A resource for biomolecular NMR computation. Biophysical Journal, 112(8), 1529–1534. https://doi.org/10.1016/j.bpj.2017.03.011

Pustovalova, Y., Delaglio, F., Craft, D. L., Arthanari, H., Bax, A., Billeter, M., Bostock, M. J., Dashti, H., Hansen, D. F., Hyberts, S. G., Johnson, B. A., Kazimierczuk, K., Lu, H., Maciejewski, M., Miljenović, T. M., Mobli, M., Nietlispach, D., Orekhov, V., Powers, R., . . . Schuyler, A. D. (2021). NUScon: A community-driven platform for quantitative evaluation of nonuniform sampling in NMR. Magnetic Resonance, 2(2), 843–861. https://doi.org/10.5194/mr-2-843-2021

Schober, D., Jacob, D., Wilson, M., Cruz, J. A., Marcu, A., Grant, J. R., Moing, A., Deborde, C., De Figueiredo, L. F., Haug, K., Rocca-Serra, P., Easton, J., Ebbels, T. M. D., Hao, J., Ludwig, C., Günther, U. L., Rosato, A., Klein, M. S., Lewis, I. A., . . . Neumann, S. (2018). nmrML: A community supported open data standard for the description, storage, and exchange of NMR data. Analytical Chemistry, 90(1), 649–656. https://doi.org/10.1021/acs.analchem.7b02795

Ulrich, E. L., Baskaran, K., Dashti, H., Ioannidis, Y. E., Livny, M., Romero, P. R., Maziuk, D., Wedell, J. R., Yao, H., Eghbalnia, H. R., Hoch, J. C., & Markley, J. L. (2019). NMR-STAR: Comprehensive ontology for representing, archiving and exchanging data from nuclear magnetic resonance spectroscopic experiments. Journal of Biomolecular NMR, 73(1-2), 5–9. https://doi.org/10.1007/s10858-018-0220-3

Westbrook, J. D., Young, J. Y., Shao, C., Feng, Z., Guranovic, V., Lawson, C. L., Vallat, B., Adams, P. D., Berrisford, J. M., Bricogne, G., Diederichs, K., Joosten, R. P., Keller, P., Moriarty, N. W., Sobolev, O. V., Velankar, S., Vonrhein, C., Waterman, D. G., Kurisu, G., . . . Peisach, E. (2022). PDBx/mmCIF ecosystem: Foundational semantic tools for structural biology. Journal of Molecular Biology, 434(11), Article 167599. https://doi.org/10.1016/j.jmb.2022.167599

Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), Article 160018. https://doi.org/10.1038/sdata.2016.18


©2024 Adam D. Schuyler. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
