The reproducibility of research results is the basic requirement for the reliability of scientific discovery, yet it is hard to achieve. Whereas funding agencies, scientific journals, and professional societies are developing guidelines, requirements, and incentives, and researchers are developing tools and processes, the role of a university in promoting and enabling reproducible research has been unclear. In this report, we describe the Reproducibility Challenge that we organized at the University of Michigan to promote reproducible research in data science and Artificial Intelligence (AI). We observed that most researchers focused on nuts-and-bolts reproducibility issues relevant to their own research. Many teams were building their own reproducibility protocols and software for lack of available off-the-shelf options; had such options been available, they would have preferred to adopt them rather than build anew. We argue that universities—their data science centers and research support units—have a critical role to play in promoting “actionable reproducibility” (Goeva et al., 2020) through creating and validating tools and processes, and subsequently enabling and encouraging their adoption.
Keywords: data science, reproducibility, processes and tools, generalization, the role of universities
Validating scientific discoveries requires that others understand exactly what was done with respect to data collection, curation, and analysis, and be able to repeat the process, in order to confirm the findings and build upon them in pursuit of future knowledge. This may be loosely termed ‘research reproducibility,’ and we will discuss its definition more extensively in the next section.
Despite the centrality of reproducibility to science, the production of reproducible science has often proven elusive. In recent decades, concerns over the widespread irreproducibility of research results have brought forth a much-heightened awareness of this problem. We will not extensively present the case, but will briefly refresh the reader’s memory. In a survey of more than 1,500 scientists, Baker (2016) found that over 70% have tried and failed to reproduce another scientist's research outcomes, and more than half have failed to reproduce their own results. An extensive body of literature documents a plethora of published irreproducible results, analyzes underlying causes, and provides recommendations for researchers, academic institutions, funding agencies, and scientific journals (examples include Aczel et al., 2020; Alter & Gonzalez, 2018; Boulbes et al., 2018; Brito et al., 2020; Kitzes et al., 2017; Madduri et al., 2019; Munafò et al., 2017; Nosek et al., 2015, 2021; Waller & Miller, 2016).
Ensuring that data science research results can be reliably reproduced is particularly challenging, especially as data science methods are broadly adopted across a wide range of disciplines and data science projects involve increasingly complex data and pipelines (Stodden, 2020; Waller & Miller, 2016). Computational environments may vary drastically and can change over time, rendering code unable to run; specialized workflows might require specialized infrastructure not easily available; projects might involve restricted data that cannot be directly shared; the robustness of algorithmic decisions and parameter selections varies widely; and crucial steps in data collection (e.g., wrangling, cleaning, missingness mitigation strategies, preprocessing) may involve choices invisible to others. In addition, when data science methodology is applied to vastly different research domains, the reproducibility of results hinges on domain-specific factors as well as components of the data science methodology.
Many funders, journals, and professional societies have already developed and adopted guidelines and verification methods, promoted open science, and encouraged reproducible research through both incentives and stringent requirements (for examples, see Alberts et al., 2015; Collins & Tabak, 2014; Frery et al., 2020; McNutt, 2014; also see SIGMOD 2022 Availability & Reproducibility). However, many researchers have not yet adopted best practices to ensure reproducibility; and simply requiring data and code sharing is not sufficient. A few examples illustrate the reality. Laurinavichyute et al. (2022) were able to reproduce the results of only 34% to 56% of the 59 papers that they chose to inspect from the Journal of Memory and Language, after the journal implemented an open data and code policy. Hardwicke and colleagues (2021) reported that they were only able to reproduce the results of 15 of 25 papers in Psychological Science that received open data badges. Stodden et al. (2018) found that only 26% of the articles in their sample (from the journal Science) were computationally reproducible with various levels of effort and expertise required.
Universities play a critical role in supporting, incentivizing, and regulating research at their institutions, and are crucial in ensuring responsible conduct of research, implementing safety protocols, managing conflicts of interest, and so on. They are well-positioned to play an active or even leading role to provide resources and incentives for making research more reproducible. However, generally speaking, universities have not taken systematic approaches to do so. At the Michigan Institute for Data Science (MIDAS), we have recently started exploring the role of a university and its research institutes in this space.
MIDAS can be regarded as a typical research-focused data science institute at a research university. As the focal point of data science and AI at the University of Michigan, its central goal is to enable the transformative use of data science and AI methods for both scientific and societal impact. Not only does MIDAS focus on promoting and developing best practices for ethical and reproducible research as part of its core mission, but it applies this mission to data science and AI research across an enormous array of disciplines with wildly different epistemological approaches and data use practices. The lessons we learn through our exploration are therefore likely to be informative for other academic data science institutes and their universities.
Naturally, one critical question is where we can have the greatest impact. A necessary first step to this end is to understand the priorities of our researchers and the most significant gaps. The gaps could stem from poor understanding of the conceptual and practical factors that prevent reproducible results, the lack of tried-and-true workflows and tools to ensure reproducibility, the paucity of clear guides for how information should be stored and linked, missing standards for how to deal with key types of restricted data, or a failure to properly incentivize researchers’ efforts to ensure reproducible research. Each of these gaps would require different types of support from universities and academic research institutes. To understand how researchers thought about problems around reproducible research and what solutions they could offer, we designed a data science Reproducibility Challenge, which was, to our knowledge, the first such event in a university. We used this as a bottom-up approach that would allow us to understand the priorities of our researchers, identify gaps, flag hurdles, and plan for support and resources.
Very early in the process of designing the Reproducibility Challenge, deliberations among the interdisciplinary planning committee members revealed that ‘reproducibility’ can mean, or at least imply, very different things for different researchers. These discussions often considered a number of related concepts, including reproducibility, replicability, reliability, and validation, among others. As such, our discussions represented the typical situation when researchers discuss reproducibility and practical approaches for improvement. While we believed that a definition was essential for defining the scope of the Reproducibility Challenge and for providing a common language for meaningful interaction among researchers, we also realized, as can be seen below, that in practice, separating related concepts such as reproducibility and replicability is often neither feasible nor desirable.
A 2019 report from the National Academies of Sciences, Engineering, and Medicine (NASEM) defined “reproducibility to mean computational reproducibility – obtaining consistent computational results using the same input data, computational steps, methods, and code, and conditions of analysis; and replicability to mean obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data” (p. 1). This definition allows a clean separation of two concepts: computational reproducibility serves as a benchmark for ‘the reliability of the implementation’ of a particular data-gathering, prepping, and analytical workflow; and replicability refers to ‘the robustness of a scientific conclusion’ about a specific research question, given variations in data, analytical methods, and human decisions in the workflow. As Meng (2020) argued extensively, a lack of reproducibility indicates errors in a study that need to be corrected, but simply reproducing the exact analytical process is by itself unable to inform the validity of the scientific finding that results from the process. A lack of replicability, on the other hand, might or might not indicate errors in research studies.
However, in practice, the boundary between reproducibility and replicability is often blurred. It is almost never possible to reproduce the exact same computational environment. Even with complete transparency of data, code, and workflow, the computing environment will differ, even the random seeds in the code may differ and produce differences in numerical results, and there will likely be subtle differences in human decisions in the workflow. Furthermore, researchers’ effort to make their findings reproducible is often part of the larger effort to make their research findings more robust and reliable. In other words, all the efforts toward reproducible research serve the larger goal of making scientific discoveries more reliable through minimizing human errors and controllable variations in research outcomes. We can categorize variations in research outcomes into four forms: 1) inherent properties of a phenomenon that are reflected in different data sets / observations and are uncontrollable by the researchers; 2) experimental or analytical limits, such as limits of the instruments for measurements and data collection; 3) controllable variations across experiments and analyses (such as different versions of software, different batches of enzymes, different animal housing conditions, or different statistical assumptions); and 4) human errors or inadequate sharing of research information. In this sense, computational reproducibility addresses 4), whereas researchers’ efforts to make their findings more robust and reliable often address 3) and 4), and sometimes 2).
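To make the point about random seeds concrete, the following minimal Python sketch (our own illustration, not drawn from any submission) shows how an unseeded stochastic computation, here a bootstrap confidence interval, returns slightly different numbers on every run, while fixing the seed makes the result exactly repeatable:

```python
import numpy as np

def bootstrap_mean_ci(x, n_boot=1000, seed=None):
    """Bootstrap 95% interval for the mean; results vary run to run unless a seed is fixed."""
    rng = np.random.default_rng(seed)
    boot_means = [rng.choice(x, size=len(x), replace=True).mean() for _ in range(n_boot)]
    return np.percentile(boot_means, [2.5, 97.5])

data = np.array([4.1, 3.8, 5.2, 4.7, 4.0, 5.1, 3.9, 4.4])

# Unseeded: two runs generally give slightly different interval endpoints.
print(bootstrap_mean_ci(data))
print(bootstrap_mean_ci(data))

# Seeded: both calls give identical output, so a reader re-running the script
# with the same seed reproduces the reported interval exactly.
print(bootstrap_mean_ci(data, seed=2022))
print(bootstrap_mean_ci(data, seed=2022))
```

Recording the seed alongside the code and data is a small but concrete step toward computational reproducibility, even though it does not address the other forms of variation listed above.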
In order to capture a wide range of practices to make data science research outcomes more reliable, we used a ‘functional definition’ of data science reproducibility to guide the design of the Reproducibility Challenge: ‘Obtaining consistent results in multiple studies conducted independently, using robust scientific measurements, rigorous study design and data analysis, and with transparency in data, documentation, and code sharing that allows full confirmation.’ Fruitful efforts in improving any of these goals were regarded as improvements in reproducibility.
We invited research teams across the University of Michigan who make significant use of data or data science methods to submit reports that used their research projects to demonstrate their approaches and tools for reproducibility. We used the following categories in the call for submissions:
An illustration of a definition of reproducibility for at least one application of data science;
Metadata with sufficient transparency to allow full understanding of how the data collection, processing, and computational workflows or code resulted in a study’s findings;
An analysis workflow that can be reproduced by others, even with different hardware or software;
A thorough description of key assumptions, parameters, and algorithmic choices in the experimental or computational methods, so that others can test the robustness and generalizability of such choices;
Procedures or tools that other researchers can adopt to improve data and code transparency and analysis workflows, and to test the sensitivity of research findings to variations in data and in human decisions.
We deliberately created broad categories to encourage a wide range of submissions so that we could assess the needs of our researchers and where they invest their effort. In the end, we categorized the submissions somewhat differently, based on the focus of the submitted reports (see Section 5 for a summary and examples).
Our focus was on common data, analytics, and computational features across research domains, not on experimental or analytical features that are specific to a narrow research domain. We hoped to identify practices that could provide lessons for the broad data science research community. This was based on our understanding that developing reproducibility best practices that can be adopted across domains, or even across research groups in the same domain, poses major challenges. Reproducing a data science project often includes both specific considerations within each discipline and general considerations across disciplines. Discipline-specific considerations include measurement errors, study design, data collection and interpretation issues unique to each discipline, such as sampling methods in survey research or batch effects in high-throughput genomic research. Considerations shared across disciplines include issues such as variations of the computing environment, specialized workflows, restricted data that cannot be directly shared, the varied level of robustness of algorithmic decisions and parameter selections, and invisible choices for crucial steps in data collection and preparation. However, researchers’ efforts to create tools and processes are often limited to reproducing results for their own projects, with little consideration for whether these tools can be adopted by others outside their immediate research field, or even by others in the same field. The lack of tools and processes ready for use by data scientists beyond one’s own research group and immediate collaborators could be a significant barrier to reproducible research. The consensus of the planning committee, therefore, was that, even though making a specific data science project reproducible could depend heavily on the specific content of the research, it is reasonable and feasible to seek components of best practices that are applicable broadly.
This section may be particularly useful for those who are interested in planning similar events.
In Stage 1 of the Challenge, we received 22 submissions that involved 59 University of Michigan researchers (faculty, staff, and trainees; Table 1). Most of the submitted projects also involved collaborators at other organizations.
Research Unit | No. of Researchers Entering the Challenge |
---|---|
Department of Internal Medicine | 7 |
Department of Biostatistics | 6 |
Institute for Social Research | 6 |
Department of Chemical Engineering | 5 |
Department of Surgery | 4 |
Department of Mechanical Engineering | 3 |
Department of Psychiatry | 3 |
School of Information | 3 |
Department of Astronomy | 2 |
Department of Cell and Developmental Biology | 2 |
Department of Computational Medicine and Bioinformatics | 2 |
Department of Epidemiology | 2 |
Department of Neurology | 2 |
Department of Pathology | 2 |
Institute for Healthcare Policy and Innovation | 2 |
Department of Anesthesiology | 1 |
Department of Computer Science and Engineering | 1 |
Department of Emergency Medicine | 1 |
Department of Health Management and Policy | 1 |
Department of Statistics | 1 |
Ross School of Business | 1 |
Unit for Laboratory Animal Medicine | 1 |
University libraries | 1 |
Total | 59 |
Note. Many researchers have appointments in more than one unit. We list only their primary affiliations. The Reproducibility Challenge was at the Michigan Institute for Data Science (MIDAS).
Research domains reflected in these submissions include astronomy, bioinformatics, biology, clinical research, computer science, education research, energy research, epidemiology, genomics, healthcare, materials science, operations research, psychology, and survey methodology. This wide range is an indicator of both the importance of reproducibility for domains that employ data science methodology, and the difficulty of generalizing reproducible practices.
In Stage 2, an interdisciplinary judging panel assessed the submissions based on four criteria: the clarity and thoroughness of the report; its potential as an example for others to follow; the ease and accuracy with which the results described in the report could be reproduced (if applicable); and the broader impact of the work toward addressing reproducibility challenges. We did not develop a rubric. Because this was the first such Challenge, we could not anticipate what reports we would receive or how we would best evaluate them; thus we had little on which to base a rubric. Fortunately, because of the manageable number of submissions, the judges were able to discuss each submission thoroughly, evaluate each holistically, make decisions (award, honorable mention, or neither), and provide extensive comments to each submitting team. It was especially helpful that the judges included researchers from the social sciences, natural sciences, medicine, and engineering, which enabled them to evaluate the diverse range of submissions and draw connections with their own research. Their experience will help us develop a rubric in the future when we organize similar events.
Stage 3 was a virtual Reproducibility Showcase over three months. Eleven teams selected by the judges each gave online presentations or tutorials about their approaches to reproducibility. Video recordings and slides of these presentations were made freely available online. The virtual showcase was not part of the original plan, but rather a response to COVID-related delays in the judging process and award ceremony. However, adding the showcase turned out to provide a much-appreciated channel for the teams to share their experience with the research community.
Stage 4 was a Reproducibility Day in the fall of 2020. The judges selected four submissions as winners and four honorable mentions. These teams gave presentations during the event.
For Stage 5, based on the submitted projects, we built an online collection of tools and processes. To give two examples, our collection includes: 1) a procedure in which an ‘independent’ data analyst (someone who is not on the study team) carries out a code review to examine statistical assumptions, study design, and the analyses; and 2) a step-by-step guide for using Docker and GitHub to share the complete computational workflow and code. (Docker is a platform for building standalone and easy-to-use software packages—called ‘containers’—that include all components needed to run an analysis or a set of analyses. GitHub is a cloud-based platform where users store, manage, and develop software, code, and workflows.)
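The step-by-step guide itself centers on Docker and GitHub; as a complementary illustration of the underlying idea of archiving the computational environment, the minimal Python sketch below (our own, not taken from the guide; the file name environment.json is hypothetical) records a manifest of the interpreter, platform, installed packages, and code version alongside the results:

```python
import json
import platform
import subprocess
import sys
from importlib import metadata

def environment_manifest():
    """Collect a minimal description of the computational environment."""
    manifest = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {d.metadata["Name"]: d.version for d in metadata.distributions()},
    }
    try:
        # Record the exact code version if the analysis lives in a Git repository.
        manifest["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        manifest["git_commit"] = None
    return manifest

if __name__ == "__main__":
    # Archive this manifest together with the analysis outputs.
    with open("environment.json", "w") as f:
        json.dump(environment_manifest(), f, indent=2)
```

A container image pins the environment fully; even without one, a manifest like this lets another researcher see exactly which software versions produced a given result.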
However, most such tools were developed within a narrow research domain and there are large variations in how easily they can be adopted by other researchers, within the same research domain or across domains.
We also note that researchers who entered the Challenge included those at different career stages and tracks, and all groups were represented in the projects that were selected as winners and honorable mentions, in projects presented in the showcase, and in projects included in the online resource collection (Table 2).
 | Junior Faculty | Senior Faculty | Staff Scientists | Postdocs and Graduate Students |
---|---|---|---|---|
All submissions | 13 | 22 | 10 | 14 |
Winners and Honorable Mentions | 5 | 13 | 3 | 7 |
Projects featured in the Showcase | 6 | 10 | 2 | 8 |
Projects included in the online resource collection | 12 | 19 | 8 | 9 |
Note. The Reproducibility Challenge was at the Michigan Institute for Data Science (MIDAS).
A main goal of the Reproducibility Challenge was to discover what dimensions of reproducibility elicited researchers’ investments of time and effort. We saw this as an indicator of the issues, challenges, and solutions that were most relevant to researchers. A principal observation was that researchers were not focusing on the ‘why’ and ‘what’ of reproducible research; instead, they focused on the ‘how’—the practical approaches for reproducibility. While this could have been a function of how we posed the scope of the Challenge or the expectations submitters had about what the reviewers would focus on, it was still telling that the submissions were so clearly pragmatic. We categorized the submitted reports into five themes, defined as follows (see Table 3 for details on the categorization of the reports):
Title of the report | Data type or research domain | Themes |
---|---|---|
A tutorial on propensity score based methods for the analysis of medical insurance claims data | Medical insurance claims data | 2 |
American Economic Association (AEA) Data & Code Repository at openICPSR | General | 3 |
Assessing the reproducibility of high-throughput experiments with a Bayesian hierarchical model *** | Bioinformatics | 1, 5 |
Automatic capture of data transformations to improve metadata *** | General | 2, 3 |
Automatic Transformation and Integration to Improve Visualization and Discovery of Latent Effects in Imaging Data ** | Imaging data | 3 |
BioContainers: an open-source and community-driven framework for software standardization | Biomedical sciences | 3 |
Codifying tacit knowledge in functions using R | General | 2, 3 |
Complete documentation and sharing of data and analysis with the example of a micro-randomized trial * | General | 2, 3 |
dtangle: accurate and robust cell type deconvolution ** | Biomedical sciences | 3 |
Effective communication for reproducible research | General | 3, 4 |
Embedding the computational pipeline in the publication | General | 3 |
Heuristics for optimization problems with validation across instances | General | 2, 5 |
Multi-informatic Cellular Visualization | Biomedical sciences | 3 |
Principles and tools for developing standardized and interoperable ontologies *** | Biomedical sciences | 2 |
Replicating predictive models in scale for research on MOOC *** | Education | 3, 4, 5 |
Rigorous code review for better code release * | General | 3 |
The classification permutation test: A flexible approach to testing for covariate imbalance in observational studies ** | Biomedical sciences | 3 |
Transparent, reproducible and extensible data generation and analysis for simulation in materials science * | General | 2, 3 |
Unifying initial conditions of galaxy formation simulation for research replication | Astronomy | 2 |
* Indicates projects that were selected as winners of the Reproducibility Challenge at the Michigan Institute for Data Science (MIDAS).
** Indicates three projects that were grouped together as one winner.
*** Indicates honorable mentions.
Column 2 indicates whether a report focuses on a specific data type or research domain, or whether the reported approach can be of value to the broad scientific community. Column 3 lists the themes of the submitted reports: 1. Definition and quantification of reproducibility. 2. Reducing variations in study design and standardizing initial conditions. 3. Comprehensive documentation and sharing of data, code, and workflow to ensure the reproducibility of a specific project. 4. Reproducible research with restricted data. 5. Replication of published studies and meta-analysis.
Note. Broen et al.’s (2021) report was not included in the online resource collection at the authors’ request; it will be added at a later point.
Definition and quantification of reproducibility. As we noted above, clearly defining and measuring reproducibility is an important topic in the scholarly investigation of reproducibility and contributes to our understanding of the extent of controllable and uncontrollable variations. Only one submission touched on this topic, however, as the focus of most of our researchers was more practically oriented (Zhao et al., 2020).
Reducing variations in study design and standardizing initial conditions. These submissions went beyond the narrow definition of computational reproducibility and focused on standardizing some of the many choices that researchers must make. They included: 1) tutorials for statistical inference with propensity score methods, 2) preregistration of study protocols and sensitivity analysis for variant decisions, 3) calibration of initial conditions in astronomy research, 4) developing principles for standardized and interoperable data and knowledge representation, and 5) optimizing the choice of heuristics for different types of problem instances (Adorf et al., 2018; Brown & Gnedin, 2021; Dunning et al., 2018; Fisher, 2018, 2019; He et al., 2018; Rabbi et al., 2017; Ross et al., 2021; Song et al., 2019). Collectively, they could improve reproducibility and replicability by helping researchers standardize the analytical pipeline and build consistency across studies about how statistical assumptions and analytical decisions are made. For example, Ross et al. (2021) developed their tutorial partly because using propensity score–based methods can help prevent p-hacking and provide consistency across studies on how confounding biases in medical claims data are handled, thus making results more reproducible within a study and replicable across studies.
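For readers unfamiliar with the technique, the following minimal Python sketch (our own illustration, not the Ross et al. tutorial; the function and variable names are hypothetical) shows the core steps of propensity score weighting: model the probability of treatment from observed confounders, then weight units by the inverse of that probability before comparing outcomes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_effect_estimate(X, treated, outcome):
    """Estimate an average treatment effect via inverse-probability weighting.

    X       : (n, p) array of observed confounders
    treated : (n,) binary treatment indicator
    outcome : (n,) observed outcome
    """
    treated = np.asarray(treated)
    outcome = np.asarray(outcome)

    # Step 1: model the probability of treatment given confounders (the propensity score).
    ps = LogisticRegression(max_iter=1000).fit(X, treated).predict_proba(X)[:, 1]

    # Step 2: weight each unit by the inverse probability of the treatment it received.
    w = np.where(treated == 1, 1.0 / ps, 1.0 / (1.0 - ps))

    # Step 3: compare weighted outcome means; under standard assumptions the weights
    # balance confounders between groups, mimicking a randomized comparison.
    treated_mean = np.average(outcome[treated == 1], weights=w[treated == 1])
    control_mean = np.average(outcome[treated == 0], weights=w[treated == 0])
    return treated_mean - control_mean
```

Standardizing how such a model is specified, diagnosed, and reported is exactly the kind of analytical choice these submissions aim to make consistent across studies.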
Comprehensive documentation and sharing of data, code, and workflow to ensure the reproducibility of a specific project. This theme included the largest number of submissions, highlighting the efforts of researchers across domains to archive entire projects. The submissions included examples of complete code sharing and systematic approaches to making code and data easy to access, making the entire analysis available to other researchers, ensuring the computational environment is archived, and making the code useful to a wide audience (Alakwaa & Savelieff, 2019; da Veiga Leprevost et al., 2017; Gagnon-Bartsch & Shem-Tov, 2019; Hunt et al., 2019, 2020; Rabbi et al., 2017; Valley et al., 2020). The submissions also illustrated policy and associated resources for data and code sharing, such as the data and code sharing policy and repository of the American Economic Association.
The tools that our researchers shared relied heavily on a few common platforms, notably GitHub, Docker containers, and R packages. Researchers also submitted novel tools that they developed, including a system that automatically captures data transformations and updates the metadata; functions that capture tacit information in data sets and express it explicitly; a customized and complete workflow for data generation, analytics, and sharing; a platform to help researchers choose from a multitude of bioinformatics analytics tools; and a platform to help researchers replicate results from educational research (Adorf et al., 2018; Fisher, 2018, 2019; Gardner et al., 2018; Song et al., 2019; also see MiCV, an online tool for automated single-cell sequencing data processing). In addition, some teams stressed the importance of communication among collaborators and of using independent evaluation to ensure the validity of the study design and the code (Waljee et al., 2018; Valley et al., 2020).
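As a small illustration of the idea of codifying tacit knowledge in functions (the submitted report works in R; this Python analogue, including its cut points and names, is our own hypothetical sketch), a shared, documented function turns an otherwise invisible recoding decision into something every collaborator applies identically:

```python
import pandas as pd

def recode_income_bracket(income: pd.Series) -> pd.Series:
    """Collapse raw annual income (USD) into the brackets used in our analyses.

    Codifying the cut points in one shared function documents a decision that
    would otherwise live only in a researcher's head or in ad hoc scripts,
    and guarantees that every collaborator applies the same rule.
    """
    bins = [-float("inf"), 25_000, 75_000, float("inf")]  # hypothetical cut points
    labels = ["low", "middle", "high"]
    return pd.cut(income, bins=bins, labels=labels)

# Usage: df["income_bracket"] = recode_income_bracket(df["income"])
```

Packaging such functions alongside the analysis code is one concrete way to make downstream analyses easier to reproduce.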
Reproducible research with restricted data. When data cannot be openly shared due to privacy concerns, reproducing research studies is more challenging. This includes not only medical data but also a wide range of other data about people: student data, personal finance data, and so on. We did not specifically seek reports on this theme, but in hindsight, it should have been a very important consideration. Three of the submissions dealt with this issue through the adoption of differential privacy, the evaluation of predictive models across studies, and effective communication between study teams that attempted to reproduce others’ work (Broen et al., 2021; Gardner et al., 2018; Waljee et al., 2018).
Replication of results in published studies and meta-analysis. Some of the teams reported their efforts to replicate results from published work in their fields and to understand the reasons behind reproducibility failures (Dunning et al., 2018; Gardner et al., 2018; Zhao et al., 2020).
Some other equally important issues of reproducibility were not touched upon in this Reproducibility Challenge, such as the determination of sources of uncertainty and defining the limit of reproducibility (to what extent different types of results can be reproduced). With our bottom-up approach, we did not aim to be comprehensive. Instead, we hoped, and indeed were able, to identify topics that our researchers focused on. Through this approach, we have gained a good understanding of their priorities and valuable lessons on how we can help them overcome obstacles.
One of our main goals was to identify gaps that hinder data science and AI research reproducibility, whether they originate from poor understanding of the barriers, the lack of tried-and-true workflows and tools, the paucity of guidelines, challenges with restricted data, or lack of incentives. It was clear to us that the Challenge submissions reflected the lack of tried-and-true workflows and tools to ensure reproducibility. The vast majority of the submissions focused on the practical aspects of reproducibility: tools and methods to document and share data, code, and workflow; methods to reduce errors and variations in study design; methods to validate their own results and others’ results; and ways to understand the factors that contribute to irreproducible results. In other words, our Reproducibility Challenge revealed an admirable amount of effort from researchers toward ‘actionable reproducibility.’
Both for the purpose of judging submissions and for helping to prioritize tools for the future, we began to ask a few targeted questions about many of the approaches we encountered, all of which revolved around a central theme: it is inefficient, sometimes error-prone, and often ineffective for researchers to each develop their own tools.
How generalizable are the tools for others to use, and how interoperable are they? This varied greatly among the submissions. Some tools and platforms were designed for very specific research problems, such as simulations for a specific research question in astronomy; others could benefit research with specific data types, in specific research domains (for example, cell biology or student achievement data in MOOC courses), or with specific types of analytics (for example, machine learning or optimization). Following this observation, questions arise about how generalizable a tool can be and who should be responsible for generalizing it; in the next section we will argue that universities and their research institutes should play a role.
How high is the barrier for people to use a certain tool? This concerns how easy it is for others to learn to use the tools. We saw some tools that have been thoroughly developed, with documentation and a standard procedure available for users. At the other end of the spectrum, some tools and platforms require extensive exploration by new users. In addition, the technical skills needed to explore the tools vary widely. In some cases, it seems that only those with high levels of expertise can successfully navigate such tools, which are therefore bound to have a small following.
How extensively is a tool validated? This also varied greatly. At one extreme, we encountered broken web links and error messages at the very beginning of the workflow; at the other, some tools had been used extensively by collaborators and even competing research groups. Most fell somewhere in between. Insufficient validation is also a major obstacle to researchers adopting tools developed by others. Who is responsible for the validation? In the next section we argue again that universities and their research institutes should take on this responsibility together with the developers of the tools.
How do we avoid black boxes? This might be one of the trickiest questions. A black box might be easy to adopt and require minimal technical expertise to use. On the other hand, it might also deter users because they cannot validate or modify what it does. There will need to be a balance between making tools easy to use and making it possible for users to inspect, validate, and modify them.
In light of our observations, we can consider the original question that prompted the Reproducibility Challenge: What is the role of universities and their research institutes to improve research reproducibility? An important role could be to collaborate with the research community and provide resources and efforts to improve and generalize tools and to enable and encourage their wide adoption.
As McNutt argued, ensuring reproducibility requires “self-correction by design” (McNutt, 2020). The roles of researchers, scientific journals, professional societies, and funding agencies in this process have been debated extensively.
Funding agencies, professional societies, and scientific journals are in a great position to promote open science, incentivize reproducible research, and censure questionable practices through stringent requirements for publications and grant applications.
Faculty researchers run small research groups; they may obtain data from nonstandard instruments and in highly specialized ways, and they report outputs in many formats. Their effort is largely devoted to getting grants and generating publications, and their research funding needs to be spent on producing results. This research model leaves them little in the way of resources or time to produce complete documentation and to develop and implement best practices. When they develop ways to improve research reproducibility, it is natural that they focus only on what is needed for their projects or their research group, which is more an academic exercise than building a product for ‘mass consumption.’ “Another flaw in the human character is that everybody wants to build and nobody wants to do maintenance,” says Kurt Vonnegut in Hocus Pocus, and in the case of reproducible research, he speaks not only to human nature but also to the reality of the highly competitive research environment.
Such factors contribute to a reality in which vast variability in research reproducibility exists despite numerous guidelines and requirements, because researchers’ individual efforts vary greatly. Some understand the importance of reproducible research but do not have the means to turn such understanding into practice; some employ methods that give questionable outcomes; while some are indeed able to make their research fully reproducible.
We believe that our Reproducibility Challenge revealed a way to help fill this gap, which is to mobilize universities and their research and support units to translate ‘why’ and ‘what’ into ‘how’ by refining tools that were developed by the researchers, rendering them more adaptable and easier for researchers to plug into their workflows (Figure 1). In the case of data science, the research units could include research centers like MIDAS, data offices on campus, and the university libraries that run data repositories. In addition, such units are in a good place to provide training that will enable the wide adoption of such tools. Our online reproducibility resource hub has been viewed ~7,600 times since it went live at the conclusion of the Reproducibility Challenge, which is a strong indication of the need for this resource and for related training.
Just as data access, data repository, and the support for data curation are becoming indispensable components of the research infrastructure across universities, tools for reproducing research projects and for effective data and code sharing should also become standard resources for researchers. Just as researchers can often receive consultation for study designs and statistical analyses, consultation for reproducible research should also be available. And just as many research centers and institutions train researchers to develop data science and AI skills, they could also help researchers build skills to make research more reproducible. Such efforts will not only allow researchers to meet the requirements of funders and journals, but will also elevate the quality of research across an institution and avoid repeatedly reinventing the same tools. Researchers who participated in the Reproducibility Challenge indicated three main reasons why they have been involved in such work: 1) They believe that making their research reproducible is their responsibility as scientists. 2) Developing workflows, tools, and repositories to make their work reproducible, even with significant effort to start with, can greatly improve the efficiency of their own research and that of their collaborators in the long run. 3) They care deeply about the harm that irreproducible results would cause to their field of study. We believe these values are shared widely by researchers, and that universities’ support to ease the hurdle of adopting best practices should meet their enthusiasm.
Universities and their research institutes can play complementary roles. We advocate that universities, together with funding agencies, should invest in resources for research reproducibility the same way they invest in core facilities such as computing facilities and data centers. This will likely have a strong return on investment, because it will save researchers time that can be spent on acquiring grants and on other projects, and because these resources will reassure funding agencies about the quality of the work being done. Research and support units, such as data science centers and library data repositories, can then coordinate their efforts and use the resources to develop ‘products’ for reproducible research and enable their wide adoption.
Recommendation 1. Universities, their data science institutes, and other data science–related units should coordinate to help standardize, generalize, validate, and disseminate the methods and tools researchers produce across application domains, thereby attaining actionable reproducibility. These tasks cannot be accomplished by distributed efforts of individual researchers—some of them lack the expertise to ensure their own research is reproducible; others may have the expertise but seldom have time or funding to make their approaches generalizable or help others adopt them. Universities should regard resources and efforts that support reproducibility as an essential part of the research infrastructure, similar to other core facilities that support critical components of research, such as patient recruitment, data management, data repository, statistical analysis, and computing. Producing these tools should also be viewed as a positive factor for faculty tenure and promotion.
Recommendation 2. Through training and providing examples and protocols, universities and their data science units can help researchers adopt methods and tools to improve the reproducibility of their research. A lack of best practices does not automatically mean that the researchers do not understand the importance of making their work reproducible; rather, it often simply reflects a lack of technical skills, sufficient effort, and resources to develop suitable tools for themselves, or a lack of suitable tools that can be easily adopted. It is impractical to place this burden on individual researchers; instead, systematic effort from universities and their research organizations can go a long way.
Recommendation 3. Provide motivation for researchers to adopt best practices and suitable tools. The ultimate goal of promoting reproducible research is to ensure that researchers can do better science, hopefully with efficiency and effectiveness. All else being equal, reproducible research will have greater scientific impact, which is to the benefit of both the individual researcher and the funders. Promoting best practices and tools for reproducible research will be more effective through the demonstration of how they benefit research in the long run.
Recommendation 4. For all who develop and refine tools with the goal of mass adoption, strike the right balance between avoiding black boxes and making the tools easy to use. This will need to be achieved through first gaining a better understanding of the users’ research goals and their technical skills, followed by mindful practices when developing tools and training researchers to use such tools.
Recommendation 5. Offer consultation to researchers to improve the reproducibility of their research, including recommending tools and reviewing and auditing code and other components of the workflow. This is a goal that MIDAS is still striving toward, so we do not yet have experience to share. Some universities have already started such efforts, for example, through the hiring of ‘reproducibility librarians.’ Again, this will be possible only if universities invest resources in reproducible research and regard them as indispensable components of the research infrastructure.
We also recognize that the support for making research more reproducible may vary greatly from one institution to another, and from one research center to another, depending on many factors, including the characteristics of the research, the availability of resources, and the skill levels of the researchers these centers work with. Our Reproducibility Challenge is but one small effort to understand the landscape of our researchers’ interest and activity in reproducible research. Each institution and research center will need to identify unique ways to maximize the impact of their effort in this space.
The focus of our next steps is to collaborate with researchers to validate and standardize some of the tools that they developed, make them interoperable for data scientists across research domains, and offer training to investigators to use such tools. We believe this is the crucial component of enabling the wide adoption of reproducible research best practices. Our collection of tools and best practice examples so far is merely the starting point, and it will take concerted effort to transform them into easy-to-follow tools and put them in the hands of researchers with diverse technical backgrounds, employing different terminologies for similar analyses. We are committed to continuing this effort.
We thank Matthew Kay for his work on the coordination of the Reproducibility Challenge and as one of the judges. We also thank all the research teams who entered the Reproducibility Challenge for their effort to make data science research more reproducible and for sharing their approaches and insight with the research community.
Brian Puchala acknowledges support by the U.S. Department of Energy, Office of Basic Energy Sciences, Division of Materials Sciences and Engineering, under award DE-SC0008637 as part of the Center for PRedictive Integrated Structural Materials Science (PRISMS Center).
Aczel, B., Szaszi, B., Sarafoglou, A., Kekecs, Z., Kucharský, Š., Benjamin, D., Chambers, C. D., Fisher, A., Gelman, A., Gernsbacher, M. A., Ioannidis, J. P., Johnson, E., Jonas, K., Kousta, S., Lilienfeld, S. O., Lindsay, D. S., Morey, C. C., Munafò, M., Newell, B. R., . . . Wagenmakers, E.-J. (2020). A consensus-based transparency checklist. Nature Human Behaviour, 4(1), 4–6. https://doi.org/10.1038/s41562-019-0772-6
Adorf, C. S., Dodd, P. M., Ramasubramani, V., & Glotzer, S. C. (2018). Simple data and workflow management with the signac framework. Computational Materials Science, 146, 220–229. https://doi.org/10.1016/j.commatsci.2018.01.035
Alakwaa, F. M., & Savelieff, M. G. (2019). Bioinformatics analysis of metabolomics data unveils association of metabolic signatures with methylation in breast cancer. Journal of Proteome Research, 19(7), 2879–2889. https://doi.org/10.1021/acs.jproteome.9b00755
Alberts, B., Cicerone, R., Fienberg, S., Kamb, A., McNutt, M., Nerem, R., Schekman, R., Shiffrin, R., Stodden, V., Suresh, S., Zuber, M. T., Pope, B. K., & Jamieson, K. H. (2015). Self-correction in science at work: Improve incentives to support research integrity. Science, 348(6242), 1420–1422. https://doi.org/10.1126/science.aab3847
Alter, G., & Gonzalez, R. (2018). Responsible practices for data sharing. American Psychologist, 73(2), 146–156. https://doi.org/10.1037/amp0000258
Baker, M. (2016). 1,500 scientists lift the lid on reproducibility. Nature, 533(7604), 452–454. https://doi.org/10.1038/533452a
Boulbes, D. R., Costello, T., Baggerly, K., Fan, F., Wang, R., Bhattacharya, R., Ye, X., & Ellis, L. M. (2018). A survey on data reproducibility and the effect of publication process on the ethical reporting of laboratory research. Clinical Cancer Research, 24(14), 3447–3455. https://doi.org/10.1158/1078-0432.ccr-18-0227
Brito, J. J., Li, J., Moore, J. H., Greene, C. S., Nogoy, N. A., Garmire, L. X., & Mangul, S. (2020). Recommendations to enhance rigor and reproducibility in biomedical research. GigaScience, 9(6), Article giaa056. https://doi.org/10.1093/gigascience/giaa056
Broen, K., Trangucci, R., & Zelner, J. (2021). Measuring the impact of spatial perturbations on the relationship between data privacy and validity of descriptive statistics. International Journal of Health Geographics, 20, Article 3. https://doi.org/10.1186/s12942-020-00256-8
Brown, G., & Gnedin, O.Y. (2021). Improving performance of zoom-in cosmological simulations using initial conditions with customized grids. New Astronomy, 84(2), Article 101501. https://doi.org/10.1016/j.newast.2020.101501
Collins, F. S., & Tabak, L. A. (2014). Policy: NIH plans to enhance reproducibility. Nature News, 505(7485), 612–613. https://doi.org/10.1038/505612a
da Veiga Leprevost, F., Grüning, B. A., Alves Aflitos, S., Röst, H. L., Uszkoreit, J., Barsnes, H., Vaudel, M., Moreno, P., Gatto, L., Weber, J., Bai, M., Jimenez, R. C., Sachsenberg, T., Pfeuffer, J., Alvarez, R. V., Griss, J., Nesvizhskii, A. I., & Perez-Riverol, Y. (2017). BioContainers: An open-source and community-driven framework for software standardization. Bioinformatics, 33(16), 2580–2582. https://doi.org/10.1093/bioinformatics/btx192
Dunning, I., Gupta, S., & Silberholz, J. (2018). What works best when? A systematic evaluation of heuristics for Max-Cut and QUBO. INFORMS Journal on Computing, 30(3), 608–624. https://doi.org/10.1287/ijoc.2017.0798
Fisher, J. C. (2018). Exit, cohesion, and consensus: Social psychological moderators of consensus among adolescent peer groups. Social Currents, 5(1), 49–66. https://doi.org/10.1177/2329496517704859
Fisher, J. C. (2019). Data-specific functions: A comment on Kindel et al. Socius, 5. https://doi.org/10.1177/2378023118822893
Frery, A. C., Gomez, L., & Medeiros, A. C. (2020). A badging system for reproducibility and replicability in remote sensing research. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 13, 4988–4995. https://doi.org/10.1109/JSTARS.2020.3019418
Gagnon-Bartsch, J., & Shem-Tov, Y. (2019). The classification permutation test: A flexible approach to testing for covariate imbalance in observational studies. The Annals of Applied Statistics, 13(3), 1464–1483. https://doi.org/10.1214/19-AOAS1241
Gardner, J., Brooks, C., Andres, J. M., & Baker, R. (2018). Replicating MOOC predictive models at scale. In R. Luckin, S. Klemmer, & K. R. Koedinger (Eds.), Proceedings of the Fifth Annual ACM Conference on Learning at Scale (Article 1). ACM. https://doi.org/10.1145/3231644.3231656
Goeva, A., Stoudt, S., & Trisovic, A. (2020). Toward reproducible and extensible research: From values to action. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.1cc3d72a
Hardwicke, T. E., Bohn, M., MacDonald, K., Hembacher, E., Nuijten, M. B., Peloquin, B. N., DeMayo, B. E., Long, B., Yoon, E. J., & Frank, M. C. (2021). Analytic reproducibility in articles receiving open data badges at the journal Psychological Science: An observational study. Royal Society Open Science, 8(1), Article 201494. https://doi.org/10.1098/rsos.201494
He, Y., Xiang, Z., Zheng, J., Lin, Y., Overton, J. A., & Ong, E. (2018). The eXtensible ontology development (XOD) principles and tool implementation to support ontology interoperability. Journal of Biomedical Semantics, 9(1), Article 3. https://doi.org/10.1186/s13326-017-0169-2
Hunt, G. J., Dane, M. A., Korkola, J. E., Heiser, L. M., & Gagnon-Bartsch, J. A. (2020). Automatic transformation and integration to improve visualization and discovery of latent effects in imaging data. Journal of Computational and Graphical Statistics, 29(4), 929–941. https://doi.org/10.1080/10618600.2020.1741379
Hunt, G. J., Freytag, S., Bahlo, M., & Gagnon-Bartsch, J. A. (2019). Dtangle: Accurate and robust cell type deconvolution. Bioinformatics, 35(12), 2093–2099. https://doi.org/10.1093/bioinformatics/bty926
Kitzes, J., Turek, D., & Deniz, F. (2017). The practice of reproducible research, case studies and lessons from the data-intensive sciences. University of California Press. https://www.ucpress.edu/book/9780520294752/the-practice-of-reproducible-research
Laurinavichyute, A., Yadav, H., & Vasishth, S. (2022). Share the code, not just the data: A case study of the reproducibility of articles published in the Journal of Memory and Language under the open data policy. Journal of Memory and Language, 125, Article 104332. https://doi.org/10.1016/j.jml.2022.104332
Madduri, R., Chard, K., D’Arcy, M., Jung, S.C., Rodriguez, A., Sulakhe, D., Deutsch, E., Funk, C., Heavner, B., Richards, M., Shannon, P., Glusman, G., Price, N., Kesselman, C., & Foster, I. (2019). Reproducible big data science: A case study in continuous FAIRness. PloS One, 14, Article e0213013. https://doi.org/10.1371/journal.pone.0213013
McNutt, M. (2014). Journals unite for reproducibility. Science, 346(6210), 679. https://doi.org/10.1126/science.aaa1724
McNutt, M. (2020). Self-correction by design. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.32432837
Meng, X.-L. (2020). Reproducibility, replicability, and reliability. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.dbfce7f9
Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D., Du Sert, N. P., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1, Article 0021. https://doi.org/10.1038/s41562-016-0021
National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and replicability in science. The National Academies Press. https://nap.nationalacademies.org/catalog/25303/reproducibility-and-replicability-in-science
Nosek, B. A., Alter, G., Banks, G. C., Borsboom, D., Bowman, S. D., Breckler, S. J., Buck, S., Chambers, C. D., Chin, G., Christensen, G., Contestabile, M., Dafoe, A., Eich, E., Freese, J., Glennerster, R., Goroff, D., Green, D. P., Hesse, B., & Humphreys, M. (2015). Promoting an open research culture. Science, 348(6242), 1422–1425. https://doi.org/10.1126/science.aab2374
Nosek, B. A., Hardwicke, T. E., Moshontz, H., Allard, A., Corker, K. S., Almenberg, A. D., Fidler, F., Hilgard, J., Kline, M., Nuijten, M. B., Rohrer, J. M., Romero, F., Scheel, A. M., Scherer, L., Schönbrodt, F., & Vazire, S. (2021). Replicability, robustness, and reproducibility in psychological science. PsyArXiv. https://doi.org/10.31234/osf.io/ksfvq
Rabbi, M., Philyaw-Kotov, M., Klasnja, P., Bonar, E., Nahum-Shani, I., Walton, M., & Murphy, S. (2017). SARA-Substance Abuse Research Assistant. https://clinicaltrials.gov/ct2/show/NCT03255317
Ross, R. D., Shi, X., Caram, M. E., Tsao, P. A., Lin, P., Bohnert, A., Zhang, M., & Mukherjee, B. (2021). Veridical causal inference using propensity score methods for comparative effectiveness research with medical claims. Health Services and Outcomes Research Methodology, 21(2), 206–228. https://doi.org/10.1007/s10742-020-00222-8
Song, J., Alter, G., & Jagadish, H. (2019). C2Metadata: Automating the capture of data transformations from statistical scripts in data documentation. In P. Boncz & S. Manegold (Eds.), SIGMOD/PODS '19: Proceedings of the 2019 International Conference on Management of Data (pp. 2005–2008). ACM. https://doi.org/10.1145/3299869.3320241
Stodden, V. (2020). The data science life cycle: A disciplined approach to advancing data science as a science. Communications of the ACM, 63(7), 58–66. https://doi.org/10.1145/3360646
Stodden, V., Seiler, J., & Ma, Z. (2018). An empirical analysis of journal policy effectiveness for computational reproducibility. Proceedings of the National Academy of Sciences, 115(11), 2584–2589. https://doi.org/10.1073/pnas.1708290115
Valley, T. S., Kamdar, N., Wiitala, W. L., Ryan, A. M., Seelye, S. M., Waljee, A. K., & Nallamothu, B. K. (2020). Statistical code sharing: A guide for clinical researchers. Unpublished manuscript.
Waljee, A. K., Lipson, R., Wiitala, W. L., Zhang, Y., Liu, B., Zhu, J., Wallace, B., Govani, S. M., Stidham, R. W., Hayward, R., & Higgins, P. D. R. (2018). Predicting hospitalization and outpatient corticosteroid use in inflammatory bowel disease patients using machine learning. Inflammatory Bowel Diseases, 24(1), 45–53. https://doi.org/10.1093/ibd/izx007
Waller, L. A., & Miller, G. W. (2016). More than manuscripts: Reproducibility, rigor, and research productivity in the big data era. Toxicological Sciences, 149(2), 275–276. https://doi.org/10.1093/toxsci/kfv330
Zhao, Y., Sampson, M. G., & Wen, X. (2020). Quantify and control reproducibility in high-throughput experiments. Nature Methods, 17(12), 1207–1213. https://doi.org/10.1038/s41592-020-00978-4
©2022 Jing Liu, Jacob Carlson, Joshua Pasek, Brian Puchala, Arvind Rao, and H. V. Jagadish. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.