Data management, which encompasses activities and strategies related to the storage, organization, and description of data and other research materials, helps ensure the usability of data sets—both for the original research team and for others. For librarians and other service providers, describing data management as an integral part of a research process or workflow may help contextualize the importance of related resources, practices, and concepts for researchers who may be less familiar with them. Because data management practices and strategies overlap with those related to research reproducibility and open science, presenting these concepts together may help researchers realize their benefit at the laboratory level.
Keywords: research data management, open science, reproducibility, data sharing
The term ‘research data management’ covers a range of activities related to how researchers save, organize, and describe the materials they work with over the course of a research project. Though often overlooked in formal coursework, data management is a necessary part of maintaining a record of what was done, when, and by whom. In this article, we discuss the connections between data management, reproducibility in science, and open science and provide some guidance for realizing the value of data management from the laboratory perspective.
Research data management includes a broad range of activities and strategies related to the storage, organization, and description of data and related research materials. Data management–related practices overlap substantially with those involved in data wrangling and data curation, which similarly involve rendering data sets into forms that ensure their usability by researchers and computational tools. But data management is broader, encompassing practices throughout the entire research lifecycle—from planning that occurs well before data is acquired through the stewardship of data and other materials well after the conclusion of a research effort.
A typical workshop introducing data management to researchers may cover the importance of applying standardized file names and directory organization schemes; best practices related to documentation, standards, and metadata; how to choose appropriate methods for backing up and archiving data; data management–related mandates and requirements (e.g., data management plans); and issues related to data security, privacy, licensing, and citation. Depending on the focus of the workshop, software-related practices such as version control, managing dependencies, and sharing code and computational environments may also be included. This broad range of topics reflects the importance of data management at two distinct but interrelated levels:
For individuals and teams who need to keep track of data, materials, processes, and procedures over the course of a given research effort.
For the broader research enterprise that is invested in ensuring the integrity of the research process and reusability of data sets.
For individual researchers and teams, data management has an array of immediate and longer-term benefits. Good data management helps with quality control and prevents data from being lost or made inaccessible. Proper data management also increases the efficiency of the research process and collaborative work as every member of a research team is able to access and use the data, code, documentation, and other materials they need. For the broader research community, well-managed data sets can be more easily examined, (re)used, and built upon than those that have not been well organized or lack sufficient documentation. Because it is a necessary component of establishing what was done, when, and by whom, data management is also integral to establishing a record of the research process—a prerequisite for ensuring research integrity.
Data management is also, increasingly, a requirement. In 2023, the National Institutes of Health (NIH) will implement an updated data management and sharing policy (National Institutes of Health, 2020). Like similar policies implemented by other research funders including the National Science Foundation (NSF) and Patient-Centered Outcomes Research Institute (PCORI), this policy will require that researchers submit a data management plan—a short document outlining how they plan to manage their data and make it available to others—as part of any grant proposal. Other research-data stakeholders, including scholarly publishers and research institutions, have also begun to implement policies related to data management. For example, the Stanford University data retention policy stipulates that principal investigators “Should adopt an orderly system of data organization and should communicate the chosen system to all members of a research group and to the appropriate administrative personnel” (Stanford University, 1997).
Though the emphasis of the NIH policy on “maximizing the appropriate sharing of data” (National Institutes of Health, 2020) represents a significant step forward for the availability of data and other materials in the biomedical and health sciences, the policy landscape related to research data remains quite heterogeneous between and even within different data stakeholders (see Briney et al., 2015; Gaba et al., 2020). To encompass a wide range of practices and standards developed across the research community, implementation details are often left relatively general. However, because data management requires researchers to think prospectively about their practices, it provides an opportunity for researchers, librarians, policymakers, and other stakeholders to promote emerging scholarly communication practices such as the open sharing of articles, data sets, code, and other materials.
Data management is an iterative and continuous process: related practices are implemented during the day-to-day course of a research effort, and decisions made at early stages substantially affect what can be done later. A growing body of guidance and resources has been developed for researchers who need to manage the data and other materials associated with their work (e.g., Borghi et al., 2018; Briney et al., 2020; Broman & Woo, 2018; A. Goodman et al., 2014; Stoudt et al., 2021; Wilson et al., 2017). However, implementing proper data management is often quite complex in practice.
Consider as an example an experiment in which a human participant responds to a series of stimuli presented on a computer. Even this relatively straightforward setup presents substantial challenges for researchers trying to manage their data and other materials. Output files, containing response data for each participant, need to be cleaned and combined prior to analysis. Research software, which may include code created or adapted by the research team, is needed to present the stimuli, record participant responses, conduct data analyses, and create visualizations. Depending on the experiment, data and code may be accompanied by other research objects, including consent forms, study stimuli, or paper surveys or questionnaires. Making use of all this requires documentation which, at the very least, describes data collection and analysis procedures (i.e., protocols), the contents of data files (i.e., data dictionaries), and details of the computational environment needed to run related code and software.
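To make the data wrangling step concrete, the following sketch (in Python, using the pandas library) shows one way such per-participant output files might be cleaned and combined into a single analysis-ready table. The file layout, column names, and exclusion threshold are hypothetical placeholders rather than details of any particular experiment.

```python
from pathlib import Path

import pandas as pd

# Hypothetical layout: one CSV of trial-level responses per participant,
# e.g., data/raw/sub-001_responses.csv, data/raw/sub-002_responses.csv, ...
RAW_DIR = Path("data/raw")

# Assumes at least one matching file exists.
frames = []
for csv_file in sorted(RAW_DIR.glob("sub-*_responses.csv")):
    df = pd.read_csv(csv_file)
    # Record provenance: which raw file each row came from.
    df["source_file"] = csv_file.name
    frames.append(df)

combined = pd.concat(frames, ignore_index=True)

# Example cleaning step: drop trials with implausible reaction times.
# The 150-2000 ms window is an arbitrary placeholder threshold.
combined = combined[combined["reaction_time_ms"].between(150, 2000)]

# Save the cleaned, combined data set separately, leaving the raw files untouched.
Path("data/processed").mkdir(parents=True, exist_ok=True)
combined.to_csv("data/processed/all_participants_clean.csv", index=False)
```

A script like this, kept under version control alongside the data dictionary that defines its columns, also serves as documentation of exactly how the combined data set was produced.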
For the research team, all of these objects—data sets, code, research materials, documentation, and so on—need to be properly managed even if there is no intention of sharing them outside of the research group. Doing so properly requires extensive planning and, as summarized in Table 1, consideration of manifold issues and activities. Despite the importance of data management, related practices and strategies are often not covered in coursework (Tenopir et al., 2016) and are instead learned informally from peers (Borghi & Van Gulick, 2018, 2021). As a result, even when individual researchers exercise relatively good data management, their practices may not be standardized between projects or within their own research group.
Table 1.

| | Project Planning | Data Collection and Analysis | Data Publication and Sharing |
|---|---|---|---|
| Data Management Activities | Data management planning | Saving and backing up files; organizing files; formatting and describing data according to standards; maintaining documentation and metadata | Preserving data and other materials (e.g., reagents, code); assigning persistent identifiers |
| Open Science Activities | Planning for openness (e.g., including data sharing in consent forms) | Using open source tools; using transparent methods and protocols | Sharing data; publishing research reports openly (e.g., open access publishing) |
| Reproducibility-Related Activities | Preregistering study aims and methods; using appropriate research designs (including sufficient statistical power) | Preventing methodological issues (e.g., p-hacking, HARKing); implementing quality control measures | Preventing publication bias; following reporting guidelines |
Note. The terms describing different stages are defined quite broadly. Project planning encompasses both the development of research questions and practical steps. Data collection and analysis could involve researchers generating data or acquiring data initially collected by others. Data publication and sharing may involve describing results in a manuscript or sharing data sets through a repository. This list is not exhaustive but is intended to demonstrate that activities related to data management, reproducibility, and open science are contiguous.
Academic libraries have positioned themselves as a source of guidance for researchers on topics related to data management (Cox et al., 2019; Tang & Hu, 2019; Tenopir et al., 2014) as have other stakeholders, including scholarly publishers, funding agencies, and data repositories. However, librarians, researchers, and other data stakeholders have different perspectives and incentives, which complicates communication about standards and best practices. For example, a librarian may recommend that a researcher deposit data and code in repositories with robust preservation strategies where materials are described with rich metadata and assigned persistent identifiers to help ensure discoverability and citability. A researcher who typically works with data they have acquired themselves and is primarily incentivized to publish high-impact papers may lack the context necessary to appreciate the importance of such practices, or may see them as unnecessary extra steps.
One entry point for overcoming differences in perspective when discussing data management is to focus on research processes or workflows. In this context, a workflow is defined as the series of programmatic steps or practical ‘ways of doing things’ as data is collected, processed, and analyzed. As shown in Figure 1, there are multiple points in any research effort where researchers make decisions that can significantly affect their results. A research workflow specifies the steps a research team actually takes among many possibilities and directions not taken. Though the term most often refers to data cleaning and analysis procedures, a well-defined workflow may also include standard operating procedures for organizing files, preserving backups, maintaining documentation (e.g., protocols, data dictionaries) and research-related code, as well as guidelines for how the data should be made available to others (e.g., how it should be formatted, described, and organized, and what repositories should be used).
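As a rough illustration, a standard operating procedure for organizing files might be expressed in a short script that every project reuses. The directory layout and naming pattern below are assumptions chosen for the example, not a community standard.

```python
from datetime import date
from pathlib import Path

# Hypothetical top-level layout agreed upon by the research team.
PROJECT_DIRS = ["data/raw", "data/processed", "code", "docs", "results"]


def create_project_skeleton(root: str) -> None:
    """Create the agreed-upon directory structure under `root`."""
    for subdir in PROJECT_DIRS:
        Path(root, subdir).mkdir(parents=True, exist_ok=True)


def make_filename(project: str, participant_id: str, version: int) -> str:
    """Build a file name following a simple, documented convention:
    <project>_<participant>_<YYYY-MM-DD>_v<version>.csv
    """
    return f"{project}_{participant_id}_{date.today().isoformat()}_v{version}.csv"


if __name__ == "__main__":
    create_project_skeleton("stimulus_study")
    print(make_filename("stimstudy", "sub-001", 1))
    # e.g., stimstudy_sub-001_2022-01-15_v1.csv
```

Writing the convention down, whether as a small script like this or simply in a README, makes it easier for every team member to follow and to check.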
Using research workflows to frame data management puts related practices into a context familiar to researchers. Ideally, much of a workflow is planned in advance and builds toward fulfilling the goals of the research effort. Incorporating data management into research workflows may help related practices be seen as part of the day-to-day task of doing research rather than ‘extra work.’
For librarians and other stakeholders, conceptualizing the research workflow as the locus of support can help reinforce that research data management is about data and other materials that are in process, meaning that they are acquired, processed, analyzed, archived, and shared in order to be acted upon in some way. ‘Best practices’ from the data stewardship perspective need to be balanced with a constellation of other motivations, incentives, and needs, such that the resulting ‘good enough’ practices enable the research team to complete all of their goals, including the curation and dissemination of accessible and (re)usable data sets.
Critical to the realization of the benefits of data management, both for a research team and for the research enterprise more generally, is the development and adoption of standards. A standard specifies how exactly data and related materials should be stored, organized, and described. In the context of research data management, the term typically refers to the use of specific and well-defined formats, schemas, vocabularies, and ontologies in the description and organization of data. However, for researchers within a community where more formal standards have not been well established, it can also be interpreted more broadly to refer to the adoption of the same (or similar) data management-related activities or strategies by different researchers and across different projects.
Formal data standards are developed and maintained by an array of data stakeholders, including the research community itself. For example, the neuroimaging community developed the Brain Imaging Data Structure (BIDS) (RRID:SCR_016124; Gorgolewski et al., 2016) to standardize the description and organization of raw MRI data. BIDS was subsequently extended to other imaging modalities (i.e., EEG, MEG, PET) and is now integral to an ecosystem of quality assurance, processing, and analysis tools (Gorgolewski et al., 2017). The International Neuroinformatics Coordinating Facility (INCF), a standards organization dedicated to open and FAIR neuroscience (Abrams et al., 2021), provides support for BIDS and formally endorsed it as a standard in 2019.
The use of common data elements (CDEs), wherein well-defined questions (variables) are paired with a discrete set of allowable responses (values) that are used in standardized ways across different research efforts, is particularly relevant in light of the new NIH data management and sharing policy. The promise of CDEs is that their use can facilitate comparisons between studies and simplify data aggregation and meta-analysis (Sheehan et al., 2016). To this end, NIH maintains a CDE repository to provide access to structured definitions of the data elements recommended by its institutes and centers as well as other organizations.
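A minimal sketch of the idea behind CDEs follows: each hypothetical variable is paired with its allowable values, and incoming records are checked against those definitions. The element names and values here are invented for illustration and are not drawn from the NIH CDE repository.

```python
# Hypothetical common data elements: each variable (question) is paired
# with the discrete set of values (responses) it is allowed to take.
COMMON_DATA_ELEMENTS = {
    "sex_assigned_at_birth": {"male", "female", "intersex", "prefer_not_to_answer"},
    "handedness": {"left", "right", "ambidextrous"},
    "smoking_status": {"never", "former", "current"},
}


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found when checking a single
    participant record against the defined data elements."""
    problems = []
    for element, allowed in COMMON_DATA_ELEMENTS.items():
        if element not in record:
            problems.append(f"missing element: {element}")
        elif record[element] not in allowed:
            problems.append(f"invalid value for {element}: {record[element]!r}")
    return problems


print(validate_record({"sex_assigned_at_birth": "female", "handedness": "both"}))
# ["invalid value for handedness: 'both'", "missing element: smoking_status"]
```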
Though they are not strictly standards, the FAIR (findability, accessibility, interoperability, and reusability) guiding principles (Mons et al., 2017; Wilkinson et al., 2016) also provide a starting point for working through considerations related to data management and sharing. Because the FAIR principles were developed to describe the characteristics that data-related infrastructure should adopt to facilitate data reuse, rather than the day-to-day practices of researchers, Table 2 outlines how the principles can inform data management practices implemented within a given research effort.
Table 2.

| FAIR Principle | For Infrastructure | For Researchers |
|---|---|---|
| Findable | Data and metadata should be easy to find by both humans and computers. In practice, this means that data should be assigned unique and persistent identifiers, described using rich metadata, and registered or indexed in a searchable resource. | Research teams should implement standardized practices related to organizing files (e.g., standardized file naming conventions) so data can be found when needed. When data is made available to others, it should—whenever possible—be uploaded to a repository that assigns a persistent identifier (e.g., DOIs, RRIDs) and describes data sets with standardized metadata. Complete and high-quality metadata should be added so data can be discovered and linked to related resources (e.g., related paper DOIs, author ORCIDs). |
| Accessible | There is a clearly defined method for accessing the data. Data should be retrievable by its identifier using a standardized protocol that is open, free, and universally implementable. Metadata should be accessible even when data is no longer available. | Data is available through a clearly defined process. Members of the research team should be able to access raw data, intermediate products, and other research materials. When data and other materials are made available to others, there should be a clear path to gaining access. The terms by which the data will be made available (e.g., to whom, when, and for what purpose) should be articulated and abided by. |
| Interoperable | Data should be usable across a range of applications and workflows. Data should use formal, accessible, shared, and broadly applicable models for knowledge representation. | Data should be structured in a standard way so it can be easily combined with other similarly structured data sets. In practice, this means implementing a range of practices such as describing and organizing data (e.g., applying appropriate metadata, maintaining data dictionaries) and saving files in open or nonproprietary file formats. |
| Reusable | Metadata and data should be described following relevant community standards and have clearly defined conditions for reuse, including a machine-readable license. | Data should be saved, organized, and described with its future (re)use in mind. A future user may be a member of the research team who is returning to the data after a period of several months or years, or another researcher who is (re)using the data for another purpose. |
Note. The FAIR principles were initially developed to apply to data-related infrastructure and emphasize machine actionability. However, they imply a number of data management–related activities and strategies that can be implemented by researchers in their day-to-day work. Additional information about choosing the appropriate repository for a data set can be found in Figure 2.
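As an informal illustration of the researcher-side practices in Table 2, the snippet below writes a small descriptive metadata record to accompany a deposited data set. The field names are placeholders loosely inspired by common repository metadata (title, creators with ORCIDs, identifier, license, related publication) and do not follow any specific schema.

```python
import json

# Placeholder descriptive metadata for a deposited data set.
# Field names are illustrative and not tied to a particular metadata schema.
metadata = {
    "title": "Stimulus-response experiment: cleaned trial-level data",
    "creators": [
        {"name": "Researcher, A.", "orcid": "https://orcid.org/0000-0000-0000-0000"}
    ],
    "identifier": "https://doi.org/10.xxxx/placeholder",  # assigned by the repository
    "related_publication": "https://doi.org/10.xxxx/placeholder-article",
    "license": "CC-BY-4.0",
    "keywords": ["reaction time", "behavioral experiment"],
    "description": "Trial-level responses from all participants; see README for details.",
}

with open("dataset_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```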
Data repositories play a key role in maintaining and promoting standards. BIDS is central to the OpenNeuro repository (Markiewicz et al., 2021), which facilitates the sharing and reuse of neuroimaging data. Similarly, the Inter-university Consortium for Political and Social Research (ICPSR) promotes the Data Documentation Initiative (DDI) (Vardigan et al., 2008) as a standard for survey data. Figure 2 outlines information for identifying an appropriate repository to deposit a given data set. A similar representation, which emphasizes the use of resources developed and implemented by relevant research communities, could also be used for identifying which data standards should be used to structure and describe a given data set.
Resources like FAIRsharing.org and the Registry of Research Data Repositories provide information related to the standards and repositories for specific data types and disciplinary communities. Similarly, the forthcoming NIH data policy (National Institutes of Health, 2020) includes a set of “desirable characteristics for all research repositories” (e.g., assigns persistent identifiers, has a plan for long-term sustainability) to inform decisions about platforms for managing and sharing data resulting from federally funded research.
In research areas for which standards or best practices have not been developed or widely adopted, discussion of data management may be relatively nascent. Even with guidance from a data librarian or other data management experts, developing a research team’s data management practices can be an overwhelming experience, but could begin with a conversation catalyzed by questions such as:
Would every member of the research team be able to find and use the data, code, documentation, and other materials related to this project?
Would another researcher who works in the same field be able to find and use the data, code, documentation, and other materials related to this project?
Ten years from now, would you or another researcher be able to find and use the data, code, documentation, and other materials related to this project?
These questions are drawn from a longer checklist, which can be found in full in the Supplementary Files of this document. The complete checklist was initially developed to accompany one-time data management workshops attended by researchers. The intent was for workshop attendees to use questions such as these as a starting point for discussing data management–related practices with their research groups. This approach has subsequently formed the basis of focus groups and survey instruments examining data management–related practices, perceptions, and needs within Stanford University’s School of Medicine. Answering in the affirmative to all three questions or even all the items on the checklist does not indicate that a given researcher or research group is engaged in optimal data management. Instead, the questions provide an opportunity to consider gaps in current practices.
For both individual research teams and the research enterprise more broadly, one of the most cited benefits of good data management is the foundational role that related practices play in establishing reproducibility. Efforts to address reproducibility are generally concerned with establishing the credibility, reliability, and validity of scientific research (S. N. Goodman et al., 2016; LeBel et al., 2018; Peng & Hicks, 2021; Peterson & Panofsky, 2021; Plesser, 2018; see also Devezer et al., 2021) and, as demonstrated in Table 1, address a wide array of issues related to study design, analytical practices, and the communication of research results.
Methods for addressing reproducibility-related concerns often center on researchers communicating the details of their research process. This includes both enhanced transparency for study-related decisions and procedures and appropriate sharing of research materials, including data sets. Lack of transparency has substantial consequences for reproducibility. The effect of the phenomenon shown in Figure 1, commonly described as ‘researcher degrees of freedom’ (Simmons et al., 2011) in reproducibility-related literature, has been demonstrated empirically by initiatives where multiple research teams independently examine the same data set (e.g., Botvinik-Nezer et al., 2020; Silberzahn et al., 2018). In the absence of extensive documentation, teams of experts may construct different workflows that lead to different results from the same data. This underlines the necessity of being able to trace the exact process used to get to a set of research results.
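One low-overhead way to make such analytical decisions traceable is to record them in a structured file saved alongside the results they produced. The sketch below writes a hypothetical set of analysis parameters to JSON; the parameter names and values are invented for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical record of the choices made for one analysis run.
analysis_decisions = {
    "run_at": datetime.now(timezone.utc).isoformat(),
    "input_data": "data/processed/all_participants_clean.csv",
    "exclusion_criteria": {"min_accuracy": 0.6, "max_missing_trials": 10},
    "model": "linear_mixed_effects",
    "covariates": ["age", "handedness"],
    "random_seed": 20220101,
    "notes": "Placeholder free-text field for decisions made during analysis.",
}

# Save the record next to the outputs it describes.
Path("results").mkdir(exist_ok=True)
with open("results/analysis_decisions.json", "w", encoding="utf-8") as f:
    json.dump(analysis_decisions, f, indent=2)
```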
While a full accounting of every practice proposed to address reproducibility over the course of a research effort is beyond the scope of this manuscript (see National Academies of Sciences, Engineering, and Medicine [NASEM], 2019), many of these practices overlap with those that fall under the umbrella of research data management and similarly involve careful consideration of issues throughout the research process. For example, instruments such as the Materials Design Analysis Reporting (MDAR) Checklist (Macleod et al., 2021) are intended to address reproducibility-related concerns by enabling unambiguous descriptions of what research materials were used (e.g., reagents, model organisms) and how data, code, and other elements of the research process can be accessed. While such checklists are typically completed at the time of publication, completing them requires researchers to have carefully documented their processes and materials throughout the research process.
Beyond checklists and other individual interventions, proper data management is essential to establishing an audit trail or record of the research process. In practice, many of the same activities and strategies that help researchers keep track of their data, code, materials, practices, and procedures during the day-to-day course of working on a research project also help to ensure that their processes and conclusions are reproducible.
For an individual research team, it may not be completely clear how to establish reproducibility, even internally. One starting point would be to pursue a narrow definition of the term, where a research effort is said to be reproducible if the same results are found when the same analytical pipeline is applied to the same data. Often called computational or analytic reproducibility (LeBel et al., 2018; Stodden et al., 2018), this can still be difficult in practice. In computationally intensive research, changes in software version and operating system can have measurable effects on research results (e.g., Gronenschild et al., 2012). Therefore, achieving computational reproducibility may require not just careful organization and documentation but also the application of tools such as software containers that include the code needed to reproduce the analysis, as well as the specific software version and operating system used (Grüning et al., 2018).
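Short of full containerization, a team might start by capturing a snapshot of the computational environment used to produce a result. The sketch below, which uses only the Python standard library, records the interpreter version, operating system, and installed package versions to a text file; it is a starting point for documentation, not a replacement for containers.

```python
import platform
import sys
from importlib import metadata

# Capture basic facts about the environment used to run an analysis.
lines = [
    f"python: {sys.version.split()[0]}",
    f"platform: {platform.platform()}",
    "packages:",
]

# Record the name and version of every installed distribution.
for dist in sorted(
    metadata.distributions(),
    key=lambda d: (d.metadata["Name"] or "").lower(),
):
    lines.append(f"  {dist.metadata['Name']}=={dist.version}")

with open("environment_snapshot.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(lines) + "\n")
```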
Establishing reproducibility within the research enterprise more broadly requires the adoption of a range of practices, many of which fall under the umbrella of open science (Munafò et al., 2017; NASEM, 2018, 2019). The term open science covers a variety of efforts focused on making scientific research more transparent and accessible. Though it is frequently used to refer to efforts aimed at ensuring access to tools and research products, open science also encompasses efforts to ensure that the scientific enterprise is inclusive and equitable. Such efforts are interrelated but, for the purposes of this review, we are focused primarily on openness for research objects (e.g., software, publications, protocols, data sets) (see Table 1). Related practices exist along a continuum, meaning that it is generally more accurate to describe a research effort’s degree of openness rather than categorizing it as simply ‘open’ or ‘closed.’
As of this writing, perhaps the clearest demonstration of the immense value of open science is the publication of the complete SARS-CoV-2 genome, first on the Virological discussion forum and subsequently on GenBank (Wu et al., 2020), which catalyzed efforts to create tests and interventions targeting the disease. The sequence of Moderna’s COVID-19 vaccine, which is based on this genome, was finalized just days after the initial posting (Moderna Therapeutics, 2021).
For individual researchers and teams, adopting open science practices has a number of potential benefits, including exposure to new tools and methods and streamlining the research process (Allen & Mehler, 2019; Lowndes et al., 2017). Returning to the example of an experiment where human participants respond to stimuli on a computer screen, the research team may develop open source tools to collect and analyze their data, publish detailed information about their protocols and analytical pipelines so they can be verified and built upon by others, post a preprint to quickly disseminate their findings, disseminate their work broadly through one of several routes to open access, or make data, code, stimuli, or other materials available through an open repository. Each of these practices has benefits for the research enterprise—open science can quicken the pace of research, and the reuse of shared materials has innumerable downstream benefits. However, depending on their needs and priorities, the benefits of open science may not be immediately apparent to a team in the process of conducting research.
Despite the benefits, open science can be difficult to put into practice. Implementing all the practices outlined in Table 1 effectively requires planning. For example, sharing data from human participants requires consideration of how to handle personally identifying information throughout the entire research process. Depending on the nature of the data, deidentification or anonymization may not be possible (e.g., Rocher et al., 2019), meaning that, while it still may be possible to provide access to the data to certain individuals under certain circumstances, it cannot be shared publicly.
As with data management, open data sharing requires consideration of issues that may be outside of a research team’s expertise. There may or may not be discipline- or data type–specific standards for exactly what data should be openly shared (i.e., raw data vs. processed data), what format it should be shared in, or how it should be licensed and made available. There is also a difference between data that are openly available and data that have been shared in a truly usable form. Absent guidance from data management experts, interventions targeted at making data and other materials available may not necessarily result in data and other materials being shared in a (re)usable form. Data sets may need to be made available alongside code, explanations, and other elements of the research process to be reproducible (Chen et al., 2019).
There has been extensive research into why researchers do and do not share their data openly; among the major themes that arise are a lack of the time and skills necessary to organize the data into a form suitable for sharing (see Perrier et al., 2020). This supports the notion that, for many research communities, data sharing and other open science–related activities may not be part of a research team’s regular workflow and may instead be seen as ‘extra work’ that is not rewarded by institutions or funders, or in the hiring or promotion process.
For researchers, librarians, and other data stakeholders, data management practices provide an entry point for promoting open science practices like data sharing. Data management can be positioned as a solution to immediate needs, such as ensuring that data is accessible and secure. But the same practices that help ensure data and other materials are usable by the research team as they are working with them also make the process of sharing them openly substantially more efficient. For example, outlining methods for maintaining internal documentation about study protocols provides an opportunity to promote sharing them through a tool like Protocols.io; discussing solutions for the long-term preservation of data and code provides an opportunity to promote publishing them in discipline-specific or generalist repositories; and providing guidance on how data and other research products should be cited in published literature provides an opportunity to promote persistent identifiers (e.g., DOIs) as well as the posting of preprints and open access to research articles.
Data management, reproducibility, and open science are interrelated. Any one of them can be an angle for promoting the others based on a research team’s existing workflow, priorities, and needs. On its own, implementing good data management practices is not sufficient to establish reproducibility. However, a set of research results cannot be efficiently examined or replicated if the underlying data, code, and other materials were not properly saved and organized, or if analysis-related procedures and decisions were incompletely documented. For a given experiment, it is also unlikely that all of the data, code, documentation, and other materials need to be shared. Depending on the nature of the research effort and the conclusions being drawn, it is possible that only the raw data, only the ‘final’ fully processed data set, or a selection of intermediate data products need to be shared to establish reproducibility. However, to establish a record of what was done, when, why, and by whom, all of it needs to be well managed and stewarded throughout the research process.
On their own, policies related to data management and data sharing have had mixed success in ensuring data is made available in a usable form (Couture et al., 2018; Federer et al., 2018; Parham et al., 2016). Even when data and other materials are ostensibly available, they may not in fact be ‘available upon request’ when requested (Stodden et al., 2018; Vines et al., 2014), not actually deposited in a repository (Danchev et al., 2021; Van Tuyl & Whitmire, 2016), or not shared in a usable or reproducible form (Hardwicke et al., 2018, 2021). Similarly, when presented in isolation, activities related to data management, reproducibility, and open science may not resonate with researchers who have different motivations and incentives.
Reproducibility and open science begin in the laboratory, with practices implemented by researchers during the day-to-day course of a research effort. Visualizations such as the research data lifecycle (Ball, 2012; Griffin et al., 2018) are often used to describe the breadth of data management activities or strategies, but situating data management instead as an integral part of a research workflow may help contextualize such practices and prevent them from being seen as just extra work to be done at a discrete point in the research process. Grounding such practices in a context that researchers are familiar with also provides an opportunity for promoting other practices that may involve issues that a research team may not be familiar with, including automated workflows, large-scale data reuse, and open source research infrastructure.
Both authors work broadly in the field of data management and sharing. A.E.V. is currently employed by the commercial company Figshare. Support from this employer was provided in the form of the author’s salary, but the employer has not influenced the development or content of this project nor the decision to publish this work.
Abrams, M. B., Bjaalie, J. G., Das, S., Egan, G. F., Ghosh, S. S., Goscinski, W. J., Grethe, J. S., Kotaleski, J. H., Ho, E. T. W., Kennedy, D. N., Lanyon, L. J., Leergaard, T. B., Mayberg, H. S., Milanesi, L., Mouček, R., Poline, J. B., Roy, P. K., Strother, S. C., Tang, T. B., … Martone, M. E. (2021). A standards organization for open and FAIR neuroscience: The International Neuroinformatics Coordinating Facility. Neuroinformatics. https://doi.org/10.1007/s12021-020-09509-0
Allen, C., & Mehler, D. M. A. (2019). Open science challenges, benefits and tips in early career and beyond. PLOS Biology, 17(5), Article e3000246. https://doi.org/10.1371/journal.pbio.3000246
Ball, A. (2012). Review of data management lifecycle models. https://researchportal.bath.ac.uk/en/publications/review-of-data-management-lifecycle-models
Borghi, J. A., Abrams, S., Lowenberg, D., Simms, S., & Chodacki, J. (2018). Support your data: A research data management guide for researchers. Research Ideas and Outcomes, 4, Article e26439. https://doi.org/10.3897/rio.4.e26439
Borghi, J. A., & Van Gulick, A. E. (2018). Data management and sharing in neuroimaging: Practices and perceptions of MRI researchers. PLOS ONE, 13(7), Article e0200562. https://doi.org/10.1371/journal.pone.0200562
Borghi, J. A., & Van Gulick, A. E. (2021). Data management and sharing: Practices and perceptions of psychology researchers. PLOS ONE, 16(5), Article e0252047. https://doi.org/10.1371/journal.pone.0252047
Botvinik-Nezer, R., Holzmeister, F., Camerer, C. F., Dreber, A., Huber, J., Johannesson, M., Kirchler, M., Iwanir, R., Mumford, J. A., Adcock, R. A., Avesani, P., Baczkowski, B. M., Bajracharya, A., Bakst, L., Ball, S., Barilari, M., Bault, N., Beaton, D., Beitner, J., … Schonberg, T. (2020). Variability in the analysis of a single neuroimaging dataset by many teams. Nature, 582(7810), 84–88. https://doi.org/10.1038/s41586-020-2314-9
Briney, K., Coates, H., & Goben, A. (2020). Foundational practices of research data management. Research Ideas and Outcomes, 6, Article e56508. https://doi.org/10.3897/rio.6.e56508
Briney, K., Goben, A., & Zilinski, L. (2015). Do you have an institutional data policy? A review of the current landscape of library data services and institutional data policies. Journal of Librarianship and Scholarly Communication, 3(2), Article eP1232. https://doi.org/10.7710/2162-3309.1232
Broman, K. W., & Woo, K. H. (2018). Data organization in spreadsheets. The American Statistician, 72(1), 2–10. https://doi.org/10.1080/00031305.2017.1375989
Chen, X., Dallmeier-Tiessen, S., Dasler, R., Feger, S., Fokianos, P., Gonzalez, J. B., Hirvonsalo, H., Kousidis, D., Lavasa, A., Mele, S., Rodriguez, D. R., Šimko, T., Smith, T., Trisovic, A., Trzcinska, A., Tsanaktsidis, I., Zimmermann, M., Cranmer, K., Heinrich, L., … Neubert, S. (2019). Open is not enough. Nature Physics, 15(2), 113–119. https://doi.org/10.1038/s41567-018-0342-2
Couture, J. L., Blake, R. E., McDonald, G., & Ward, C. L. (2018). A funder-imposed data publication requirement seldom inspired data sharing. PLOS ONE, 13(7), Article e0199789. https://doi.org/10.1371/journal.pone.0199789
Cox, A. M., Kennan, M. A., Lyon, L., Pinfield, S., & Sbaffi, L. (2019). Maturing research data services and the transformation of academic libraries. Journal of Documentation, 75(6), 1432–1462. https://doi.org/10.1108/JD-12-2018-0211
Danchev, V., Min, Y., Borghi, J. A., Baiocchi, M., & Ioannidis, J. P. A. (2021). Evaluation of data sharing after implementation of the International Committee of Medical Journal Editors Data Sharing Statement Requirement. JAMA Network Open, 4(1), Article e2033972. https://doi.org/10.1001/jamanetworkopen.2020.33972
Devezer, B., Navarro, D. J., Vandekerckhove, J., & Ozge Buzbas, E. (2021). The case for formal methodology in scientific reform. Royal Society Open Science, 8(3), Article rsos.200805. https://doi.org/10.1098/rsos.200805
Federer, L. M., Belter, C. W., Joubert, D. J., Livinski, A., Lu, Y.-L., Snyders, L. N., & Thompson, H. (2018). Data sharing in PLOS ONE: An analysis of data availability statements. PLOS ONE, 13(5), Article e0194768. https://doi.org/10.1371/journal.pone.0194768
Gaba, J. F., Siebert, M., Dupuy, A., Moher, D., & Naudet, F. (2020). Funders’ data-sharing policies in therapeutic research: A survey of commercial and non-commercial funders. PLOS ONE, 15(8), Article e0237464. https://doi.org/10.1371/journal.pone.0237464
Goodman, A., Pepe, A., Blocker, A. W., Borgman, C. L., Cranmer, K., Crosas, M., Di Stefano, R., Gil, Y., Groth, P., Hedstrom, M., Hogg, D. W., Kashyap, V., Mahabal, A., Siemiginowska, A., & Slavkovic, A. (2014). Ten simple rules for the care and feeding of scientific data. PLOS Computational Biology, 10(4), Article e1003542. https://doi.org/10.1371/journal.pcbi.1003542
Goodman, S. N., Fanelli, D., & Ioannidis, J. P. A. (2016). What does research reproducibility mean? Science Translational Medicine, 8(341), Article 341ps12. https://doi.org/10.1126/scitranslmed.aaf5027
Gorgolewski, K. J., Auer, T., Calhoun, V. D., Craddock, R. C., Das, S., Duff, E. P., Flandin, G., Ghosh, S. S., Glatard, T., Halchenko, Y. O., Handwerker, D. A., Hanke, M., Keator, D., Li, X., Michael, Z., Maumet, C., Nichols, B. N., Nichols, T. E., Pellman, J., … Poldrack, R. A. (2016). The brain imaging data structure, a format for organizing and describing outputs of neuroimaging experiments. Scientific Data, 3, Article 160044. https://doi.org/10.1038/sdata.2016.44
Gorgolewski, K. J., Alfaro-Almagro, F., Auer, T., Bellec, P., Capotă, M., Chakravarty, M. M., Churchill, N. W., Cohen, A. L., Craddock, R. C., Devenyi, G. A., Eklund, A., Esteban, O., Flandin, G., Ghosh, S. S., Guntupalli, J. S., Jenkinson, M., Keshavan, A., Kiar, G., Liem, F., … Poldrack, R. A. (2017). BIDS apps: Improving ease of use, accessibility, and reproducibility of neuroimaging data analysis methods. PLOS Computational Biology, 13(3), Article e1005209. https://doi.org/10.1371/journal.pcbi.1005209
Griffin, P. C., Khadake, J., LeMay, K. S., Lewis, S. E., Orchard, S., Pask, A., Pope, B., Roessner, U., Russell, K., Seemann, T., Treloar, A., Tyagi, S., Christiansen, J. H., Dayalan, S., Gladman, S., Hangartner, S. B., Hayden, H. L., Ho, W. W. H., Keeble-Gagnère, G., … Schneider, M. V. (2018). Best practice data life cycle approaches for the life sciences. F1000Research, 6, Article 1618. https://doi.org/10.12688/f1000research.12344.2
Gronenschild, E. H. B. M., Habets, P., Jacobs, H. I. L., Mengelers, R., Rozendaal, N., Os, J. van, & Marcelis, M. (2012). The effects of FreeSurfer version, workstation type, and Macintosh operating system version on anatomical volume and cortical thickness measurements. PLOS ONE, 7(6), Article e38234. https://doi.org/10.1371/journal.pone.0038234
Grüning, B., Chilton, J., Köster, J., Dale, R., Soranzo, N., van den Beek, M., Goecks, J., Backofen, R., Nekrutenko, A., & Taylor, J. (2018). Practical computational reproducibility in the life sciences. Cell Systems, 6(6), 631–635. https://doi.org/10.1016/j.cels.2018.03.014
Hardwicke, T. E., Bohn, M., MacDonald, K., Hembacher, E., Nuijten, M. B., Peloquin, B. N., deMayo, B. E., Long, B., Yoon, E. J., & Frank, M. C. (2021). Analytic reproducibility in articles receiving open data badges at the journal Psychological Science: An observational study. Royal Society Open Science, 8(1), Article 201494. https://doi.org/10.1098/rsos.201494
Hardwicke, T. E., Mathur, M. B., MacDonald, K., Nilsonne, G., Banks, G. C., Kidwell, M. C., Hofelich Mohr, A., Clayton, E., Yoon, E. J., Henry Tessler, M., Lenne, R. L., Altman, S., Long, B., & Frank, M. C. (2018). Data availability, reusability, and analytic reproducibility: Evaluating the impact of a mandatory open data policy at the journal Cognition. Royal Society Open Science, 5(8), Article 180448. https://doi.org/10.1098/rsos.180448
LeBel, E. P., McCarthy, R. J., Earp, B. D., Elson, M., & Vanpaemel, W. (2018). A unified framework to quantify the credibility of scientific findings. Advances in Methods and Practices in Psychological Science, 1(3), 389–402. https://doi.org/10.1177/2515245918787489
Lowndes, J. S. S., Best, B. D., Scarborough, C., Afflerbach, J. C., Frazier, M. R., O’Hara, C. C., Jiang, N., & Halpern, B. S. (2017). Our path to better science in less time using open data science tools. Nature Ecology & Evolution, 1(6), 1–7. https://doi.org/10.1038/s41559-017-0160
Macleod, M., Collings, A. M., Graf, C., Kiermer, V., Mellor, D., Swaminathan, S., Sweet, D., & Vinson, V. (2021). The MDAR (Materials Design Analysis Reporting) Framework for transparent reporting in the life sciences. Proceedings of the National Academy of Sciences, 118(17), Article e2103238118. https://doi.org/10.1073/pnas.2103238118
Markiewicz, C. J., Gorgolewski, K. J., Feingold, F., Blair, R., Halchenko, Y. O., Miller, E., Hardcastle, N., Wexler, J., Esteban, O., Goncavles, M., Jwa, A., & Poldrack, R. (2021). The OpenNeuro resource for sharing of neuroscience data. ELife, 10, Article e71774. https://doi.org/10.7554/eLife.71774
Moderna Therapeutics. (2021). Moderna’s work on our COVID-19 vaccine. Retrieved August 22, 2021, from https://www.modernatx.com/modernas-work-potential-vaccine-against-covid-19
Mons, B., Neylon, C., Velterop, J., Dumontier, M., da Silva Santos, L. O. B., & Wilkinson, M. D. (2017). Cloudy, increasingly FAIR; revisiting the FAIR Data guiding principles for the European Open Science Cloud. Information Services & Use, 37(1), 49–56. https://doi.org/10.3233/ISU-170824
Munafò, M. R., Nosek, B. A., Bishop, D. V. M., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. P. A. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1(1), 1–9. https://doi.org/10.1038/s41562-016-0021
National Academies of Sciences, Engineering, and Medicine. (2018). Open science by design: Realizing a vision for 21st century research. The National Academies Press. https://doi.org/10.17226/25116
National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and replicability in science. The National Academies Press. https://doi.org/10.17226/25303
National Institutes of Health. (2020). Final NIH policy for data management and sharing. Retrieved August 22, 2021, from https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html
Parham, S. W., Carlson, J., Hswe, P., Westra, B., & Whitmire, A. (2016). Using data management plans to explore variability in research data management practices Across Domains. International Journal of Digital Curation, 11(1), 53–67. https://doi.org/10.2218/ijdc.v11i1.423
Peng, R. D., & Hicks, S. C. (2021). Reproducible research: A retrospective. Annual Review of Public Health, 42(1), 79–93. https://doi.org/10.1146/annurev-publhealth-012420-105110
Perrier, L., Blondal, E., & MacDonald, H. (2020). The views, perspectives, and experiences of academic researchers with data sharing and reuse: A meta-synthesis. PLOS ONE, 15(2), Article e0229182. https://doi.org/10.1371/journal.pone.0229182
Peterson, D., & Panofsky, A. (2021). Self-correction in science: The diagnostic and integrative motives for replication. Social Studies of Science, 51(4), 583–605. https://doi.org/10.1177/03063127211005551
Plesser, H. E. (2018). Reproducibility vs. replicability: A brief history of a confused terminology. Frontiers in Neuroinformatics, 11, Article 76. https://doi.org/10.3389/fninf.2017.00076
Rocher, L., Hendrickx, J. M., & de Montjoye, Y.-A. (2019). Estimating the success of re-identifications in incomplete datasets using generative models. Nature Communications, 10(1), Article 3069. https://doi.org/10.1038/s41467-019-10933-3
Sheehan, J., Hirschfeld, S., Foster, E., Ghitza, U., Goetz, K., Karpinski, J., Lang, L., Moser, R. P., Odenkirchen, J., Reeves, D., Rubinstein, Y., Werner, E., & Huerta, M. (2016). Improving the value of clinical research through the use of Common Data Elements. Clinical Trials, 13(6), 671–676. https://doi.org/10.1177/1740774516653238
Silberzahn, R., Uhlmann, E. L., Martin, D. P., Anselmi, P., Aust, F., Awtrey, E., Bahník, Š., Bai, F., Bannard, C., Bonnier, E., Carlsson, R., Cheung, F., Christensen, G., Clay, R., Craig, M. A., Dalla Rosa, A., Dam, L., Evans, M. H., Flores Cervantes, I., … Nosek, B. A. (2018). Many analysts, one data set: Making transparent how variations in analytic choices affect results. Advances in Methods and Practices in Psychological Science, 1(3), 337–356. https://doi.org/10.1177/2515245917747646
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359–1366. https://doi.org/10.1177/0956797611417632
Stanford University. (1997). Retention of and access to research data. Retrieved August 22, 2021, from https://doresearch.stanford.edu/policies/research-policy-handbook/conduct-research/retention-and-access-research-data
Stodden, V., Seiler, J., & Ma, Z. (2018). An empirical analysis of journal policy effectiveness for computational reproducibility. Proceedings of the National Academy of Sciences, 115(11), 2584–2589. https://doi.org/10.1073/pnas.1708290115
Stoudt, S., Vásquez, V. N., & Martinez, C. C. (2021). Principles for data analysis workflows. PLOS Computational Biology, 17(3), Article e1008770. https://doi.org/10.1371/journal.pcbi.1008770
Tang, R., & Hu, Z. (2019). Providing Research Data Management (RDM) services in libraries: Preparedness, roles, challenges, and training for RDM practice. Data and Information Management, 3(2), 84–101. https://doi.org/10.2478/dim-2019-0009
Tenopir, C., Allard, S., Sinha, P., Pollock, D., Newman, J., Dalton, E., Frame, M., & Baird, L. (2016). Data management education from the perspective of science educators. International Journal of Digital Curation, 11(1), 232–251. https://doi.org/10.2218/ijdc.v11i1.389
Tenopir, C., Sandusky, R. J., Allard, S., & Birch, B. (2014). Research data management services in academic research libraries and perceptions of librarians. Library & Information Science Research, 36(2), 84–90. https://doi.org/10.1016/j.lisr.2013.11.003
Van Tuyl, S., & Whitmire, A. L. (2016). Water, water, everywhere: Defining and assessing data sharing in academia. PLOS ONE, 11(2), Article e0147942. https://doi.org/10.1371/journal.pone.0147942
Vardigan, M., Heus, P., & Thomas, W. (2008). Data Documentation Initiative: Toward a standard for the social sciences. International Journal of Digital Curation, 3(1), 107–113. https://doi.org/10.2218/ijdc.v3i1.45
Vines, T. H., Albert, A. Y. K., Andrew, R. L., Débarre, F., Bock, D. G., Franklin, M. T., Gilbert, K. J., Moore, J.-S., Renaut, S., & Rennison, D. J. (2014). The availability of research data declines rapidly with article age. Current Biology, 24(1), 94–97. https://doi.org/10.1016/j.cub.2013.11.014
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), Article 160018. https://doi.org/10.1038/sdata.2016.18
Wilson, G., Bryan, J., Cranston, K., Kitzes, J., Nederbragt, L., & Teal, T. K. (2017). Good enough practices in scientific computing. PLOS Computational Biology, 13(6), Article e1005510. https://doi.org/10.1371/journal.pcbi.1005510
Wu, F., Zhao, S., Yu, B., Chen, Y.-M., Wang, W., Hu, Y., Song, Z.-G., Tao, Z.-W., Tian, J.-H., Pei, Y.-Y., Yuan, M. L., Zhang, Y.-L., Dai, F.-H., Liu, Y., Wang, Q.-M., Zheng, J.-J., Xu, L., Holmes, E. C. & Zhang, Y.-Z. (2020). Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome (MN908947) [Dataset]. Genbank. https://www.ncbi.nlm.nih.gov/nuccore/MN908947
The included data management checklist is intended as a starting point for groups looking to realize the benefits of research data management from the laboratory side. The guide is designed to be easily customizable and extensible, so it is likely that practices specific to particular research communities or data types are not included. Because of the close relationship among the three, going through a checklist focused on data management provides an opportunity to promote activities related to reproducibility and open science.
©2022 John A. Borghi and Ana Elizabeth Van Gulick. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.