The U.S. National Institutes of Health (NIH) released a broad new policy on data management and sharing on October 29, 2020, mandating that all data from NIH-funded or conducted research be managed and that most be shared. This policy requires the submission of data management and sharing plans in all grant applications consistent with good data management practices and sets expectations for sharing of scientific data generated from NIH-funded or conducted research. The policy recognizes the benefits and opportunities in routine and effective data sharing, and therefore seeks to drive a shift in attitudes and practices across biomedicine in the handling, sharing, reuse, and valuation of biomedical data. In anticipation of the implementation of this policy on January 25, 2023, the NIH sponsored the U.S. National Academies of Sciences, Engineering, and Medicine’s Health and Medicine Division to hold a public workshop on “Changing the Culture of Data Management and Sharing.” The workshop was held virtually on April 28 and 29, 2021, and explored the issue from many angles over the 2 days, hearing from more than 30 multidisciplinary speakers on what is needed to make this policy impactful and biomedicine ready for this shift. Here we summarize the workshop, covering each of the sessions and the major themes that emerged over the course of it.
Keywords: Open Science, Data Sharing, NIH Data Management and Sharing Policy, Data Management Plan
In anticipation of the new Data Management and Sharing Policy of the U.S. National Institutes of Health (NIH; Office of the Director, National Institutes of Health, 2020), a virtual workshop was held by the National Academies of Sciences, Engineering, and Medicine on April 28 and 29, 2021, with the title: “Changing the Culture of Data Management and Sharing.” The workshop was sponsored by the NIH to hear from stakeholders across biomedicine on how to prepare for implementation of the policy and to explore what is needed to foster a culture shift toward broader and routine sharing of scientific data. A focus of the meeting was assessing how ready the biomedical community is for this shift in terms of attitudes, infrastructure, training, and compliance enforcement.
The workshop explored the question from many angles over the 2 days, hearing from more than 30 multidisciplinary speakers, primarily, but not exclusively, from the United States. The tone adopted was a positive one, moving away from mere enforcement of policies and toward the benefits that can accrue to the researcher and society from this culture shift. Nevertheless, participants were clear-eyed in their understanding of the hurdles that must be overcome. Biomedicine does not yet have a culture that values data; our current incentive and reward systems are still based heavily on narrative works. The basic infrastructure in the form of data repositories for housing public data is there, but has not been tested at scale. Data management in laboratories is often ad hoc, and the expertise required for effective data management and sharing is scattered across multiple stakeholders with no central coordination. To shift the culture requires not just personnel, tools, and infrastructure but metrics, credit systems, and respect for the importance of public data and the methods employed for its interpretation. Responsible and impactful data sharing was viewed as a partnership across the research ecosystem, and the community was invited to participate in this effort.
Here we provide an overview of the main discussion points and key takeaways for each workshop session, followed by an overall synthesis of the extremely rich set of presentations and discussions that took place. The summaries are not meant to be exhaustive but to highlight salient points toward the overarching goals of the workshop. All workshop slides and links to talks on YouTube are available from the conference website (National Academies of Sciences, Engineering, and Medicine, 2021).
The workshop opened with several introductory presentations to set the stage for the next 2 days. Dr. Lyric Jorgensen, Acting Associate Director of the NIH Office of Science Policy, presented the goals of the policy: to maximize data collection, foster data stewardship, and accelerate the research enterprise (NASEM Health and Medicine Division, 2021a). These goals are rooted in the benefits of data sharing, including sparking new collaborations, enhancing rigorous study design, making high-value data sets available, enabling unique combinations of data, facilitating study validation, and stimulating new research inquiries. Dr. Jorgensen noted that the NIH is committed to data management and sharing, recognizing the importance of good data stewardship across the data lifecycle, and she emphasized that the policy will need to be flexible to adapt to the breadth of biomedicine and changing requirements. The NIH also understands that implementing the policy will cost money, and it plans to invest in both the infrastructure and staffing required.
The keynote was provided by Dr. Patricia Brennan, Director of the National Library of Medicine, entitled: “What Has the COVID-19 Pandemic Taught Us About Data Sharing and Open Science?” (NASEM Health and Medicine Division, 2021b). She started by stating what would become an overall theme of the workshop: data management and sharing is a team sport. Dr. Brennan covered NIH’s history and programs in data sharing and the impact of open science on speeding up COVID research, culminating in the success of the vaccines. She emphasized that the current policy was not in response to the pandemic but that data sharing and open science efforts have been underway for more than 25 years. Such investments, including modernizing data infrastructure by moving many platforms to the cloud, investing in data repositories, commitment to common data elements, and updating consent practices, served the community well during the pandemic because they allowed a rapid response. Existing platforms were adapted and new platforms launched for making data available, many in response to the pandemic. One of the most productive areas of medicine has been molecular biology, made possible by the near-universal sharing of genomic data.
Three questions regarding the policy and its larger impacts were considered in a panel, “Perspectives on Data Management and Sharing Across Different Types of Data” (NASEM Health and Medicine Division, 2021c), featuring Dr. Atul Butte, University of California, San Francisco; Dr. Lara Mangravite, Sage Bionetworks; Dr. Alexander Ropelewski, University of Pittsburgh; and Dr. Joshua Wallach, Yale School of Public Health: 1) If data sharing is the means, what is the end goal? 2) What do you see as the biggest challenges and how can they be mitigated? 3) What will successful data sharing look like? How will it be measured?
The panelists agreed that significant benefits have already been demonstrated for data sharing, particularly for large prospectively gathered data sets, which have had a significant impact on discovery science and other downstream uses. The impact and mechanisms for sharing so-called long-tailed data are not quite as clear, although there are clear benefits in aggregating across data sets, particularly when no one group can generate enough data for statistical power, for example, in the COVID pandemic and rare diseases. The benefits of routine sharing in support of transparency and reproducibility are also clear. Data sharing can be even more powerful if FAIR (findable, accessible, interoperable, reusable; Wilkinson et al., 2016) principles take hold.
The impact of the size and variety of biomedical data in implementing the new policy was considered by Dr. Ropelewski from the Brain Image Library and Dr. Wallach (NASEM Health and Medicine Division, 2021c). Dr. Ropelewski discussed the current and likely future challenges of massive microscopy imaging data sets, like those generated by the U.S. BRAIN Initiative, each currently comprising tens of terabytes or more. Microscopy facilities are currently not configured to move this amount of data, due to both infrastructure limitations and local policies. Storage costs and accessibility are also likely to present a challenge. Dr. Wallach provided perspectives on the sharing of observational studies. Flexibility in the policy is needed, as over time we will need to settle on what actually can be shared. For certain types of observational studies, for example, those involving claims data, the researcher may not own the data, so what does data sharing mean in that case?
What does success look like? Panelists agreed that success had to be more than just how many clicks a data set receives. Digital object identifiers (DOIs) and the capacity for citation are important, but they are lagging indicators of downstream impact. The real question is whether downstream innovation is happening from these data sets, but finding metrics for measuring this impact may be difficult, particularly because impact may take a long time to manifest. Metrics and citations for data that are continuously updated and that are made available piecemeal, as is the case with certain NIH projects, pose a challenge. The panelists expressed a desire that success should not be measured by just checking a box or fulfilling a mandate but by a true culture change where sharing of data is viewed as important.
Moderator: Elaine Mardis, Nationwide Children’s Hospital
Speakers: Adam Ferguson, University of California, San Francisco; David Haussler, University of California, Santa Cruz; Rebecca Koskela, Research Data Alliance; Russell Poldrack, Stanford University; Jeremy Wolfe, Brigham and Women’s Hospital, Harvard Medical School
“The most important issue in data sharing is respect—respect for those who make the science possible by contributing data. But that respect has to be earned with an equal amount of generosity.” – David Haussler
This session covered the issues of data management and sharing largely from the perspective of those building and fielding infrastructures for data sharing, particularly the public data repositories. Repositories are seen as the critical pieces of infrastructure required for the policy to be successful. Nonetheless, as noted by Dr. Rebecca Koskela, most existing repositories have arisen organically without a common set of operational or design principles (NASEM Health and Medicine Division, 2021d). Certifications like the CoreTrustSeal (n.d.) are providing some basic frameworks for operating these repositories, but perhaps these frameworks have to be extended to include FAIR principles. Although a lot of focus is on the large government databases or generic repositories run by nonprofits or commercial entities, Dr. Russell Poldrack highlighted the importance of researcher-led infrastructures in the ecosystem (NASEM Health and Medicine Division, 2021e). Researcher-led infrastructures have the flexibility to do things that federal agencies cannot, due to the higher regulatory burden, but sustainability is a key challenge. It is hard to run a repository on 3-year grant cycles. However, implementation of processes such as data validators in the upload process helps shift the costs of curation to the data submitter, reducing costs for the data repository. Dr. Koskela presented evidence that policies allowing submitters to recover costs for preparing data for a repository helped to mitigate researcher concerns (NASEM Health and Medicine Division, 2021d). This session also highlighted that researchers benefit from submitting data to a repository: the data are more FAIR, ensuring that the submitter and the originating lab can find and access them, and many repositories have curation services that improve the data submitted.
The participants highlighted the scope of data with which these repositories must contend. Biomedicine has long supported large government-run databases for genomics data and a few well-recognized repositories for other data types, where data sharing has been well established. But as pointed out by Dr. Adam Ferguson, of the ‘V’s’ of big data (volume, velocity, and variety), variety may present the greatest challenge in trying to field infrastructures for the totality of biomedicine (NASEM Health and Medicine Division, 2021f). Specialized repositories can enforce standards, provide curation services, and provide tools to work with particular types of data. Dr. Jeremy Wolfe presented perspectives on behalf of the basic experimental studies with humans (NASEM Health and Medicine Division, 2021g), now referred to as BESH studies. The structure of BESH studies does not necessarily fit with current generic reporting structures like ClinicalTrials.gov, highlighting the problem that variety will pose for the new policies.
Repositories for certain types of data will also likely need to adopt new architectures as the size of data and security requirements will pose problems for the current model, whereby data is downloaded from the repository to local infrastructure. Dr. David Haussler used the term “data visiting” for bringing compute to the data rather than the data to compute (NASEM Health and Medicine Division, 2021h). Such models have been successfully implemented in the Global Alliance for Genomics and Health (n.d.) and can help to minimize privacy and security concerns with sharing patient-level data. The proliferation of repositories for similar data types can also create difficulties for effective use of data. Dr. Haussler noted that a lot of genomic data is still held in silos that make it difficult to compare genomes.
Data repositories are key infrastructures for making the NIH Data Management and Sharing Policy work and provide significant benefits to researchers.
Repositories will need to align to a set of standard capabilities while at the same time exploring new architectures to maximize reuse and security.
Specialized repositories provide many benefits, moving beyond static archives to essential research tools. However, the BESH example shows that a combination of specialist repositories will be necessary, requiring standards for coordination across the ecosystem.
Long-term sustainability of repositories remains an issue and is tied to meaningful metrics that can determine their use and impact.
Submitting data to a repository brings benefits to the submitter provided that they can trust that the data is persistent and protected.
Funders’ policies can help mitigate researchers' concerns about sharing.
Objectives: To examine possible approaches for monitoring and measuring the success of data sharing and management across different types of scientific data.
Moderator: Wouter Haak, Elsevier
“Data citation: let’s choose adoption over perfection.” – Daniella Lowenberg
Session II provided perspectives on data metrics, data management plans, and training. Despite the significant work that has gone into establishing a framework for data citation and other metrics, Daniella Lowenberg noted that we do not yet have a stable system (NASEM Health and Medicine Division, 2021i). The practice of data citation hinges on a persistent identifier, preferably the DOI, and a set of practices for the format and location of these citations. Over 5 million DOIs have been registered for data sets in DataCite, yet citations to only 7,000 of these have been detected by Crossref, according to an analysis performed just before the conference. The vast majority of data citations are not accessible, either because researchers do not include them or because journals format them in a way that prevents them from being detected. Although DOIs have been pushed for citation of data sets because the infrastructure was already developed, the major stakeholders (publishers, editors, typesetters, and authors) have not gotten behind data citation implementations.
The use of persistent identifiers beyond DOIs to gain an understanding of the current state of data sharing and for future tracking was also discussed. The consistent use of persistent identifiers is one of the key pillars of FAIR principles and makes tracking data and metadata for analysis in multiple contexts much easier. Currently, institutions will encounter considerable difficulty if, for example, they want to understand how many data sets are being shared by their faculty. According to Dr. Albert Zigoni, more than 90% of the data are published outside of an institution in public repositories, but less than 10% of data sets in public repositories have even a single affiliation listed and less than 1% have a persistent identifier for institutions (NASEM Health and Medicine Division, 2021j). The Dryad repository has recently implemented the use of Research Organization Registry (ROR) IDs, persistent identifiers for research institutions that make keeping track of data sharing per institution much easier.
Tracking systems are key to the development of any incentive or compliance systems. Without effective tracking mechanisms, no incentive system can be built. However, speakers emphasized in the Q & A session (NASEM Health and Medicine Division, 2021k) that the job of infrastructure providers is to provide data on data use; they are not the ones that will determine how these data are used for any incentive system that might develop.
This session also focused on data management plans (DMPs). DMPs are already required by some funders, but their evaluation and bearing on subsequent success of the proposal are not uniform. Researchers are often confused about what should be in a plan. Some tools are available to help in constructing a DMP, but they are not well adapted to the diversity of research methods involved in producing data. Dr. Robert Hanisch acknowledged that creation of a DMP is “something that researchers don’t like to do” and emphasized that more guidance and training will be needed to make them effective (NASEM Health and Medicine Division, 2021l).
The issue of who will be the arbiters of the data management and sharing plans (DMSPs) required by the new policy was considered in the discussion session (NASEM Health and Medicine Division, 2021m). The current policy states that NIH program staff rather than reviewers will determine the adequacy of the DMSPs. Concerns were expressed about whether the NIH would have the necessary capacity, domain knowledge, and technical expertise. Dr. Hanisch also emphasized that having peers rather than program staff review DMSPs would provide community feedback on what was considered acceptable and help make DMSPs normative.
The critical issue of training and the role of libraries was addressed in a presentation by Dr. Elaine Martin (NASEM Health and Medicine Division, 2021n). Data librarians should be viewed as partners in the research data ecosystem and can help researchers across the data lifecycle, from planning to analysis to publishing. Many libraries have research data infrastructure and services for their constituents, but also partner with IT and other campus services to help provide third-party products that can help researchers, for example, protocols.io and electronic lab notebooks. Dr. Martin highlighted the Research Data Management Librarian Academy (RDMLA; n.d.), a free online training academy for librarians and information professionals. RDMLA launched a module specific for the NIH data policy in summer of 2021.
The ability to track data products and their use will be critical for developing any incentive systems around data sharing and reuse and to measure the impact of these policies.
Although the foundations for metrics have been laid, the systems are not routinely used. A concerted effort is required to make data citation functional and normative across the necessary stakeholder groups.
Current materials and training around DMSPs need to provide more support for researchers in making decisions about effective practices.
Researchers should make use of data librarians and institutional infrastructure throughout the data lifecycle.
Moderators: Mark Hahnel, Figshare, and Daniela Witten, University of Washington
Speakers: Timothy Coetzee, National Multiple Sclerosis Society; Scott Fraser, University of Southern California; Rick Gilmore, Penn State University; Carole Goble, University of Manchester/ELIXIR-UK; Sarah Nusser, Iowa State University; Letisha Wyatt, Oregon Health & Science University
“This is not going to be driven forward by a mandate, it’s not going to be something like your mother saying ‘Eat your Brussels sprouts because they’re good for you.’ It’s going to have to be a situation like we have where restaurants make amazing Brussels sprouts that people want to eat.” – Scott Fraser
This session featured a panel discussion on what was needed to ensure that there will be uptake of the new policy (NASEM Health and Medicine Division, 2021o). Perspectives in this session focused on the laboratory, considering the tools, infrastructure, training, and education needed by the research community to implement effective data management and sharing. Several of the panelists emphasized that the new policy requires us to think through the entire data lifecycle and that doing so can realize significant benefits for researchers in the laboratory. Dr. Scott Fraser emphasized that we have to move our infrastructures from the current emphasis on ‘data banking,’ that is, only paying attention to what happens to data at the end of the study, to supporting the entire process from start to finish. The required infrastructures will not just support the data, but the workflows and ‘bundle of research objects’—protocols, code, and so on, that need to be managed with them.
A key point was that the unit of sharing needs to move from the individual to the laboratory for these benefits to be realized. If we can share data in the lab, then there is something worth sharing outside of the lab. Professor Carole Goble from ELIXIR, the European Union’s Research Infrastructure for Life Science Data, echoed this sentiment, and stated that data management and sharing infrastructures need to support incremental data sharing, from sharing within the lab, to trusted groups of colleagues, and finally to the public at large. Tools such as the ELIXIR Research Data Management (RDM) Toolkit, written by researchers and data stewards in the language of the researchers and with actual examples, can help to build a blueprint for different types of data and products and share digital research product know-how.
We still have a way to go in establishing rigorous and effective data sharing practices in the laboratory. Dr. Sarah Nusser noted that right now, researchers take an ad hoc approach to learning the necessary skills rather than a purposeful approach. She encouraged a phased approach to skills development that prioritizes benefits to the data producer while reducing the burden, for example, by helping researchers focus more strongly on proactively planning on how their data will be captured and processed using workflows. Dr. Rick Gilmore considered the question of what the modern laboratory needs to make it easier to implement this policy, reducing the gap between the protocols that we execute in the laboratory and the necessary data management and sharing components. Online web services such as protocols.io are available, as are computational workbooks like Jupyter Notebooks. In the “PLAY: Play & Learning Across a Year” project, the PLAY team uses video to capture the research processes. Video is readily accessible and may be applicable across many areas of biomedicine. Databrary is a data management platform that is specialized for storing video, but there are many mechanisms on the World Wide Web that can serve to make videos readily available.
Dr. Letisha Wyatt commented on the necessity of training graduate students and junior faculty, particularly with respect to lab culture. The tone of the laboratory with regard to data management and sharing is set by the principal investigator (PI), who communicates expectations and procedures. The question is whether PIs fully understand the scope of all that is required for effective data management. Few formal training courses for credit exist on research data methods, including data management. Rather, trainees are expected to absorb good practices passively or else be responsible for learning the skills on their own. Effective data management and sharing would be enhanced by a series of courses that span the entirety of graduate education, so students can both learn effective practices and receive support for applying them as they work on their dissertations. Formal training early on will ensure that these practices are embedded in research practice throughout their careers. This also may reduce the possibility of bias in the transfer of information within labs and departments.
Training is important, but we also have to start viewing data management as a discipline unto itself and start building a professional workforce that can support researchers, much like we have data scientists and statisticians available. Not all researchers can be data experts. Professor Goble noted that all institutions in the United Kingdom and much of Europe now have pools of research software engineers that are available to researchers. This pool is possible because research software engineer (RSE) has become a recognized job title. We can do something similar with data stewards so that researchers can buy into a pool rather than foot the full cost. The United States has established the US Research Software Engineer Association (n.d.).
One of the concerns of funders was that researchers were going to do the minimum to comply with policies and would not think about downstream impact without incentives. The panel considered ideas such as cash prizes for data sharing, development of promotion and tenure criteria around data, and seeding committees with champions for open science. However, we have to remind ourselves that benefits cannot be thought of strictly in terms of researchers; we must also consider benefits to those that research is meant to serve.
While the issue of incentives came up repeatedly across all workshop sessions, Dr. Tim Coetzee noted that to develop and enforce the norms around data management and sharing, focusing on incentives may not be enough. We may also need punitive tools for those who do not share. Funders have been reluctant to wield these tools, but without that enforcement, Dr. Coetzee does not think we will be successful. Nevertheless, the panelists emphasized that implementation of this policy cannot be an unfunded mandate. Sufficient resources must be allocated in order for these activities to be seen as important.
The panel also considered the issue of costs and resources. Data management and sharing requires a stream of money. Where will it come from? Libraries can help provide a source of stable support for helping with data sharing. Dr. Gilmore also suggested having a fund set aside, for example, 10% for data management and sharing that is given as a reward for demonstrable effective sharing.
For the policy to be successful, data management and sharing need to infuse the entire study lifecycle and not be seen as something that only happens at the end of a study.
The way to get data worth sharing is to focus on the laboratory as the unit of data sharing.
Norms for data sharing need to be supported by both significant monetary and human resources.
Training for data management and sharing is not a one-off endeavor directed at PIs, but should encompass formal, relevant, structured, and mentor-trained opportunities for both PIs and graduate students.
Tools and training will need to be accompanied by a professionalized workforce of research software engineers and data stewards to support researchers as part of collaborative teams.
Emphasizing benefits to both researchers and society at large requires that examples of impact be well known.
Carrots are not enough. Oversight and compliance checking will be needed for the policy to be effective, and successful implementation will therefore require the willingness to wield punitive tools as well as rewards.
Moderator: Christine Borgman, University of California Los Angeles
Speakers: John Borghi, Stanford University; Ana Van Gulick, Figshare; Rafael Irizarry, Harvard University; Daniel Goroff, Alfred P. Sloan Foundation; Irene Pasquetto, University of Michigan; Jan Bjaalie, University of Oslo
“Many scientists still think of data sharing as a form of tax on their system in terms of extra effort and costs” – Richard Nakamura
This session focused on the value of data management and data sharing along with issues involved in reuse. There was general agreement that although the value of data sharing to science in general is acknowledged, the benefits to the individual researcher are not yet entirely clear. It was also clear that although many researchers feel that they are managing data within the laboratory, researchers generally do not make a distinction between ‘here’s how we do it in our lab’ as opposed to research data management that is effective for long-term reuse of data (Drs. John Borghi and Ana Van Gulick; NASEM Health and Medicine Division, 2021p). Dr. Jan Bjaalie observed that there are two types of researchers: 1) Those who do not invest in data management in the laboratory who tend to publish a lot of papers, and 2) Those who do invest in data management who build a foundation of shared data in the lab that can be built upon (NASEM Health and Medicine Division, 2021q). He noted that the latter may slow down production of papers in the short term, but ultimately allows a researcher to leverage old data in new studies, thereby increasing productivity. Sharing data through open repositories also provides a similar benefit. Students in his lab who share their data in open repositories like EBRAINS, developed by the Human Brain Project, not only get a data citation, but they are able to retrieve their data from the open repository when they wish. Data in open repositories may have also undergone additional curation, making the metadata and associated documents more organized and therefore useful.
The issue of what to share, however, still causes some confusion among researchers. While researchers understand that they need to manage different research products arising from a study (for example, data, software, documentation, and consent forms), they are not always clear on what needs to be shared. What should be shared likely will differ between fields, but for applying analytic workflows, Dr. Rafael Irizarry noted that raw data is usually necessary (NASEM Health and Medicine Division, 2021r). Consultation with data analysts to determine what should be shared is one way to provide clarity.
Dr. Daniel Goroff addressed the issue of data reuse focusing on privacy vs. data validity concerns for human subject data, contending that it is impossible to ensure that data will not be reidentified (NASEM Health and Medicine Division, 2021s). He introduced an economic framework for understanding the trade-off between privacy and accuracy around the question of whether human data is the ‘new oil’ or a public good. Economists view oil as a rivalrous good, in that once you use it up, then no one else can use it. A public good is nonrivalrous and is not used up when someone else uses it. While we typically think of data as a nonrivalrous resource in that it can be used over and over again, in the area of privacy and data validity, the situation is more complex. Dr. Goroff argues that while we do not use up data, we do use up privacy and validity every time we query the data.
If data is reused, then what is it reused for? Dr. Irene Pasquetto presented her research on patterns of data reuse, making a distinction between ‘comparative data reuse’ and ‘integrative data reuse’ (NASEM Health and Medicine Division, 2021t). The former involves use of data for calibrating, comparing, or confirming one’s own data, while the latter involves secondary analysis for identifying new patterns, correlations, and causal relationships. In earlier sessions, we heard examples of the latter being used to drive new discoveries in genomics (Dr. Wallach; NASEM Health and Medicine Division, 2021c) and spinal cord injury (Dr. Ferguson; NASEM Health and Medicine Division, 2021f). While a majority of researchers think of integrative reuse when asked about the benefits of data sharing, in a study Dr. Pasquetto performed on an NIH repository for long-tail craniofacial data, the majority of reuse fell into the category of ‘comparative data reuse.’ For long-tail data, specialized expertise and significant resources are often needed to reuse others’ data, leading to a ‘data creator’s advantage.’ This advantage leads to a situation where it is easier to reuse data in collaboration with the creator than without them, putting the creator in a position to bargain with potential users for authorship or other forms of collaboration.
The discussion delved into the nature of data reuse and how it is defined (NASEM Health and Medicine Division, 2021u). Certainly, the holy grail of data sharing is by a third party for a use for which it was not originally intended. Dr. Van Gulick wondered whether it was fair to think of it so narrowly and noted that there are many types of reuse, including by the lab that originally acquired the data. We think of benefits to the researcher in terms of data citations and credit, and the value of data management to the researchers was noted yesterday. But as discussed, benefits to researchers also include increased collaborations leading to new publications. Dr. Irizzary also noted that public data in genomics are used to create new tools and new algorithms that can benefit researchers in their work.
The discussion also included perspectives on how data sharing can and should be paid for. The issue of using indirect costs to help offset some of the costs associated with making data public was raised.
There is little consistency in data management across lab groups and little formal training related to data management and sharing.
Benefits of data sharing should be thought of in broader terms than just third-party reuse; they include ease of access to the submitter’s own data and increased collaborations.
The community as a whole benefits from new tools and algorithms that are developed from public data.
Looking at the issues in data sharing, such as privacy and cost, through the lens of different disciplines may also provide useful models for managing costs and risks.
Moderator: Kristen Rosati
Speakers: Kristen Rosati, Coppersmith Brockelman PLC; Mark Rothstein, University of Louisville School of Medicine; Cora Han, University of California; Anita Allen, University of Pennsylvania; Neil Thakur, ALS Association; Ashley Farley, Bill & Melinda Gates Foundation; Maryrose Franko, Health Research Alliance.
“Let’s focus on who has the rights to do what with the data rather than data ownership.” – Kristen Rosati
“What is required is neither a carrot nor stick but a set of guard rails and some signage.” – Cora Han
This session broadened the discussion on barriers to data sharing by framing it from privacy, ethical, and legal standpoints followed by presentations on how to encourage data sharing outside of mandates.
The issue of data privacy and security was covered from a regulatory perspective by Kristen Rosati (NASEM Health and Medicine Division, 2021v). There are many federal, state, and international laws that impact whether data can be shared, many of which are moving targets. For example, agencies that enforce the Federal Policy for the Protection of Human Subjects (“Common Rule”; Office for Human Research Protections, 2009) are required to consider whether changes in technology drive the need to change what is considered ‘identifiable.’ There are other legal issues that will impact data sharing, such as intellectual property, antitrust compliance, anti-kickback compliance, and regulations affecting downstream use (such as Food and Drug Administration regulation of artificial intelligence). There are also third-party contractual rights, and disputes about data ownership; Rosati contends it will be more useful to focus on who has rights to do what with data rather than strict definitions of ownership.
Dr. Anita Allen presented on ethical aspects of data sharing, considering the question of ‘fairness’ from several angles (NASEM Health and Medicine Division, 2021w). In particular, she noted that there is more than one set of FAIR principles. The Fair Information Practice Principles date back to the 1970s and provide a framework for expectations around protection of privacy (Dixon, 2008; EPIC, 1973; Organisation for Economic Co-operation and Development, 1980). These principles are used as a privacy framework in legal circles but not in bioethics circles or in current discussions around data sharing policies. Dr. Allen suggested that they could be incorporated, but cautioned that the genuine ‘notice and informed consent’ called for by fair information practices is elusive and that how these principles mesh with open science and open sharing mandates will have to be determined.
Fairness was also considered in the context of treatment of marginalized peoples and their representation in the FAIR data space. With the dawn of artificial intelligence and machine learning, we are much more sensitive to the issue of bias in available public data. At the same time, previously marginalized groups may be more reluctant to make their data available due to past abuses, leaving gaps in readily accessible data. FAIR and fair data management and sharing will require a combination of both traditional bioethical principles and digital-era specific information management principles that allow researchers to meet responsibilities owed to the public regarding publicly funded research, while at the same time protecting the rights of individuals to manage self-care and their duties to others.
Given the flexibility of the policy, data governance cannot be driven entirely by carrots or sticks but must come with a set of guardrails that define the bounds of acceptable behavior. An example of such guardrails was provided by Cora Han, who presented the University of California’s principles of data governance, used in guiding their decisions around using and sharing health data (NASEM Health and Medicine Division, 2021x).
The critical issue of consent was discussed by Dr. Mark Rothstein, who flipped the usual perspective of the effect of consent on data sharing by focusing on “the effect of data on informed consent” (NASEM Health and Medicine Division, 2021y). The essence of big data research is that we do not know what will be valuable so we have to aggregate everything. Individual participants may be alarmed to know that their data may be combined with additional publicly available ‘big data’ sources such as educational or military records, or their travel patterns through geolocations. Dr. Rothstein also considered how to shape consent for the age of big data. Simply increasing the amount of information in consent forms in the name of transparency may not be the best approach. Such an approach could lead to consent bias, where those who are willing to sign such consents are different from and therefore unrepresentative of those who are unwilling to participate. Dr. Allen pointed out in the Q&A discussion that the relationship between participants and researchers must be built on trust, and no amount of consent verbiage could substitute for trust (NASEM Health and Medicine Division, 2021z). The panel also considered whether mechanisms beyond consent need to be considered for protecting individual and group rights, for example, a federal law preventing reidentification or perhaps a data trust.
The second part of Session V featured perspectives on how to encourage data sharing outside of mandates. The speakers considered how to change the language around data sharing from “Theft, Sacrifice, Obligation” to “Honored Work” (Dr. Neil Thakur; NASEM Health and Medicine Division, 2021aa) and how to get researchers to start viewing data as an ‘asset’ that is worthy of sharing (Dr. Ashley Farley; NASEM Health and Medicine Division, 2021ab). Incentives also need to be directed toward the program portfolio managers at the funding agencies. Program officers need to be incentivized and honored for making data sharing impactful across portfolios. Dr. Maryrose Franko emphasized the need for strong partners and worked examples of good practices across the workflow that can serve as templates for those trying to engage various constituencies (NASEM Health and Medicine Division, 2021ac). Although we want to move beyond mandates, tools for tracking compliance with any policies were seen as essential for moving forward. Such tools can be used to acknowledge good behavior, for example, through badges, rather than just focusing on noncompliance.
Context will always inform data governance decisions.
Data security is critical, and downstream uses must be considered.
Establishing trust with all participant groups is also essential and requires sustained effort.
One should expect that privacy and ethical issues surrounding data sharing will constantly evolve.
Given this evolution and the flexibility of the policy, data governance will require ethical ‘guardrails and some signage.’
Focusing just on researchers for successful implementation of this policy is too limited. Issues of concern to all stakeholders, including funders and the public, will have to be considered.
Program officers and funders should use encouragement and incentives to ensure that data sharing is ‘honored’ work.
Grant reviewers and grants officers will need training and tools to recognize well-thought-out and effective data management plans.
Moderator: Richard Nakamura, National Institutes of Health (Retired)
Panel: Christine Borgman, University of California, Los Angeles; Philip Bourne, University of Virginia; Susanna-Assunta Sansone, University of Oxford, United Kingdom; Margaret Sutherland, Chan Zuckerberg Initiative; Neil Thakur, ALS Association; Letisha Wyatt, Oregon Health & Science University.
“The question is not whether we are ready, it’s can we be more ready?” – Philip Bourne
“Data management and sharing should be invisible to researchers, but visible to funders.” – Susanna-Assunta Sansone
“Data generation platforms scale faster than data management solutions.” – Margaret Sutherland
The final session involved a lively discussion with a multistakeholder panel on whether the biomedical community is ready for the policy to take effect in January 2023 (NASEM Health and Medicine Division, 2021ad). As noted by Dr. Thakur, the answer is clearly ‘Yes’ at a very basic level, that is, the NIH is ready to collect DMSPs and oversee the policy. But the presentations over the 2 days showed that doing it well is extremely complex and requires an enormous infrastructure. Dr. Philip Bourne recommended that the NIH ask itself “can we be more ready?” come 2023. He still feels that the reasoning around this policy and the value of data sharing are not clear for many scientists, and therefore researchers would rather spend their time doing new research. Addressing this requires a cultural change in the NIH itself. The policy is just the guide; what is important is what is actually happening across the institutes and not what is being enforced.
Despite the overall feeling of optimism expressed by many, Dr. Christine Borgman was struck by how little the conversation around data sharing in this workshop differed from conversations occurring over the past 2 decades. We are still discussing incentives, sticks, and economic tradeoffs. Why is that? She noted that infrastructures are everywhere but no one is really in charge. We need to ensure that knowledge infrastructures develop and evolve to meet the challenges. Challenges remain in the area of sustainability, not only of data but also of software, which was not mentioned during the workshop. Data is useless without software. More studies are clearly needed to assess progress, including sociotechnical, informetric, and econometric studies.
Policies are only as good as their implementation, and Prof. Susanna-Assunta Sansone pointed out that if the policy is too vague and flexible, it is very easy to fudge compliance. She noted that data management training is like a diet where one eats often but eats very little. To transform the culture, FAIR research data management must be a first-class subject, and educational material must be produced by data professionals to empower researchers. One exemplar is the FAIR Cookbook from the European Union’s Innovative Medicines Initiative project FAIRplus, a public-private partnership between pharma, ELIXIR, and other partners. Future success will also strongly depend on research data management and sharing becoming a recognized topic for research and, conversely, on data management and sharing becoming invisible to users but highly visible to funders.
Dr. Margaret Sutherland noted that data generation platforms scale faster than data management solutions. Data management and sharing platforms are not well supported, making it hard to keep up. Cultural change begins by training graduate students, postdocs, and staff scientists, but it is also important to create opportunities to put training into practice. How can we support communities of practice around data sharing and make these practices sustainable?
Equity as a foundation for the new policy was brought up by Dr. Letisha Wyatt, who asked a set of questions to frame the issue. Data is power and represents a return on investment, so we must be aware of the stratification of power, assets, and investments. Are we being inclusive when we talk about citizen scientists, creators, and owners? What is the impact of research and knowledge management on PEER (Persons Excluded because of Ethnicity and Race) communities? Are we embedding the culture we want to perpetuate into our training? In the area of leadership, whose perspectives are we missing? Our values and beliefs drive behavior, and collective behaviors set the culture and create norms. We are in the process of reestablishing our norms, so while we are pushing forward, we can put some guardrails in place.
Culture change starts at the NIH itself, with NIH finding ways to honor effective and impactful sharing for its awardees and the staff that oversee research funding.
Knowledge infrastructures are a key component that must evolve to meet the challenges that will be faced.
Policies are only as good as their enforcement, but enforcement is not the end goal. The measure of success for this policy is not how much data is shared but whether it advances science and improves health.
Data management and sharing must be recognized as a research area worthy of study and academic advancement.
An equity lens should be brought to bear on data management and sharing, but equity, like data management and sharing, is a skill that needs to be learned.
The workshop offered a rich set of perspectives on the challenges and opportunities of shifting the culture on DMS in biomedicine. Many speakers remarked on the palpable shift in the conversation around data sharing. If we are going to share data, then how do we do it in a way that is impactful across all stakeholder groups? Many ideas were put forward and not all speakers agreed about what was necessary, but it was clear that there is a significant body of work, tools, and experiences that can be drawn upon as we seek to change the culture. These resources come not just from biomedicine, nor even just from science, but from multiple disciplines including digital information principles, law, and economics. It is important that biomedicine keeps the channels of communication open so that we can be informed by these experiences. The basic infrastructure is there, although it has not been tested at scale given that the amount of data that is shared is still only a fraction of that produced. Nevertheless, despite knowing that there are significant challenges ahead, we will learn by doing. Doing requires training, materials, tools, and worked examples across the data lifecycle to make the necessary processes take hold. Staffing is also a concern. The need for a professionalized workforce that can support all aspects of DMS was voiced several times at the workshop.
Data management and sharing are a team sport: they require multiple skills, and the expertise needed is distributed across multiple stakeholders, including libraries. Figuring out effective structures for bringing the necessary expertise together should be a priority. Stakeholders across the spectrum of activities have to have a say in how data management and sharing is done, and for it to be done well, the incentives have to match the effort required. Efforts to ensure that data submitters are credited and recognized for their efforts to make their data FAIR and open are underway, but they are nascent and uncoordinated. The policy provides an opportunity for those seeking to implement credit and reward systems, whether they be data citations (publishers), grant funding (funders), promotion and tenure (universities), or peer recognition (all of us), to accelerate and coalesce their efforts. It is important that successes across all of these indicators are documented and shared, to help spur the culture change we seek.
Data management and sharing should be viewed as an integral part of the overall scientific workflow and therefore subject to the same careful planning and foresight as other aspects of the project. Just as placement and movement of people in a new building must be part of the earliest planning by an architect, the efficient movement—whether moving data to compute or compute to data—and use of data must be part of early planning in a scientific study.
A common theme was that data management in the lab could be a significant gateway to data sharing. The focus on data sharing has to shift from the individual to the lab as the unit, and from an individual sharing with an unknown third party to lab members sharing with each other and with colleagues. Moreover, we had some examples of how sharing in public repositories helps the submitter: through citations of their work, through possible new collaborations, and through ready access to their own data, which likely underwent some standardization and curation. Access to large public data sets also leads to better tools for individual researchers, as many tool and algorithm developers rely on these large data sets to produce these new tools.
It was clear from the workshop discussions and examples provided that when a lab is organized for open science and data sharing, the following deep benefits fall out:
The plan and open science approach are created at the beginning, so the DMS plan is virtually prewritten for grant inclusion.
Data cleaning and analysis are preorganized and less subject to manipulation.
Statistical modules can be built into studies.
It is easier to train new lab members in lab procedures and appropriate methodology, making it easier to assure equitable internal sharing of key methods and approaches, and simplifying FAIR data sharing with outside investigators.
Data and results are preorganized so that paper generation is faster.
Data sharing can be automated and organized to slip into larger databases; it is easier to compare results with data from other labs if similarly organized.
Electronic lab notebooks can be more readily shared and reviewed.
Video records can be incorporated.
The path and control of data that cannot be fully shared can be made explicit.
Developing the plan, tools, and training for implementation represents a significant initial investment by researchers that could be aided by grant support from government, philanthropy, and the private sector. A key advantage of the above is that subsequent investments for smoothing and improving data flow require much less effort.
Despite emphasis on the benefits of data management and sharing to the lab, the researcher, and science, many of the participants acknowledged that the research community views these activities and the NIH Data Management and Sharing Policy more as a tax and an obligation than as valued work. This view will have to shift if the desired impact is to be realized because effective data sharing takes time, effort, and money. But although it would be great just to focus on shoring up incentives, several panelists expressed the view that incentives were not going to be enough and that without the ability to enforce the mandate, the NIH Data Management and Sharing Policy likely will not succeed. Researchers will have to see data management and sharing as a fundamental part of good science and their reputations as scientists.
Alongside incentives and enforcement, the terms ‘trust’ and ‘respect’ came up in multiple contexts. Data governance is built on trust and respect, but both have to be earned. At the most basic level, stakeholders will need to have trust in the system: trust that they are sharing data legally and trust that the infrastructure for making their data accessible will be sustained. Respect has to be accorded to those who share their data and go through the effort to make it FAIR; on the other hand, the data producers should respect those who have the expertise to extract meaning from these data, perhaps in ways that the producers could not. Researchers should not be blind to the people who participate in studies and their interests, particularly those from marginalized groups. Ethical principles from cultures and communities may not align with those of researchers and policies and these communities deserve our respect. Respect helps build trust, and trust fuels use of data. Differences in trust across cultures and communities will lead to asymmetrical results in data sharing and therefore bias in data analysis. It was also noted that with the flexibility of the policy and the difficulty in anticipating all possible uses of the data, establishing trust with research participants and having clear guidelines is critical.
What was apparent from the presentations and discussions was that the shift in culture needs to be systemwide if the new policy is to have the desired effect. Although the lab is rightly in the middle as the locus of data generation, the workshop participants shared that this system includes the institution, the wider international scientific community, and the internal culture of the NIH as well. Training, tools, evaluation methods, worked examples, and an adequate workforce are required across the board.
The authors wish to thank all the workshop participants for their edits and input with special thanks to Neil Thakur and Carole Goble for their thoughtful comments and suggestions.
Richard Nakamura has no financial or non-financial disclosures to share for this article. Maryann E. Martone is a founder and has equity interest in SciCrunch Inc, a tech start up that provides tools and services in support of rigor and reproducibility.
CoreTrustSeal. (n.d.). Retrieved January 15, 2022, from https://www.coretrustseal.org
Dixon, P. (2008). A brief introduction to Fair Information Practices. World Privacy Forum.
Global Alliance for Genomics and Health. (n.d.). Retrieved June 18, 2022, from https://www.ga4gh.org/
National Academies of Sciences, Engineering, and Medicine. (2021, April 28–29). Changing the culture of data management and sharing: A workshop. https://www.nationalacademies.org/event/04-29-2021/changing-the-culture-of-data-management-and-sharing-a-workshop
Office for Human Research Protections, U.S. Department of Health and Human Services. (2009). Federal Policy for the Protection of Human Subjects (“Common Rule”). https://www.hhs.gov/ohrp/regulations-and-policy/regulations/common-rule/index.html
Office of the Director, National Institutes of Health. (2020). Final NIH policy for data management and sharing (NOT-OD-21-013). National Institutes of Health. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html
Organisation for Economic Co-operation and Development. (1980). OECD guidelines on the protection of privacy and transborder flows of personal data. https://www.oecd.org/sti/ieconomy/oecdguidelinesontheprotectionofprivacyandtransborderflowsofpersonaldata.htm
NASEM Health and Medicine Division. (2021a, May 12). 04/28/21 - Day 1: Goals for the NIH Data Management and Sharing Policy [Video]. YouTube. https://youtu.be/ab_PQoAgiw4
NASEM Health and Medicine Division. (2021b, May 12). 04/28/21 - Day 1: What has the COVID-19 pandemic taught us about data sharing and open science? [Video]. YouTube. https://youtu.be/2HzVUUAYS_U
NASEM Health and Medicine Division. (2021c, May 12). 04/28/21 - Day 1: Panel: Perspectives on data management and sharing across different types of data [Video]. YouTube. https://youtu.be/kaEyWaRNIvQ
NASEM Health and Medicine Division. (2021d, May 12). 04/28/21 - Session I: General challenges with data sharing - Koskela [Video]. YouTube. https://youtu.be/r9AXxbR0gU8
NASEM Health and Medicine Division. (2021e, May 12). 04/28/21 - Session I: General challenges with data sharing - Poldrack [Video]. YouTube. https://youtu.be/3QOr15XrNTQ
NASEM Health and Medicine Division. (2021f, May 12). 04/28/21 - Session I: Data formatting: Exploring challenges and potential solutions - Ferguson [Video]. YouTube. https://youtu.be/W450ZKKwWI4
NASEM Health and Medicine Division. (2021g, May 12). 04/28/21 - Session I: General challenges with data sharing - Wolfe [Video]. YouTube. https://youtu.be/JJaGZIXeG9U
NASEM Health and Medicine Division. (2021h, May 12). 04/28/21 - Session I: Data formatting: Exploring challenges and potential solutions - Haussler [Video]. YouTube. https://youtu.be/EFch0dtGeT4
NASEM Health and Medicine Division. (2021i, May 12). 04/28/21 - Session II: Exploring the current state of data citation methods and tools - Lowenberg [Video]. YouTube. https://youtu.be/tXTppfsUaEk
NASEM Health and Medicine Division. (2021j, May 12). 04/28/21 - Session II: Exploring the current state of data citation methods and tools - Zigoni [Video]. YouTube. https://youtu.be/TZTFCFGpswA
NASEM Health and Medicine Division. (2021k, May 12). 04/28/21 - Session II: Brief discussion with speakers [Video]. YouTube. https://youtu.be/p11cbo2F1oA
NASEM Health and Medicine Division. (2021l, May 12). 04/28/21 - Session II: What are the critical elements of a successful data sharing plan? - Hanisch [Video]. YouTube. https://youtu.be/GyOQLrRzLWs
NASEM Health and Medicine Division. (2021m, May 12). 04/28/21 - Session II: Brief discussion with speakers [Video]. YouTube. https://youtu.be/ulaadbL-knQ
NASEM Health and Medicine Division. (2021n, May 12). 04/28/21 - Session II: Working to establish best practices for data sharing - Martin [Video]. YouTube. https://youtu.be/jatm7M2nGmQ
NASEM Health and Medicine Division. (2021o, May 12). 04/28/21 - Session III: Moderated panel discussion [Video]. YouTube. https://youtu.be/6aaDGosvTSc
NASEM Health and Medicine Division. (2021p, May 12). 04/29/21 - Session IV: Realizing the value of data management from the laboratory side [Video]. YouTube. https://youtu.be/_rKtcvUcgg4
NASEM Health and Medicine Division. (2021q, May 12). 04/29/21 - Session IV: Reflections from the field [Video]. YouTube. https://youtu.be/loOAZSQTZw0
NASEM Health and Medicine Division. (2021r, May 12). 04/29/21 - Session IV: Data quality and factors that make data more likely to be reused - Irizarry [Video]. YouTube. https://youtu.be/1zn4--xwO68
NASEM Health and Medicine Division. (2021s, May 12). 04/29/21 - Session IV: Data quality and factors that make data more likely to be reused - Goroff [Video]. YouTube. https://youtu.be/M_9ft6VITA0
NASEM Health and Medicine Division. (2021t, May 12). 04/29/21 - Session IV: Data quality and factors that make data more likely to be reused - Pasquetto [Video]. YouTube. https://youtu.be/bCgPNQuGf_w
NASEM Health and Medicine Division. (2021u, May 12). 04/29/21 - Session IV: Panel discussion [Video]. YouTube. https://youtu.be/PG7GHvbhCoE
NASEM Health and Medicine Division. (2021v, May 12). 04/29/21 - Session V: Overview of legal issues around data sharing - Rosati [Video]. YouTube. https://youtu.be/KufyDoykuQg
NASEM Health and Medicine Division. (2021w, May 12). 04/29/21 - Session V: Implementing the NIH Data Management and Sharing Policy - Allen [Video]. YouTube. https://youtu.be/LnQN5OieIBM
NASEM Health and Medicine Division. (2021x, May 12). 04/29/21 - Session V: Creating good data governance - Han [Video]. YouTube. https://youtu.be/LCHoQwNyX-w
NASEM Health and Medicine Division. (2021y, May 12). 04/29/21 - Session V: Seeking “informed” consent for large-scale data sharing - Rothstein [Video]. YouTube. https://youtu.be/5yfmJwbElK8
NASEM Health and Medicine Division. (2021z, May 12). 04/29/21 - Session V: Discussion/Q&A with the speakers and participants [Video]. YouTube. https://youtu.be/PfzvkaKODWo
NASEM Health and Medicine Division. (2021aa, May 12). 04/29/21 - Session V: Encouraging data sharing outside of mandates - Thakur [Video]. YouTube. https://youtu.be/DPLKtH5Pw2I
NASEM Health and Medicine Division. (2021ab, May 12). 04/29/21 - Session V: Encouraging data sharing outside of mandates - Farley [Video]. YouTube. https://youtu.be/D6BYWtLuXCo
NASEM Health and Medicine Division. (2021ac, May 12). 04/29/21 - Session V: Encouraging data sharing outside of mandates - Franko [Video]. YouTube. https://youtu.be/SbmY9Jlfb8A
NASEM Health and Medicine Division. (2021ad, May 12). 04/29/21 - Session VI: Multi-stakeholder panel with select speakers [Video]. YouTube. https://youtu.be/T1AUNoe3BQA
Research Data Management Librarian Academy. (n.d.). Retrieved June 19, 2022, from https://rdmla.github.io/
Twig Interactive. (2021). What is a data trust? The role of data trust in data sharing initiatives.
US Research Software Engineer Association. (n.d.). Retrieved June 19, 2022, from https://us-rse.org
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3, Article 160018. https://doi.org/10.1038/sdata.2016.18
Zarkadakis, G. (2020, November 10). “Data trusts” could be the key to better AI. Harvard Business Review. https://hbr.org/2020/11/data-trusts-could-be-the-key-to-better-ai
©2022 Maryann E. Martone and Richard Nakamura. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.