The widespread adoption of open data practices has resulted in a wealth of open data sets and increased attention on data as a public asset. To fulfill responsible data stewardship, we must move beyond enabling data access and take steps to understand how administrative and research data are found, accessed, and reused. We need standardized evaluations of data usage to assess whether and how open data advances policy, science, and society, and this can only happen through evidence-based data metrics. To catalyze adoption of data metrics, the Make Data Count Summit convened representatives from government agencies and research-supporting organizations to tackle current challenges and explore actionable next steps. The multi-stakeholder discussions highlighted that the building blocks for reliably developing data metrics are now in place. Through community-built tools and infrastructure, it is now possible to make connections between data and other objects and to make those connections available to the community at a scale not possible before. Building on these connections, administrative and research communities are exploring how to incorporate data evaluation into their processes. Much remains to be done, but the current challenges and gaps around data discoverability and usability can serve as prompts to focus the next steps in community efforts and the questions to address through bibliometric studies. We outline immediate priorities for government agencies, funders, institutions, publishers, and researchers to drive adoption of data metrics. The recommended next steps are possible through support for the technical infrastructure in place and compliance policies across institutions, and are underscored by a collective commitment to building trust. For data metrics to be trusted, they must align with the principles that underlie open data.
The development and adoption of open data metrics must be prioritized now for the responsible stewardship of data as a public asset.
Keywords: data metrics, open data, data usage, evaluation, research policy, infrastructure
Policy and infrastructure developments over the last decade have enabled widespread sharing of research and administrative data. However, data stewardship has so far focused on data access, and our understanding of what data is used and how open data translates into societal benefit remains limited. Making such evaluations is only possible with evidence-based data metrics. While government administrators and research organizations recognize the need for evaluating data usage, progress in the adoption of data metrics has stalled. The Make Data Count Summit convened representatives from government agencies and research-supporting organizations to tackle current challenges and explore actionable next steps. The diverse stakeholders agree that the necessary technical and cultural infrastructures are now in place to support the development of data metrics. Ongoing projects such as Democratizing Data and the Data Citation Corpus (led by DataCite in collaboration with the Chan Zuckerberg Initiative; CZI) are making data usage information available to the community at an unprecedented scale. Assessing data usage will require an inclusive approach sensitive to the variety of data types and the needs of different communities. All stakeholders can take steps to support adoption of data metrics in the immediate future: Researchers and infrastructure providers should explore improvements to the accessibility and discoverability of open data, academic journals must improve their workflows for data citations, and research and administrative institutions should incorporate evaluations of data reuse in their assessment frameworks. Approaches to developing data usage metrics will need to address complex questions, including those around diverse data types and use cases, which data sets are currently underutilized, and which data users are not currently accounted for.
For data metrics to be trusted, they must align with the principles of open data: data metrics and all the inputs that underlie them must be complete, transparent, auditable, and contextualized. Now is the time to prioritize data metrics and advance toward a state where open research and administrative data are routinely evaluated and reused to drive policy, discovery, and societal benefit. This must happen for the responsible stewardship of data as a public asset.
In many respects, today’s open data practices closely resemble the ideals that the pioneers of the open data movement envisioned for research and administrative data communities. Policies across government and research have transformed perspectives on opening up data for reuse: for example, in the United States, the Foundations for Evidence-Based Policymaking Act (2018), the updated National Institutes of Health Data Management and Sharing Policy (National Institutes of Health, 2020), and the 2022 White House Office of Science and Technology Policy (OSTP) memo (OSTP, 2022), as well as the policies by Plan S and the European Commission (Burgelman et al., 2019; Plan S, n.d.). Trusted infrastructure exists to support preservation, publication, curation, and secure access to diverse types of data, and to meet the needs of wide-ranging use cases (Pampel et al., 2023). More broadly, the promise of open data is embraced across public, private, and research sectors (UNESCO, 2021).
While policy and infrastructure for research and administrative data publishing have flourished over the last decade, data stewardship practices have so far primarily focused on data accessibility, but left out the evaluation of data usage, that is, the standardized assessment of how data are found, accessed, analyzed, and utilized as part of policy development and research activities. As a result, there are more data sets available than ever before—either openly accessible or via platforms that enable controlled access for data sets that cannot be made publicly available—but we still lack evidence on whether and how they are being reused as part of research activities or policy development. This limits our ability to reward those who produced the data that is reused, and deprives institutions, funders, and government agencies of key insights to make informed decisions about best practices, funding, and policy.
Infrastructure is available for researchers to share their data openly, and there are examples of communities where the availability of high-quality open data has enabled broad reuse. For example, Hubble Space Telescope data have been used in many publications, often without the involvement of the scientists who participated in data acquisition. However, such broad discoverability is not yet the norm across fields and technical systems, which limits opportunities for reuse and, in turn, acknowledgment for the data producer. Gaps in discoverability are highlighted by a recent analysis showing that among researcher ORCID records reporting research outputs, only 0.3% reported producing data sets (Sixto-Costoya et al., 2021). Importantly, research assessment frameworks remain heavily dependent on publications and few include data in the evaluation of research contributions. Sharing data may be a policy requirement, but there is little incentive for demonstrating that data sets have been reused or for developing new lines of research that leverage reuse of existing data sets. As a result, researchers still perceive data sharing as an additional burden that will bring little professional benefit (Perrier et al., 2020).
For decades there has been discussion around whether ‘carrots’ (incentives) or ‘sticks’ (policies and mandates) are most effective at driving open data. Some funders’ data policies stipulate that funding may be withheld or future applications refused if requirements for data sharing are not complied with (Bill & Melinda Gates Foundation, 2011; National Institutes of Health, 2020), but whether regular monitoring of compliance with such policies takes place remains unclear. On the side of incentives, ‘impact’ and ‘reproducibility’ claims are often based on assumptions and anecdotes rather than quantitative and reliable evidence. Administrative and research communities lack an understanding of which data should be made open and for what uses, what data is highly used or underutilized, and whether and how data is translating into societal benefit. This raises the question: If we do not evaluate data usage, how can we build trust in our programs or goals of advancing science and society?
Data metrics are intrinsically linked to data sharing and are a key piece to inform the further development of open data practices. For open data to drive research and policy development activities, data sharing needs to be rewarding for data producers. And getting to this point requires evidence-based open data metrics (i.e., ‘data metrics’) that support the evaluation of data usage. Data metrics are defined here as meaningful and contextualized quantitative or qualitative measures of how open data sets are accessed or utilized (Lowenberg et al., 2019). Data metrics may include measures of data usage (e.g., in the form of normalized counts of views and downloads), data citations (i.e., structured references to data as part of a scholarly work), or other measures developed to account for new uses of data as they develop and mature (Lowenberg et al., 2019).
For individual researchers, data metrics can support building a narrative around the reach of their research contributions. As research assessment frameworks evolve to incorporate research objects beyond journal articles, data must be a key addition among the research contributions considered. Contextualized data metrics will enable researchers to provide evidence of the reach of their data, and evaluation committees to complete a meaningful evaluation of those contributions.
As we consider the collections of data sets produced by research and government activities, data metrics will also provide critical insights to institutions, funders, and government agencies into the behaviors and practices of different communities. Measures of data usage and data citation may highlight what data sets are consistently being used over time, or are bridging multiple disciplines to enable new lines of research. They may also signal whether data sets are underutilized, or accessed by specific groups such as patients or the public. As knowledge grows around what data sets are being used, by whom, and for what purposes, there may be increased capabilities to understand the value and impact (financial, academic, and societal) of our collections of open data sets.
There has been wide debate around the use, and misuse, of metrics as part of research assessment. The use of citation counts for journal publications has led to attempts to improperly inflate the citation counts for some papers (Mehregan, 2022); these behaviors should not be repeated in the context of data citations. To minimize the risk of misuse, communities must be advised to avoid a path toward an oversimplified single measure and instead build a suite of metrics that provide different perspectives around the use of data sets. Such metrics should account for the fact that some research disciplines rely on extensive data usage while others consume data to a lower degree. In addition, the perceived value of specific data sets may vary according to the scarcity of such data sets (e.g., data from rare specimens of populations) or their use case (e.g., to inform patient care vs. software development) and not be fully reflected in the overall counts of views and downloads. It is important to continue to develop and expand community-driven standards for filtering out incongruous patterns as part of the normalization of data usage counts. As an example, the COUNTER Code of Practice for Research Data, a standard to normalize usage metrics across data repositories, provides guidance for excluding activity generated by internet robots, crawlers, and spiders (Fenner et al., 2018).
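As an illustration of this kind of normalization, the following Python sketch drops events whose user agent looks automated before counting views and downloads. It is a minimal sketch under stated assumptions: the regular-expression patterns, the `UsageEvent` type, and the sample identifiers are hypothetical and are not taken from the COUNTER Code of Practice for Research Data, which maintains its own list of robots and further processing rules.

```python
import re
from collections import Counter
from dataclasses import dataclass

# Illustrative patterns only (an assumption for this sketch);
# this is NOT the official COUNTER list of robots, crawlers, and spiders.
ROBOT_PATTERNS = [r"bot", r"crawler", r"spider"]
ROBOT_RE = re.compile("|".join(ROBOT_PATTERNS), re.IGNORECASE)


@dataclass(frozen=True)
class UsageEvent:
    """One raw usage event from a repository log (hypothetical shape)."""
    dataset_id: str   # e.g., a DOI; the values below are made up
    kind: str         # "view" or "download"
    user_agent: str


def is_robot(user_agent: str) -> bool:
    """Return True if the user agent matches any illustrative robot pattern."""
    return bool(ROBOT_RE.search(user_agent))


def normalized_counts(events):
    """Count events per (dataset, kind) after dropping robot traffic."""
    counts = Counter()
    for event in events:
        if not is_robot(event.user_agent):
            counts[(event.dataset_id, event.kind)] += 1
    return counts


events = [
    UsageEvent("10.1234/example-a", "view", "Mozilla/5.0 (X11; Linux x86_64)"),
    UsageEvent("10.1234/example-a", "view", "ExampleBot/2.1 (+http://example.org/bot)"),
    UsageEvent("10.1234/example-a", "download", "Mozilla/5.0 (Macintosh)"),
    UsageEvent("10.1234/example-a", "download", "my-spider/0.3"),
]
counts = normalized_counts(events)
```

Views and downloads are kept as separate counts rather than summed, in keeping with the suite-of-metrics approach described above.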
We have a duty toward responsible data stewardship to advance knowledge equity and bridge the digital divide. To make this possible, we need evidence-based, meaningful, and trusted data metrics. This is a singular need with diverse use cases and perspectives. We recognize that various stakeholders and disciplines are at different levels of maturity around data sharing adoption, with varying levels of confidence around their readiness for data metrics. This diversity of community practices should be accounted for, and simultaneously, we can start now with the development of responsible data metrics.
Many in the community recognize the importance of data usage evaluation and have had conversations about data metrics within their domains, or in forums related to open data. However, progress in adoption of data metrics has stalled. In order to develop meaningful metrics for data, there needs to be a foundation of open data sets and infrastructure that supports capturing data usage information. Community practices for sharing have made progress over the last decade, but there has been less progress in the implementation of workflows to enable the standardized collection and reporting of data usage counts. The COUNTER Code of Practice for Research Data, while implemented at some repositories, is not commonly adopted. On the publisher side, many journals have not yet optimized their workflows to adequately capture data citations in article reference lists and to deposit those citations as part of their metadata. As a result, many data citations are not propagated through the scholarly infrastructure, denying researchers opportunities for credit from those contributions, and limiting our understanding of how data sets are used in research. This incomplete picture of data usage and the lack of a clear set of resources to obtain normalized and contextualized data metrics pose challenges for funders, institutions, and government agencies seeking to incorporate data into their evaluation processes.
The Make Data Count initiative seeks to advance the development and adoption of data metrics and has collaborated with diverse communities to develop tools and practices to tackle those hurdles. In order to highlight ongoing efforts, and provide the community with a forum where different stakeholders could discuss coordinated actionable next steps, the initiative hosted a dedicated event on data metrics in September 2023. The Make Data Count Summit provided a unique opportunity to, for the first time, convene government agencies and those involved in scholarly research to tackle nuanced issues about the importance of data metrics. The event brought together representatives across research and research-supporting organizations, government and policy institutions, and infrastructure providers to discuss how to drive broader development and adoption of data metrics.
A common theme throughout the summit was that all stakeholders see the evaluation of data usage as an important immediate need in their agendas. While government administrators and research organizations may approach the evaluation of data usage with diverse motivations and methods, there are some clear areas of alignment. Opening up data provides context into policy development and enables scrutiny and reproducibility for research; open data is critical to ensure trust in policy and research. As part of any evaluation of data usage, we should be mindful of not eroding that trust by taking shortcuts. Bibliometric studies have shown that researcher practices for data use, reuse, and citation vary across disciplines (Gregory et al., 2023), and that data sets are used in a variety of ways (Khan et al., 2021). Oversimplification of metrics, while tempting, will perpetuate perverse incentives and create additional inequities. We should ensure that the data underlying any metrics are openly available to the community, to avoid the risk of opaque or commercially driven metrics that would undermine trust (Lowenberg, 2022). Next steps toward data metrics need to be taken responsibly, ensuring that any metrics included in research evaluation align with the values that institutions aim to reward (such as open practices and reproducibility) and with the questions that agencies seek to answer around the use of government data.
“If [data metrics] it’s done right, it’s okay. If it’s done, like, let’s say the h-index then oh my god, please no.”
The Make Data Count Summit highlighted that there are strong foundations upon which to build this data metrics future. Infrastructure has evolved substantially over the last few years. For example, persistent identifiers are gaining adoption, enabling workflows that make connections between data and other research objects. Bibliometric studies can identify information that is missing or incomplete and inform how to enrich the metadata we capture to help us continue to contextualize data metrics. Developments in artificial intelligence and machine learning are enabling the interrogation of articles and reports to surface how data sets have been used in research studies and government reports, as highlighted by the work done by Democratizing Data (Lane, 2022) and the CZI (Chan Zuckerberg Initiative Science, 2021).
Building on these needs, DataCite, a nonprofit organization that provides services to create and manage persistent identifiers and has led efforts in open infrastructure to support data metrics, is building the Data Citation Corpus in partnership with CZI and with support from the Wellcome Trust (Vierkant, 2023). The corpus seeks to expose citations to data from a variety of sources, to enable open community access to data usage information at an unprecedented scale. Support for shared infrastructure like this will allow it to grow into mature tools that can be incorporated into evaluation processes.
The necessary foundations for cultural infrastructure are also in place. Communities across the research and government sectors are actively discussing ways in which data metrics can be used as part of evaluations of data usage. The U.S. Foundations for Evidence-Based Policymaking Act (2018) has provided strong guidance for opening and understanding the use of administrative data. Research funders and institutions (e.g., the signatories of the CoARA agreement to reform research assessment practices, and the participants in the U.S. National Academies Roundtable on Aligning Incentives for Open Scholarship and the Higher Education Leadership Initiative for Open Scholarship [HELIOS Open]) are increasingly looking to update their evaluation frameworks to deliver a more equitable assessment of individual researchers and to gain a holistic understanding of the reach of the activities by their grantees and faculty. Such updates to research assessment must include data, with mechanisms for researchers to report on the data sets they have produced, and on their usage and reach.
While there are strong foundations to support data metrics, much remains to be done. The remaining gaps are frequently framed as barriers, with some communities hesitating to join the conversation until all the technical and cultural infrastructure is perfect. While such difficulties should not be brushed aside, it is essential to prioritize productive discussions around the dynamics of how data are reused, and to treat the challenges that arise in the course of our work as a guide to the areas we should focus on next.
The prioritization of meaningful evidence-based open data metrics requires a focus on technical infrastructure, compliance policies, and bibliometric research, along with a collective commitment to building trust among stakeholders.
Based on community discussions, including those hosted at the Make Data Count Summit, we outline the below areas to prioritize as immediate next steps:
Bibliometric studies should be conducted to understand how data sets are found and used, and by whom, across disciplines and use cases, and to measure the impact of individual data set contributions to a larger data collective.
Researchers and infrastructure providers should explore improvements to the accessibility and discoverability of open data; this includes understanding current barriers to finding relevant data sets and how discoverability impacts reuse metrics for data sets.
Government agencies and funders should take steps to optimize administrative and funder metadata for discoverability (e.g., adopting persistent identifiers for person, institution, funder), and adopt existing standards and best practices for administrative data collection and analysis.
Research and administrative institutions should update their assessment frameworks to incorporate evaluations of data reuse. Such assessments should also seek to gain insights into ‘what is missing’ in current evaluations, what data sets are not being used, and who are the data users that are underserved or not currently being accounted for.
Journals must prioritize improving their workflows to report data citations.
All of this is possible in the short term. Standards and infrastructure are already available to support and scale the adoption of data usage metrics across platforms (Cousijn et al., 2019). Discussions around updates to assessment frameworks are also underway at many funders and institutions. Policy implementation can take an incremental approach, starting with outreach and guidance (e.g., pointing to resources and tools that researchers and administrators can use to report on data usage), then gradually moving to embed data usage evaluation into all steps of research and policy, and evolving processes as data metrics develop and mature.
As progress is made on these immediate priorities, further steps can be taken to address additional nuances in data usage, including understanding differences in reuse between raw data and aggregate data (and their relations to each other), and the reuse of dynamic and operational data as opposed to static archived data sets. As we address these different aspects, we can collectively build evidence on the role that data sets play in advancing knowledge for the public good.
Research and convenings that address these complex questions will need to be resourced and pursued in parallel to ongoing work to support global adoption of data citation and usage practices and reporting. Addressing these priorities should happen across sectors to avoid the duplication of efforts and findings. Conversations and resourcing should not wait until data sharing is perfect, because the development of data metrics and insights into data usage will inform best practices around data sharing. We can align our resources and focus now to ensure an effective, inclusive, and steady pathway to open data metrics.
The driving force behind all efforts toward open data metrics needs to be building trust with data producers and data reusers, synergistically aligned with the goals of opening up data. The use cases for open data in academia, government, and industry are complex and diverse, but they all share common principles: open data should be complete, transparent, auditable, and contextualized. These principles are concisely wrapped up in the popularized notion of FAIR data (Wilkinson et al., 2016) to optimize data for reuse.
The need for trust cuts across data types and the diverse communities of producers and users. Metrics, diverse as they may be, need to reflect the same values and principles as open data. They should also avoid rigid, boxed-in metrics like a data impact factor or a data h-index that could lead to unintended or undesirable behavior, and harmful outcomes for research and policy. To achieve trust and mitigate risk, the inputs that underlie those metrics (for example, citations, usage counts, and metadata) must abide by the same principles as open data. We must proactively acknowledge the need for data metrics to be complete, transparent, auditable, and contextualized (Lowenberg, 2022).
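One way to picture metrics whose inputs are auditable is to derive each value from the records that underlie it rather than storing a bare number. The Python sketch below is a hypothetical illustration only: the class names, fields, and identifiers are assumptions made for this example and do not describe the Data Citation Corpus or any existing service.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CitationInput:
    """A single citation record underlying a metric (hypothetical shape)."""
    citing_work: str   # identifier of the article or report citing the data
    dataset_id: str    # identifier of the cited data set
    source: str        # where the citation was found, kept for transparency


@dataclass
class CitationMetric:
    """A citation count that carries its context and its own inputs."""
    dataset_id: str
    discipline: str    # context needed to interpret the count
    inputs: list = field(default_factory=list)

    @property
    def value(self) -> int:
        # The count is always recomputed from the inputs, never stored,
        # and the same citing work reported by two sources counts once.
        return len({c.citing_work for c in self.inputs})

    def audit_trail(self):
        # Every (citing work, source) pair that contributed to the count.
        return sorted((c.citing_work, c.source) for c in self.inputs)


metric = CitationMetric(
    dataset_id="10.1234/example-dataset",
    discipline="ecology",
    inputs=[
        CitationInput("10.5555/article-1", "10.1234/example-dataset", "publisher metadata"),
        CitationInput("10.5555/article-1", "10.1234/example-dataset", "text mining"),
        CitationInput("10.5555/article-2", "10.1234/example-dataset", "publisher metadata"),
    ],
)
```

Because the count is a property of the inputs, anyone with access to the records can check how it was reached and see which sources contributed to it.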
Now is the time to prioritize and build on shared interests in reaching a state where open research and administrative data are routinely evaluated and reused to enhance scientific discovery, advance policy, and better the world. This requires a shift in prioritization away from the last decade’s emphasis on data sharing writ large and toward a dedicated lens on data metrics.
The development and broad use of open data metrics can happen because people and technology are ready, it will happen when resources are dedicated toward this need, and it must happen for the responsible stewardship of data as a public asset.
We thank the members of the Make Data Count Advisory Group, Maria Gould, Kristi Holmes, Carly Strasser, and Jamie Wittenberg, for insightful discussions around data metrics and valuable suggestions for this article.
Iratxe Puebla is Director of Make Data Count, an initiative that seeks to drive development and adoption of open data metrics, and an employee of DataCite, which is developing the Data Citation Corpus. Daniella Lowenberg is an advisor to the Make Data Count initiative.
Bill & Melinda Gates Foundation. (2011). Global health data access principles. https://docs.gatesfoundation.org/Documents/data-access-principles.pdf
Burgelman, J.-C., Pascu, C., Szkuta, K., Von Schomberg, R., Karalopoulos, A., Repanas, K., & Schouppe, M. (2019). Open science, open data, and open scholarship: European policies to make science fit for the twenty-first century. Frontiers in Big Data, 2. https://www.frontiersin.org/articles/10.3389/fdata.2019.00043
Chan Zuckerberg Initiative Science. (2021, November 1). Extracting knowledge from biomedical literature. Medium. https://cziscience.medium.com/extracting-knowledge-from-biomedical-literature-402b4bed680a
Cousijn, H., Feeney, P., Lowenberg, D., Presani, E., & Simons, N. (2019). Bringing citations and usage metrics together to make data count. Data Science Journal, 18(1), Article 9. https://doi.org/10.5334/dsj-2019-009
Fenner, M., Lowenberg, D., Jones, M., Needham, P., Vieglais, D., Abrams, S., Cruse, P., Chodacki, J., & the Make Data Count Project. (2018). The COUNTER Code of Practice for Research Data. Appendix I: List of internet robots, crawlers and spiders. Project Counter. https://www.projectcounter.org/code-of-practice-5-0-1-sections/appendix-i-robots-crawlers-spiders/
Foundations for Evidence-Based Policymaking Act of 2018, Pub. L. No. 115-435, 132 Stat. 5529 (2019). https://www.congress.gov/bill/115th-congress/house-bill/4174
Gregory, K., Ninkov, A., Ripp, C., Roblin, E., Peters, I., & Haustein, S. (2023). Tracing data: A survey investigating disciplinary differences in data citation. Quantitative Science Studies, 4(3), 622–649. https://doi.org/10.1162/qss_a_00264
Khan, N., Thelwall, M., & Kousha, K. (2021). Measuring the impact of biodiversity datasets: Data reuse, citations and altmetrics. Scientometrics, 126(4), 3621–3639. https://doi.org/10.1007/s11192-021-03890-6
Lane, J. (2022). A vision for democratizing government data. Issues in Science and Technology, 34(1). https://issues.org/democratizing-government-data-lane/
Lowenberg, D. (2022). Recognizing our collective responsibility in the prioritization of open data metrics. Harvard Data Science Review, 4(3). https://doi.org/10.1162/99608f92.c71c3479
Lowenberg, D., Chodacki, J., Fenner, M., Kemp, J., & Jones, M. B. (2019). Open data metrics: Lighting the fire (Version 1). Zenodo. https://doi.org/10.5281/zenodo.3525349
Mehregan, M. (2022). Scientific journals must be alert to potential manipulation in citations and referencing. Research Ethics, 18(2), 163–168. https://doi.org/10.1177/17470161211068745
National Institutes of Health. (2020). NOT-OD-21-013: Final NIH policy for data management and sharing. Office of the Director. https://grants.nih.gov/grants/guide/notice-files/NOT-OD-21-013.html#_ftn8
Pampel, H., Weisweiler, N. L., Strecker, D., Witt, M., Vierkant, P., Elger, K., Bertelmann, R., Buys, M., Ferguson, L. M., Kindling, M., Kotarski, R., & Petras, V. (2023). Re3data – Indexing the global research data repository landscape since 2012. Scientific Data, 10(1), Article 1. https://doi.org/10.1038/s41597-023-02462-y
Perrier, L., Blondal, E., & MacDonald, H. (2020). The views, perspectives, and experiences of academic researchers with data sharing and reuse: A meta-synthesis. PLOS ONE, 15(2), Article e0229182. https://doi.org/10.1371/journal.pone.0229182
Plan S. (n.d.). Principles and implementation. Retrieved November 29, 2023, from https://www.coalition-s.org/addendum-to-the-coalition-s-guidance-on-the-implementation-of-plan-s/principles-and-implementation/
Puebla, I., Lowenberg, D., Buys, M., Strasser, C., Lane, J., Haustein, S., Robinson-Garcia, N., Thelwall, M., & van Leeuwen, T. (2023, September 12–13). Make Data Count Summit presentations. Washington, D. C. https://doi.org/10.5281/zenodo.8370593
Sixto-Costoya, A., Robinson-Garcia, N., van Leeuwen, T., & Costas, R. (2021). Exploring the relevance of ORCID as a source of study of data sharing activities at the individual-level: A methodological discussion. Scientometrics, 126(8), 7149–7165. https://doi.org/10.1007/s11192-021-04043-5
UNESCO. (2021). UNESCO Recommendation on open science. UNESCO Digital Library. https://doi.org/10.54677/MNMH8546
Vierkant, P. (2023). Wellcome Trust and the Chan Zuckerberg Initiative partner with DataCite to build the Open Global Data Citation Corpus. DataCite. https://datacite.org/blog/data-citation-corpus-announcement-2023/
White House Office of Science and Technology Policy. (2022). Ensuring free, immediate, and equitable access to federally funded research [Memo]. Executive Office of the President of the United States. https://www.whitehouse.gov/wp-content/uploads/2022/08/08-2022-OSTP-Public-access-Memo.pdf
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1), Article 1. https://doi.org/10.1038/sdata.2016.18
©2024 Iratxe Puebla and Daniella Lowenberg. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.