In this article we describe how data usage statistics could be designed to reward the contributions of both researchers and government agencies, functionally creating a carrot that rewards engagement. We provide a concrete use case—the Democratizing Data pilot project—and discuss how the results could be used to develop a community-driven incentive structure, or an ‘invisible hand,’ to generate better evidence for policymakers and lower costs for the taxpayer. We also note the potential to extend the approach and platform to other data-driven initiatives, notably investments in artificial intelligence.
Keywords: democratizing data, data value, Evidence Act, community-driven incentive structure, invisible hand
A new framework for public data—including both government fiat and appropriate incentives—is greatly needed to ensure that we can extract as much value as possible for the public good. That requires data on the use of data. The results of government fiat have been mixed over the past decade. This article proposes a practical and readily implementable incentive-based approach that integrates policies, infrastructure, and tools across federal agencies to better serve the scientific enterprise and public. It draws on the lessons learned from a successful democratizing data approach that could be calibrated to create an ‘invisible hand’ for data production and use.
The vision of the 2016 Commission on Evidence-Based Policymaking was to transform the way public policy is developed by having government agencies and researchers combine forces to produce data and evidence (The Commission on Evidence-Based Policymaking, 2017). It was also the impetus behind the Foundations for Evidence-Based Policymaking Act (2018, hereafter ‘Evidence Act’), which required agencies to maximize the use of data for the production of evidence that would serve as the backbone for new public policies. The act also requires agencies to report usage statistics, produce data inventories, and engage with the public. In this article we describe how the resulting data usage statistics could be designed to reward the contributions of both researchers and government agencies, functionally creating a carrot that rewards engagement. We provide a concrete use case—the Democratizing Data pilot project (Democratizing Data Project, 2024)—and discuss how the results could be used to develop a community-driven incentive structure, or an ‘invisible hand,’ to generate better evidence for policymakers and lower costs for the taxpayer. We also note the potential to extend the approach and platform to other data-driven initiatives, notably investments in artificial intelligence (AI).
Market mechanisms work on the principle of self-interest: a benefit-cost calculation. As Adam Smith pointed out in the Wealth of Nations, when describing the role of the marketplace, “it is not from the benevolence of the butcher, the brewer, or the baker that we expect our dinner, but from their regard to their own interest” (Smith, 1776). In the private sector, market signals have traditionally led to the creation of a vast variety of new data-driven products and transformed economic and social activity (Galloway, 2018). At least part of the reason is that firms have a clear profit incentive. Sensibly managed private organizations only create, gather, or use data when there are reasonably clear perceived benefits. Such benefits may be measured as direct increases in revenue or indirect measures of revenue such as improvements to product quality or reputation. Reports show that private sector firms that employ data-driven techniques are between 3% and 6% more productive than their counterparts, but in some cases, the use of data is transformational and turns firms into superstars (Brynjolfsson & McElheran, 2016; Farboodi & Veldkamp, 2023; Tambe et al., 2020).
The incentive structure is different for government agencies and for the researchers who produce and use data. Whereas businesses in the private sector can observe the benefits of the products that are produced from the use of data, government agencies typically operate with limited signals about how the data they produce are used, and thus limited information about what and how much to produce. In the case of data-producing agencies, congressional mandates require that they produce what they have always produced, with limited resources to innovate and change (Lane, 2021; Norwood, 1995). And unlike data-driven firms in the private sector, data-driven government agencies—particularly the federal statistical system—have seen their budgets decline with the perverse consequence that “staffing shortages have severely limited agencies’ abilities to take advantage of advances in statistical and data science necessary to modernize their statistical programs” (American Statistical Association, n.d., p. 2).
In the case of empirical researchers, the status quo does not reward them for making research data available. The rewards are for publications, not for data production—also known as the “Everyone wants to do the model work, not the data work” syndrome (Sambasivan et al., 2021). Furthermore, researchers may gain competitive advantage in publishing or consulting when they are the only ones with access to data (Ioannidis, 2005; Nelson, 2009). The result too often is that research data are not made available, and if the data are made available, the effort put into curation and documentation is perfunctory. An important and very timely assist has been provided by the mandates from the executive branch that require agencies, researchers, and publishers to provide access to government data (Executive Order No. 13642, 2013) and federally funded research data (Holdren, 2013; Nelson, 2022). Even so, in a recent analysis of nearly 1,800 papers whose authors had indicated when publishing that they would share their data, only 7% actually complied with written requests (Watson, 2022).
In other words, there are still substantial barriers faced by agencies and researchers to providing access to and analysis of data to produce evidence. As the Commission on Evidence-Based Policymaking report pointed out, progress in evidence-based policy is stymied by “lack of access by researchers outside of government and by individuals within government to the data necessary for evidence building, even when those data have already been collected” (The Commission on Evidence-Based Policymaking, 2017, p. 8). The U.S. Government Accountability Office released a report in July 2023 that found “mixed progress across government in collecting, analyzing, and using evidence effectively to make decisions” (U.S. Government Accountability Office, 2023).
Measuring the usage of data assets could disrupt this suboptimal situation and break down the barriers by providing new information about the benefits and costs of data access and use for both government agencies and researchers.
Government agencies, and hence the taxpayer, could directly benefit from more information about the benefits and costs of investments in data. Evidence of the benefits—for example, data that improve the human condition (Reynolds et al., 2020) or that lead to a more efficient agricultural sector (Pallotta et al., 2024)—could increase support for improving and expanding those data sources. Conversely, if data are not in high demand, data collection may no longer be necessary and costs could be reduced by reallocating resources to other priorities (Lane, 2021).
The researcher benefit-cost calculus would also be changed to benefit society. On the benefit side, researchers whose data contributions are cited and reused could be rewarded with greater data access or funding. In addition, if researcher contributions are quantified and highlighted by government agencies, researchers could point to their data visibility in the same way that they cite their publications and grants, providing an incentive for researchers to share code, learn about the idiosyncrasies of new types of data, and build on each other’s work (Lane, 2023). Recent work suggests that up to 85% of biomedical research efforts are somewhat wasteful—at least partly due to lack of data sharing (Freedman et al., 2015; Munafò et al., 2017). The result could be billions of dollars in savings to the taxpayer and certainly better science. If the costs of accessing and using data were lowered, both researchers and research support staff would benefit; the Advisory Committee on Data for Evidence Building (ACDEB) noted that at least a third of access costs could be attributed to staff time allocated to getting researchers up to speed on the intricacies of complex data.
One goal of this article is to describe the actors and incentive structures that characterize the current implicit ‘market’ for the collection, curation, and dissemination of public data. A second is to illustrate that it is possible to take concrete actions that can change the existing benefits and costs. To that end, the article devotes significant space to illustrating how the output of one such concrete action—the pilot Democratizing Data Project (2024), which was set up by four statistical agencies in response to the Evidence Act—could potentially be used to incentivize and improve researcher and agency data practices. Much has been and is still being learned from that pilot experience; this article also highlights some of the many challenges that have been identified. The article concludes by identifying possible next steps that could be taken by agencies (including their chief data officers and researchers) to fully realize the vision of the Evidence Act.
The executive and legislative branches of government have moved aggressively to require that both the value of data use and the engagement of relevant communities be integral to producing evidence to improve public policy. Title II of the Evidence Act (2018) requires that agencies provide information about how their data are used by the public to produce evidence, and that they appoint officials—chief data officers, chief evaluation officers, and statistical officials—responsible for ensuring that the information is provided.
In addition, as required by the Evidence Act, the ACDEB provided a set of recommendations to the Office of Management and Budget. One of these was that “evidence on data use should be used to inform the measurement of value” as “evidence on data use can inform the measurement of value and, by extension, be used to increase value” (Advisory Committee on Data for Evidence Building, 2022). The CHIPS and Science Act (2022) creates the concomitant infrastructure for funding the establishment of a National Secure Data Service Demonstration project, in part to align with some of the recommendations of the ACDEB.
The executive branch has also provided guidance about the role of incentives and stakeholder engagement in creating value for data. The National Institute of Standards and Technology’s (NIST) Research Data Framework (Hanisch et al., 2023; Hanisch et al., 2021) repeatedly identifies the importance of identifying the data value proposition, engaging with relevant stakeholders, and allocating credit at all stages of the data lifecycle. Similarly, while the framework identifies the normal process requirements for data quality (including “accuracy, completeness, update status, relevance, consistency across data sources, reliability, appropriate presentation, accessibility”), the lead requirement for quality data is the purpose and value of the data (Hanisch et al., 2023).
The importance of incentives and community engagement in producing data for the public good has been recognized in many other spheres. A particularly high-profile example is the role of government agencies and researchers in producing high quality data to support the development of trustworthy artificial intelligence (Redman, 2018). The National Artificial Intelligence Research Resource (NAIRR) Task Force noted that “the quality of many AI models depends on high-quality training and test data” and recommended that the “curation of AI data, models, tools, and workflows should be done by the user community in an AI data commons, facilitated by the NAIRR search and discovery platform” (Office of Science and Technology Policy, 2023).
There are other executive mandates. In 2013, the Office of Science and Technology Policy (OSTP) issued the ‘Holdren memo’ requiring researchers to provide access to peer-reviewed publications and to develop a data management plan that details how they would share their digital data (Holdren, 2013). In 2022, OSTP expanded the mandate in the ‘Nelson memo,’ which provided explicit timelines and requirements for federally funded scientific researchers to share the data they produce with public dollars (Nelson, 2022).
There have been substantial efforts to create greater value out of current investments. Research-oriented groups, like the Research Data Alliance, DataCite, Crossref, Figshare, the Generalist Repository Ecosystem Initiative, and many others, have been building infrastructure and tools to make it easier for researchers and government analysts to create, access, use, and credit people for research results. In general, these efforts share common needs: data infrastructure and metadata exchange, flexible frameworks, and methods to evaluate performance and create incentives. Transparency and accountability are particularly critical (Romer & Lane, 2022). In addition, there have been a number of projects by scholarly groups, like the National Academies’ Roundtable on Aligning Incentives for Open Scholarship and the Higher Education Leadership Initiative for Open Scholarship, that have provided high-level principles with respect to engaging stakeholder groups, but they are not operational in nature and have not gained significant traction in government.
Again, much can be learned from industry experience. Industry is often better able to determine both the value and costs (including very real security and regulatory risks) associated with collecting and storing data, and it is continually honing exactly what data should and should not be collected and stored. Risks include privacy, intellectual property, and contractual violations, or citizen discomfort with surveillance or creepy recommendations (Spector et al., 2022). As governments collect more fine-grained, privacy-sensitive data, they too will increasingly need to attend to these considerations. Indeed, the value of earlier government data-sharing mandates has been questioned, at least partly because they both rely on manual processes for reporting and create the wrong incentives (Peled, 2011, 2013). Clarifying benefits and costs would result in better data generation and publication decisions (Karpoff, 2021; Kerr, 1975; Paine & Srinivasan, 2019). In the corporate sector, this usually requires identifying a clearly defined purpose and clearly defined measures of success.
The Democratizing Data project was initiated by the National Center for Science and Engineering Statistics (NCSES) in conjunction with three other statistical agencies: the National Center for Education Statistics (NCES), the Economic Research Service (ERS), and the National Agricultural Statistics Service (NASS). This pilot project was intended to identify ways for each agency to comply with the Evidence Act and to both understand and enhance the value of agency data assets.
To their credit, agencies wanted to know how much their data sets were used, and how that usage had changed over time. They also wanted to have information about their portfolio: what topics were the data sets being used to study, and how were those topics aligned with the agencies’ mission. Finally, for each data set, they wanted more details on its use—the researchers, institutions, and geographic diversity—so they could have some sense of gaps and opportunities in their specific investments. Unfortunately, despite Section 202 of the Evidence Act requiring agencies to do these (or very related) things, they had no systematic way and no funding to do so. Consequently, efforts to comply have often floundered.
To help achieve some of these goals, the pilot project built on almost a decade’s effort to automate the search for and discovery of data set references or mentions using machine learning. Briefly, the starting point was to see how the data were used in scientific publications (Lane et al., 2022) by applying AI tools to one of the largest curated corpora of scientific publications—Scopus (Aksnes & Sivertsen, 2019; Martín-Martín et al., 2021; Singh et al., 2021). Scopus contains information about authors, topics, citations, institutions, and regions, so once a data set is found in a publication, the Scopus metadata could serve as a basis for understanding who is using publicly funded data, for what topics, and in what geographic and institutional locations (Figure 1).
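To make the underlying task concrete, the sketch below illustrates the core idea in its simplest form: match known data set names and aliases against publication text, then attach the publication's metadata to each match. The production system uses the ensemble machine learning models described by Hausen and Azarbonyad (2024) rather than simple string matching, and the aliases, records, and field names shown here are hypothetical stand-ins rather than the platform's actual schema.

```python
# A minimal, illustrative sketch of data set mention detection.
# The real platform uses ensemble machine learning models (Hausen &
# Azarbonyad, 2024); the aliases, records, and field names here are
# hypothetical stand-ins for Scopus-style publication metadata.
import re

DATASET_ALIASES = {
    "Survey of Earned Doctorates": ["Survey of Earned Doctorates", "SED"],
    "Science and Engineering Indicators": ["Science and Engineering Indicators"],
}

publications = [
    {"title": "STEM career paths",
     "abstract": "We analyze the Survey of Earned Doctorates (SED) ...",
     "authors": ["A. Smith"], "institution": "Example University", "year": 2021},
    {"title": "R&D trends",
     "abstract": "Using Science and Engineering Indicators data ...",
     "authors": ["B. Jones"], "institution": "Example College", "year": 2022},
]

def find_mentions(pub, aliases):
    """Return the data sets whose aliases appear in a publication's text."""
    text = f"{pub['title']} {pub['abstract']}"
    found = set()
    for dataset, names in aliases.items():
        if any(re.search(rf"\b{re.escape(name)}\b", text) for name in names):
            found.add(dataset)
    return sorted(found)

for pub in publications:
    for dataset in find_mentions(pub, DATASET_ALIASES):
        # Once a mention is found, the publication's metadata (authors,
        # institution, year) can be attached to the data set's usage record.
        print(dataset, "->", pub["authors"], pub["institution"], pub["year"])
```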
The broad vision for the pilot has been to provide a concrete example of how a transparent and interconnected data and metadata system could be built (Mulvany, 2024).
These were the more specific objectives:
Validate the results through development of a validation tool, and engagement of the research community through presentations;
Communicate the results both to government senior managers and the broader community by means of an administrative dashboard and a website with usage dashboards, Jupyter Notebooks, and an application programming interface (API); and
Expand interest via presentations to stakeholders.
The full workflow and details are available in other articles in this special issue (Emecz et al., 2024; Hausen & Azarbonyad, 2024; Zdawcyk et al., 2024).
The resulting measures can be used to provide initial documentation of data usage for each agency. In the case of the National Center for Science and Engineering Statistics (NCSES), the model was applied to 14 data sets and a number of their high-profile reports. The results are graphically shown in Figure 2.
Clicking on any one of those elements calls up the articles and the journals they appear in. The value for federal agencies is not only that they can immediately and with minimal burden respond to Title II of the Evidence Act, but also that they have information on how their data are being used—or not used. They can characterize which geographic areas and types of institutions are making use of the data through a diversity, equity, and inclusion lens. They can also identify their market penetration on topics in their mission area, and find complementarities with other federal, state, and local agencies, at both the programmatic and statistical levels. As more corpora get included, like the agricultural extension reports identified by Pallotta et al. (2024) in this special issue, the reach to the broader public—again an Evidence Act mandate—will be facilitated.
The tally shows that the data assets were mentioned in over 3,500 publications and 1,400 journals by over 9,000 authors from over 2,700 institutions. Staff at NCSES can click on any data set or on a topic and instantly see how their data are used, and for what topic.
The results identified previously unknown uses of NCSES data assets—for example, the data asset Science and Engineering Indicators was mentioned in almost 1,700 publications, which were in turn cited more than 12,000 times. Separately it was possible to find that there were over 4,500 mentions of those NCSES publications in nontraditional outlets, such as Wikipedia, social media, and news outlets.
The platform thus provides a first step to informing agencies about the utility of their data, as initially measured by data mentions in scientific publications. As noted in the ACDEB report, the new contribution is that statistical agencies can measure the utility of increased access to their data (in terms of mission aligned research) against the risk.
The usage information can also inform decisions about what should be retained, and how resources could be reallocated to other agency priorities. For example, other sources of data might be identified that could produce the same information at a lower cost or allow for the reduction of redundancy in data collection. It also allows for a data producer to understand when data sources are not of high value for the entire population, but very important and impactful to a small group. A broad-based community might argue that new data resources would have greater value than those with narrow usage. The key to making decisions about how to allocate data resources would be to use the evidence about data use to make the decisions openly and transparently, consistent with the Evidence Act.
It is worth noting that the usage measures are only a first approximation. They are available on the Democratizing Data website, the NCSES usage dashboard, and explained in a user guide. Indeed, even though simple counts and citations are provided in the dashboard, communities can be invited to develop their own measures through both Jupyter Notebooks and the API.
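As an illustration of what a community-defined measure might look like, the sketch below computes a simple usage index from a flat table of data set mentions, weighting publication counts, citations, and institutional reach. The column names, sample rows, and weights are assumptions for illustration only; they are not the platform's data model or an official metric.

```python
# A sketch of a community-defined usage measure, assuming a flat table of
# data set mentions like those exposed through the platform's notebooks.
# The column names and the weighting scheme are illustrative, not official.
import pandas as pd

mentions = pd.DataFrame([
    {"dataset": "Science and Engineering Indicators", "publication_id": "p1",
     "citations": 40, "institution": "Example University", "year": 2021},
    {"dataset": "Science and Engineering Indicators", "publication_id": "p2",
     "citations": 5, "institution": "Example College", "year": 2022},
    {"dataset": "Survey of Earned Doctorates", "publication_id": "p3",
     "citations": 12, "institution": "Example University", "year": 2022},
])

usage = (
    mentions.groupby("dataset")
    .agg(publications=("publication_id", "nunique"),
         total_citations=("citations", "sum"),
         institutions=("institution", "nunique"))
    # Illustrative weighting: publications and institutional reach count
    # fully, citations are down-weighted.
    .assign(usage_index=lambda d: d.publications
                                  + 0.1 * d.total_citations
                                  + d.institutions)
    .sort_values("usage_index", ascending=False)
)
print(usage)
```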
In sum, the platform could help agencies increase researcher access and the usage of data for evidence as required by the Evidence Act and enable agencies and the communities they serve to directly respond to the requirements for accountability, transparency, and security (Advisory Committee on Data for Evidence Building, 2022; Hand, 2018; Hand et al., 2018). The platform can also help build public trust about how agencies combine and use data—especially for underrepresented and vulnerable groups (Chang et al., 2022). Finally, the platform could also create incentives to encourage government integrity.
It has often been said that there are two connected units of academic currency: publications and grants. The platform provides an opportunity to create a third unit of academic currency: data usage and value.
Figure 3 provides another view by providing information about authors and institutions. Here, clicking on a particular data set in the upper left panel will identify the top authors using that particular data set (sorted by citations), as well as their institutions, states, and countries. The tool also allows one to identify the authors using the agency data to study a given topic (and, again, their institutions, states, and countries). The same filter will be applied after clicking on a particular year or set of years.
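For readers who prefer code to dashboards, the sketch below shows the kind of query that sits behind this view: filter the mention records to a single data set (and, optionally, a range of years) and rank authors by total citations. The table layout and sample rows are assumptions for illustration rather than the platform's actual schema.

```python
# A sketch of the query behind the author view: filter mentions to one
# data set and an optional year range, then rank authors by citations.
# The table layout is an assumption, not the platform's actual schema.
import pandas as pd

mentions = pd.DataFrame([
    {"dataset": "Survey of Earned Doctorates", "author": "A. Smith",
     "institution": "Example University", "year": 2021, "citations": 30},
    {"dataset": "Survey of Earned Doctorates", "author": "B. Jones",
     "institution": "Example College", "year": 2022, "citations": 12},
    {"dataset": "Survey of Earned Doctorates", "author": "A. Smith",
     "institution": "Example University", "year": 2022, "citations": 8},
])

def top_authors(df, dataset, start_year=None, end_year=None):
    """Rank authors of publications mentioning `dataset` by total citations."""
    subset = df[df["dataset"] == dataset]
    if start_year is not None:
        subset = subset[subset["year"] >= start_year]
    if end_year is not None:
        subset = subset[subset["year"] <= end_year]
    return (subset.groupby(["author", "institution"])["citations"]
            .sum()
            .sort_values(ascending=False))

print(top_authors(mentions, "Survey of Earned Doctorates", start_year=2021))
```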
This kind of information creates incentives for researchers to not only create and share data but to share knowledge about how the data can be used and valued. Researchers who are highlighted as experts (and the institutions with which they are affiliated) get credit for their data expertise, just as they currently get credit for their publications and software. It also creates incentives for publishers—they can provide a data expert recommender function to journal editors and program managers for science agencies. This not only reduces the burden on associate editors to find appropriate experts but also increases the likelihood that errors in data use can be identified in the review process. This sharing requirement could create new norms in scientific data publishing and raise the floor on the overall quality of empirical research. In essence, requiring the sharing of data is not enough of an incentive; real transformation will only occur when research communities develop robust crediting tools and then apply those tools for rewarding data reuse and repurposing.
Much more can be done, of course. As an example, a conference was held in October 2021 to bring together representatives from some of the relevant communities to discuss the machine learning models that form the basis of the current Democratizing Data platform (Lane et al., 2022; Potok, 2022), which resulted from the “Show US the Data” Kaggle competition (Lane & Stebbins, 2021).
In the conference, representatives from the research community agreed that automated tools could provide incentives for researchers to more carefully structure and curate their data as well as facilitate reuse and repurposing. They pointed out the value for usage metrics in creating replicable research (Appendix A). Representatives from academic institutions made it clear that institutions want to understand usage and improve discovery and access from their data repositories. Usage statistics could also help identify gaps in which data were being underutilized (see Appendix B). Agency chief data officers agreed that it was critical for automated tools to establish impact, to help agencies address their challenges (see Appendix C). Publishers identified the importance of a sustainable infrastructure with a central location for access. Publishers also noted the value of serving the academic community (Appendix D).
In other words, the simple usage metrics presented in the platform should serve as the beginning of a measurement discussion and the grist for incorporating data reuse and repurposing metrics in assessing academic contributions to a field. In general, for the incentive structure to work and be accepted, it will be vital for researchers and analysts to be able to validate and continuously update the output, and for promotion, tenure, and hiring committees as well as grant makers to consider the value of such metrics. Another article in this special issue describes a user engagement approach in much more detail (Chenarides, 2024).
In addition to conferences, other approaches were used in the Democratizing Data platform to develop community-driven measures. The most successful was through access to SciServer, a science platform built and supported by the Institute for Data Intensive Engineering and Science at Johns Hopkins University. The data model underpinning the usage metrics is available through Jupyter Notebooks on SciServer with the goal of encouraging community input. SciServer has been useful in similar domains; for example, Galaxy Zoo, a citizen science project, resulted in reliable classifications of hundreds of thousands of galaxy images—more than 40 million classifications were made in approximately 175 days by more than 100,000 volunteers (Lintott et al., 2011). In addition, data are accessible through an API so that developers or technical experts can produce their own visualizations and metrics.
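A developer wishing to build a custom visualization might start from something like the sketch below, which pulls mention records over HTTP and plots mentions per year. The base URL, endpoint path, parameters, and response shape are placeholders; the platform's actual API documentation should be consulted for the real interface.

```python
# A sketch of pulling usage records over HTTP and building a simple chart.
# The endpoint path, parameters, and response shape are assumptions for
# illustration only; consult the platform's API documentation for the
# actual interface.
import requests
import matplotlib.pyplot as plt

BASE_URL = "https://example.org/api"  # placeholder, not the real host

def fetch_mentions(dataset_name):
    """Fetch publication mentions for a data set (hypothetical endpoint)."""
    resp = requests.get(f"{BASE_URL}/mentions",
                        params={"dataset": dataset_name}, timeout=30)
    resp.raise_for_status()
    return resp.json()  # assumed: a list of {"year": ..., "citations": ...}

records = fetch_mentions("Science and Engineering Indicators")
by_year = {}
for rec in records:
    by_year[rec["year"]] = by_year.get(rec["year"], 0) + 1

years = sorted(by_year)
plt.bar(years, [by_year[y] for y in years])
plt.xlabel("Publication year")
plt.ylabel("Mentions")
plt.title("Mentions per year (illustrative)")
plt.show()
```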
Much more can be done to develop both engagement and measurement strategies.
A proactive approach to engagement could be informed by the way in which the private sector has succeeded by creating trust and value in such online platforms as Airbnb, Yelp, and Tripadvisor (Holikatti et al., 2019; Reinhold & Dolnicar, 2018). Leaderboards based on aggregate metrics have also proven successful in creating collaborative knowledge-building systems. For example, Google’s Kaggle competitions reward participants who contribute in four domains—competitions, data sets, notebooks, and discussions (Cheng & Zachry, 2020). The enormous engagement of computer scientists in contributing to Jupyter Notebooks was driven by a well-designed architecture and an enthusiastic user base (Granger & Pérez, 2021; Perkel, 2018).
Much can also be done in the measurement sphere. Next steps should include ways to i) extend what is measured, ii) address the complexity of data linkage, iii) extend coverage efforts, and iv) thoughtfully address possible measurement failures.
The initial focus of the project has been on measuring data set usage. This is being extended to include data assets, such as analytical reports and briefs produced by government agencies. The work could be extended further to include the data products derived from public data sets—such as the monthly unemployment rate.
The project has also focused on single data sets. However, as noted by both the Commission on Evidence-Based Policymaking and the ACDEB, many high-value data sets are constructed by linking data from multiple agencies. Tracing the value of those linked data sets back to contributing agencies would be an area for additional research.
It is notable that the current measures are limited by the coverage of Scopus. There are many other important sources that should be included: an illustrative but not exhaustive list would include the grey literature, reports targeted at direct users like agricultural extension program reports (Pallotta et al., 2024), newspaper reports, and social media. Future research could include ways for the public and researchers to contribute corpora that document the usage of public data sets.
Finally, there are a number of ways in which metrics can fail. One is codified in Goodhart’s Law, which states that, “When a measure becomes a target, it ceases to be a good measure” (Manheim & Garrabrant, 2018). The Matthew effect, which addresses errors in recognizing the individuals most responsible for creating scientific advances (Merton, 1968), also has bearing on the difficulty of evaluating data sets, as either the reputation of a publisher or inflated (but nonindicative) usage metrics can distort measurements of their value.
We do not minimize the complexity of doing quality measurement of enduring value. But, with considerable effort, we believe it is possible to minimize distortions. For example, we can learn from the endless battle with publishers who aggressively game search engine algorithms to promote their content. As examples:
To counter Goodhart’s Law, search engines change the data they gather and change its interpretation. The government could periodically change metrics and alter their interpretation.
Search engines attempt to collect data that are less gameable, for example, ‘long clicks’ that indicate a user has significantly engaged with a returned page. The government could strive to directly measure data use, aspirationally by recording contributions to policy and other valuable endpoints.
Search engines increasingly use AI to detect tell-tale problems and locate true value (Google Search Central Blog, 2022). The government could use similar approaches to look for patterns of abuse or of impact.
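As a small illustration of this last point, one tell-tale pattern an agency might screen for is a data set whose apparent usage is dominated by a handful of authors. The sketch below flags such concentration; the threshold and field names are assumptions for illustration, not an operational policy.

```python
# An illustrative check for one tell-tale gaming pattern: a data set whose
# mention count is dominated by a single author. The threshold and field
# names are assumptions, not an operational policy.
from collections import Counter

def concentration_flag(mentions, threshold=0.5):
    """Flag data sets where one author accounts for more than `threshold`
    of all mentions."""
    by_dataset = {}
    for m in mentions:
        by_dataset.setdefault(m["dataset"], []).append(m["author"])
    flags = {}
    for dataset, authors in by_dataset.items():
        top_share = Counter(authors).most_common(1)[0][1] / len(authors)
        flags[dataset] = top_share > threshold
    return flags

sample = [
    {"dataset": "Survey X", "author": "A"},
    {"dataset": "Survey X", "author": "A"},
    {"dataset": "Survey X", "author": "B"},
    {"dataset": "Survey Y", "author": "C"},
    {"dataset": "Survey Y", "author": "D"},
]
print(concentration_flag(sample))  # {'Survey X': True, 'Survey Y': False}
```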
More generally, using data to motivate the proper production and use of data is a data science problem requiring all the care and considerations that such problems need (Spector et al., 2022). Among other things, a broader research agenda would be to structure the incentives in such a way as to create a closer overlap between the goals of researchers and those of agencies.
A new framework for public data—including both government fiat and appropriate incentives—is greatly needed to ensure that we can extract as much value as possible for the public good. That requires data on the use of data. The results of government fiat have been mixed over the past decade. As noted in the Context section, in 2013, John Holdren, the Director of the White House Office of Science and Technology Policy, issued a directive (the ‘Holdren memo’) to federal agencies to require data management plans of all researchers applying for federal research funding.
It also required that each federal agency subject to the directive develop a strategy for improving the public’s ability to locate and access the resulting digital data. While agencies largely complied with the letter of the directive, few embraced the spirit or subsequent calls for making data produced with public dollars readily available. As a result, the Biden administration issued a similar directive in 2022 (the ‘Nelson’ memo, named after acting OSTP director Alondra Nelson) that now requires agencies to develop plans to require the data underlying the conclusions of scientific research papers to be made freely available at the time of publication.
Internal conversations with agencies make it clear that implementation is going to be messy and complicated. Indeed, new policies that mobilize data through access will further stress the system as demands for a more thorough review of research data grow. There have been a number of discussions on how to handle this aspect of opening access to research data, but those conversations and plans should begin now.
This article proposes taking advantage of the Evidence Act mandate to inform those discussions. It proposes a practical, incentive-based approach that integrates policies, infrastructure, and tools across federal agencies to better serve the scientific enterprise and public. It complements NIST’s Research Data Framework (Hanisch et al., 2023), and the combined work could set the stage for the federal government to invest in a federated research data commons that has the right set of incentives in place.
There is even potential for the Democratizing Data approach to improve data quality in other areas such as AI. For example, the final report of the NAIRR Task Force noted that the quality of many AI models depends on high-quality training and test data. It recommended a community-driven approach to the curation of AI data, models, tools, and workflows in an AI data commons, facilitated by a search and discovery platform that documented data use (Office of Science and Technology Policy, 2023).
While there are certainly many possible improvements to the technology described in this special issue, it does represent a successful approach that could be calibrated to create an ‘invisible hand’ for data production and use, just as Adam Smith, in 1776, argued for incentives for the brewer, baker, and butcher.
We are very grateful to Stuart Feldman for his insightful comments, and the very helpful suggestions of two anonymous referees.
The authors acknowledge the support of the Patrick J. McGovern Foundation; the National Center for Science and Engineering Statistics (NCSES) of the U.S. National Science Foundation (NSF); and the Economic Research Service (ERS) and the National Agricultural Statistics Service (NASS) of the U.S. Department of Agriculture (USDA).
Advisory Committee on Data for Evidence Building. (2022). Advisory Committee on Data for Evidence Building: Year 2 report. U.S. Bureau of Economic Analysis. Retrieved from https://www.bea.gov/system/files/2022-10/acdeb-year-2-report.pdf
Aksnes, D. W., & Sivertsen, G. (2019). A criteria-based assessment of the coverage of Scopus and Web of Science. Journal of Data and Information Science, 4(1), 1–21. https://doi.org/10.2478/jdis-2019-0001
American Statistical Association. (n.d.). Principal statistical agencies: Rising to the challenge by working together. Priorities for the 117th Congress and 2021–2025 administration. https://www.amstat.org/asa/files/pdfs/POL-Principal_Statistical_Agencies_Priorities2021plus.pdf
Brynjolfsson, E., & McElheran, K. (2016). The rapid adoption of data-driven decision-making. American Economic Review, 106(5), 133–139. https://doi.org/10.1257/aer.p20161016
Chang, W.-Y., Garner, M., Owen-Smith, J., & Weinberg, B. (2022). A linked data mosaic for policy-relevant research on science and innovation: Value, transparency, rigor, and community. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.1e23fb3f
Chenarides, L. (2024). Engaging users in data democratization. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.24c26aa9
Cheng, R., & Zachry, M. (2020). Building community knowledge in online competitions: Motivation, practices and challenges. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW2), Article 179. https://doi.org/10.1145/3415250
CHIPS Act of 2022, Pub L. No. 117–167, 136 Stat. 1366 (2022). https://www.congress.gov/117/plaws/publ167/PLAW-117publ167.pdf
Democratizing Data Project. (2024). Democratizing Data Platform. https://democratizingdata.ai
Emecz, A., Mitschang, A., Zdawcyk, C., & Dahan, M. (2024). Building the process workflow and implementation infrastructure. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.d8a3742f
Executive Order No. 13642. (2013). Executive Order No. 13642 (2013). Making open and machine readable the new default for government information. https://obamawhitehouse.archives.gov/the-press-office/2013/05/09/executive-order-making-open-and-machine-readable-new-default-government-
Farboodi, M., & Veldkamp, L. (2023). Data and markets. Annual Review of Economics, 15, 23-40. https://doi.org/10.1146/annurev-economics-082322-023244
Foundations for Evidence-Based Policymaking Act of 2018, Pub. L. No. 115-435, 132 Stat. 5529 (2018).
Freedman, L. P., Cockburn, I. M., & Simcoe, T. S. (2015). The economics of reproducibility in preclinical research. PLOS Biology, 13(6), Article e1002165. https://doi.org/10.1371/journal.pbio.1002165
Galloway, S. (2018). The four: The hidden DNA of Amazon, Apple, Facebook, and Google. Penguin.
Google Search Central Blog. (2022). How we fought Search spam on Google in 2021. https://developers.google.com/search/blog/2022/04/webspam-report-2021
Granger, B. E., & Pérez, F. (2021). Jupyter: Thinking and storytelling with code and data. Computing in Science & Engineering, 23(2), 7—14. https://doi.org/10.1109/MCSE.2021.3059263
Hand, D. J. (2018). Aspects of data ethics in a changing world: Where are we now? Big Data, 6(3), 176–190. https://doi.org/10.1089/big.2018.0083
Hand, D. J., Babb, P., Zhang, L.-C., Allin, P., Wallgren, A., Wallgren, B., Blunt, G., Garrett, A., Murtagh, F., & Smith, P. W. (2018). Statistical challenges of administrative and transaction data. Journal of the Royal Statistical Society: Series A (Statistics in Society), 181(3), 555–605. https://doi.org/10.1111/rssa.12315
Hanisch, R., Kaiser, D. L., Yuan, A., Medina-Smith, A., Carroll, B. C., & Campo, E. (2023). NIST Research Data Framework (RDaF): Version 1.5. NIST, U.S. Department of Commerce. https://www.nist.gov/publications/nist-research-data-framework-rdaf-version-15
Hanisch, R. J., Kaiser, D. L., Carroll, B. C., Higgins, C., Killgore, J., Poster, D., & Merritt, M. (2021). Research Data Framework (RDaF): Motivation, development, and a preliminary framework core. NIST Special Publication 1500-18. U.S. Department of Commerce. https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.1500-18.pdf
Hausen, R., & Azarbonyad, H. (2024). Finding the data: An ensemble approach to uncovering the prevalence of government-funded datasets. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.18df5545
Holdren, J. P. (2013). Increasing access to the results of federally funded scientific research [Memo]. White House Office of Science and Technology Policy, Executive Office of the President of the United States. http://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
Holikatti, M., Jhaver, S., & Kumar, N. (2019). Learning to Airbnb by engaging in online communities of practice. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW), Article 228. https://doi.org/10.1145/3359330
Ioannidis, J. P. (2005). Why most published research findings are false. PLOS Medicine, 2(8), Article e124. https://doi.org/10.1371/journal.pmed.0020124
Karpoff, J. M. (2021). On a stakeholder model of corporate governance. Financial Management, 50(2), 321–343. https://dx.doi.org/10.2139/ssrn.3642906
Kerr, S. (1975). On the folly of rewarding A, while hoping for B. Academy of Management Journal, 18(4), 769–783.
Lane, J. (2021). Democratizing our data: A manifesto. MIT Press.
Lane, J. (2023). Reimagining labor market information: A national collaborative for local workforce information. https://www.aei.org/research-products/report/reimagining-labor-market-information-a-national-collaborative-for-local-workforce-information/
Lane, J., Gimeno, E., Levistkaya, E., Zhang, Z., & Zigoni, A. (2022). Data Inventories for the Modern Age? Using Data Science to Open Government Data. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.8a3f2336
Lane, J., & Stebbins, M. (2021). Show US the data. Kaggle. https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data
Lintott, C., Schawinski, K., Bamford, S., Slosar, A., Land, K., Thomas, D., Edmondson, E., Masters, K., Nichol, R. C., & Raddick, M. J. (2011). Galaxy Zoo 1: Data release of morphological classifications for nearly 900 000 galaxies. Monthly Notices of the Royal Astronomical Society, 410(1), 166–178. https://ui.adsabs.harvard.edu/link_gateway/2011MNRAS.410..166L/doi:10.1111/j.1365-2966.2010.17432.x
Manheim, D., & Garrabrant, S. (2018). Categorizing variants of Goodhart's Law. ArXiv. https://doi.org/10.48550/arXiv.1803.04585
Martín-Martín, A., Thelwall, M., Orduna-Malea, E., & Delgado López-Cózar, E. (2021). Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: A multidisciplinary comparison of coverage via citations. Scientometrics, 126(1), 871–906. https://doi.org/10.1007/s11192-020-03690-4
Merton, R. K. (1968). The Matthew effect in science: The reward and communication systems of science are considered. Science, 159(3810), 56–63. https://doi.org/10.1126/science.159.3810.56
Mulvany, I. (2024). A publishing perspective on the power of data. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.9e9ae62b
Munafò, M. R., Nosek, B. A., Bishop, D. V., Button, K. S., Chambers, C. D., Percie du Sert, N., Simonsohn, U., Wagenmakers, E.-J., Ware, J. J., & Ioannidis, J. (2017). A manifesto for reproducible science. Nature Human Behaviour, 1 (1), Article 0021. https://doi.org/10.1038/s41562-016-0021
Nelson, A. (2022). Ensuring free, immediate, and equitable access to federally funded research [Memo]. White House Office of Science and Technology Policy, Executive Office of the President of the United States. https://www.whitehouse.gov/wp-content/uploads/2022/08/08-2022-OSTP-Public-access-Memo.pdf
Nelson, B. (2009). Data sharing: Empty archives. Nature, 461(7261), 160–163. https://doi.org/10.1038/461160a
Norwood, J. L. (1995). Organizing to count: Change in the federal statistical system. The Urban Institute. https://webarchive.urban.org/publications/205979.html
Office of Science and Technology Policy. (2023). Strengthening and democratizing the U.S. Artificial Intelligence Innovation Ecosystem: An Implementation Plan for a National Artificial Intelligence Research Resource. https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf
Paine, L. S., & Srinivasan, S. (2019, October 14). A guide to the big ideas and debates in corporate governance. Harvard Business Review. https://hbr.org/2019/10/a-guide-to-the-big-ideas-and-debates-in-corporate-governance
Pallotta, N., Lane, J., Locklear, M., Ren, X., Robila, V., & Alaeddini, A. (2024). Searching for data assets in non-traditional publications: Discovering use and identifying new opportunities. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.77bfa1c9
Peled, A. (2011). When transparency and collaboration collide: The USA open data program. Journal of the American Society for Information Science and Technology, 62(11), 2085–2094. https://doi.org/10.1002/asi.21622
Peled, A. (2013). Re-Designing Open Data 2.0. JeDEM-eJournal of eDemocracy and Open Government, 5(2), 187–199. https://doi.org/10.29379/jedem.v5i2.219
Perkel, J. M. (2018). By Jupyter, it all makes sense. Nature, 563(7729), 145–146. https://doi.org/10.1038/d41586-018-07196-1
Potok, N. (2022). Show US the data. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.9d13ba15
Redman, T. C. (2018, April 2). If your data is bad, your machine learning tools are useless. Harvard Business Review. https://hbr.org/2018/04/if-your-data-is-bad-your-machine-learning-tools-are-useless
Reinhold, S., & Dolnicar, S. (2018). How Airbnb creates value. In S. Dolnicar (Ed.), Peer-to-peer accommodation networks (pp. 39–53). Goodfellow. https://doi.org/10.23912/9781911396512-3602
Reynolds, J. P., Stautz, K., Pilling, M., van der Linden, S., & Marteau, T. M. (2020). Communicating the effectiveness and ineffectiveness of government policies and their impact on public support: A systematic review with meta-analysis. Royal Society Open Science, 7(1), Article 190522. https://doi.org/10.1098/rsos.190522
Romer, P., & Lane, J. (2022). Interview with Paul Romer. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.cc0da717
Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P., & Aroyo, L. M. (2021). “Everyone wants to do the model work, not the data work”: Data cascades in high-stakes AI. In Y. Kitamura et al. (Eds.), Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems (Article 39). ACM. https://doi.org/10.1145/3411764.3445518
Singh, V. K., Singh, P., Karmakar, M., Leta, J., & Mayr, P. (2021). The journal coverage of Web of Science, Scopus and Dimensions: A comparative analysis. Scientometrics, 126, 5113–5142. https://doi.org/10.1007/s11192-021-03948-5
Smith, A. (1776). An Inquiry into the Nature and Causes of the Wealth of Nations. Aegitas.
Spector, A. Z., Norvig, P., Wiggins, C., & Wing, J. M. (2022). Data science in context: Foundations, challenges, opportunities. Cambridge University Press.
Tambe, P., Hitt, L., Rock, D., & Brynjolfsson, E. (2020). Digital capital and superstar firms. NBER Working Paper No. w28285. National Bureau of Economic Research. https://doi.org/10.3386/w28285
The Commission on Evidence-Based Policymaking. (2017). The promise of evidence-based policymaking. https://www2.census.gov/adrm/fesac/2017-12-15/Abraham-CEP-final-report.pdf
U. S. Government Accountability Office. (2023). Evidence-based policymaking: Practices to help manage and assess the results of federal efforts. (GAO-23-105460). https://www.gao.gov/products/gao-23-105460
Watson, C. (2022). Many researchers say they’ll share data—But don’t. Nature, 606(7916), 853. https://doi.org/10.1038/d41586-022-01692-1
Zdawcyk, C., Lane, J., Rivers, E., & Aydin, M. (2024). Searching for how data have been used: Intuitive labels for data search and discovery. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.f1cbbfbb
The Coleridge Initiative convened a panel of experts representing the perspective of researchers to get their input about the results of the “Show US the Data” challenge before the October 20, 2021, conference. The goal of the panel was to gain understanding about a new machine learning approach to identifying public uses of agency data, identify strengths and weaknesses of the approach, discuss how researchers would draw on this usage information captured by live data streams, and suggest ways to incorporate feedback from the public on both the usage documentation and on the data sets. The panel discussion was summarized and incorporated in the October conference.
The participants were asked a series of structured questions, including: (1) how they might use the tools to advance their research; (2) how the tools might advance the work of junior researchers; (3) how the tools might inspire researchers to do their work differently; and (4) how the researcher community might become engaged in this effort. Some key highlights are summarized below.
Providing the right incentives for researchers facilitates success and encourages use and feedback to improve the system. The tools can ensure that the burden is not all on the researchers to provide their publication data. Rather, a positive feedback loop could be created by researchers having their citations and publications included, getting people to advertise their work, giving seminars, and sharing their data and best practices in terms of citing data, so that their work can get acknowledged. This also could lead to improvements such as a uniform citation for a data set.
The tools allow researchers to make connections between what data sets are being used and for what purposes—allowing researchers to build on what has already been done. Making connections also highlights which data sets may be underused. The tools can also foster partnerships between academic researchers and government agencies that have data the researchers are using. Those two-way relationships can also help improve the data sets’ accuracy and usability.
The tools foster community and mentorship, helping junior researchers use data to knit together people and research to impact their work. Junior researchers could discover new data sets and ways to use existing data sets and gain visibility if their research is represented in the database.
An interactive partnership approach between agencies and the research community can help agencies prioritize by seeing how data are being used by others. Agencies can use the buy-in of researchers—documented as high use of certain data sets—to demonstrate the importance of those data sets to Congress, call attention to underutilized data, and make investments in data improvement.
Several ideas were put forth to encourage researcher community engagement:
Researchers could be incentivized by providing curation tools—for example, finding related data sets by joining data sets and cleaning up the data. Agencies could provide links to tools for cleaning and linking the data. Some researchers may not know how much data are available to them from agencies. Some younger researchers find this out only by asking more senior researchers.
Access to gray literature (research that has not been published in a peer-reviewed journal but is available in libraries of universities and elsewhere) could be incredibly valuable, creating communities around working papers and even avoiding publication bias.
The tools offer further opportunity for development. For example, information regarding authors (names, email addresses, etc.) could be harvested, or a citation index could be drawn on to find collaborators, allowing authors and other researchers in the field to build up a network to share information about certain metadata and otherwise clarify uncertainties, fill gaps, and improve overall use. Researchers could automatically share information about the quality of a data set.
The panel identified the key next actions, as follows:
The project should stay focused on the value-add and allow for exciting developments.
Researchers should be able to see how easy the tool is to use and immediately see the value. Engaging high-profile users—research ‘influencers’—could also be a great way, among others, to set a trend.
The Academic Institutions workshop was held in conjunction with three other workshops (Chief Data Officers, Researchers, and Publishers) to answer questions and gather input to feed into the Coleridge Initiative Show US the Data Conference on October 20, 2021. Structured questions were asked to get feedback on what the academic institution stakeholder community thinks about Machine Learning/Natural Language Processing.
The participants discussed several issues and brought up the key points below:
Several benefits for researchers at institutions included improved discovery of what data exist and are available, better access to data, and opportunities for collaboration, especially across disciplines. More use of the data would also create motivation to improve the metadata, for example, developing and conforming to metadata and citation standards and making sure data are complete. This would also help improve existing governance structures and help integration across existing infrastructures.
Institutions want to understand usage and improve discovery and access from their data repositories. Institutions also use a lot of state and other data, so there could be wider applications beyond federal data. The application could also help identify gaps where data were being underutilized. In addition, data preservation policies could draw on usage data to support decisions; for example, a librarian could check after a period of time to see whether data have been used and, if no one has used them, archive the data or stop maintaining them. The cumulative costs of maintaining repositories are going to be important and will be affected by the use case of determining which data sets should be kept and for how long.
Some land grant universities have close relationships with federal agencies such as the United States Department of Agriculture (USDA). It could be helpful to consider pilot projects that build on these relationships.
Other discussion points included:
Concerns. Participants expressed a desire for data beyond those in scientific publications (‘gray’ literature, other media), for assurance of the accuracy of data included in the dashboards, and for a mechanism for feedback on what agencies are doing with public comments and suggestions on improving the data. Privacy concerns were raised about the ability of competing institutions and researchers to view the data set details that an institution is using, particularly prepublication. Questions arose about who would run such a service and be responsible for protecting privacy, uncovering potential bias, and ensuring usability.
A central host was generally favored, but institutions also wanted to be able to host specific search and display capabilities, particularly if ‘gray’ literature that is held by an institutional library could be included. Possibilities include institution repositories, research information management systems, or library discovery environments. For example, the University of Michigan has already invested in building out the crosswalks between the institution repository and the research information management system. The usage data should also be available to view at the site where the agency is providing the data.
Value. Data citations could lead to tenure or other salutary job impacts for researchers. In addition, consortium approaches often work if incentives are created to benefit individual institutions and the group as a whole. Helping to rationalize the current system would also provide value, as there are competing tools and it is unclear which data are where, in what format, and in what detail. If agencies provided standard citation information for the data sets, that would be helpful.
On September 21, 2021, the Coleridge Initiative convened a panel of chief data officers (CDOs) to request their input about the results of the “Show US the Data” competition before the October 20 workshop. The goal of the CDO expert pre-session was to develop a point of view from representative CDOs regarding the applicability of the learnings and tools developed in the “Show US the Data” competition. Since the competition focused on uses of data sets in research, the outcomes were most immediately applicable to agencies with scientific mission components. CDOs from agencies for which discovery activities occurred in the competition were invited to review results in one-on-one sessions and then to attend this panel discussion. The agencies represented in the discussion were Commerce (National Oceanic and Atmospheric Administration), National Science Foundation, USDA, and Transportation. Given that the breadth of data work in an agency may cross many mission teams, some agencies had multiple team members participate in the discussion session.
Specific questions were posed to identify ways that the approach and algorithms might be used to support agency mission activities both near-term and strategically, including (1) how the capabilities might be used; (2) opportunities for near-term use in the agencies; (3) potential obstacles to use; (4) key points of engagement; and (5) proposed next steps.
Details of the session are included below, but a few overarching themes emerged:
These tools can support emerging research themes, connect researchers to previously undiscovered data sets to stimulate new discovery, and provide evidence of citizen benefits. As such, they are more useful for prioritizing resources and work efforts for making public data available for research and public uses than as simply a pathway to achieve compliance with the Open Government Data Act (Title II of the Evidence Act) and other mandates. They can drive broader visibility and transparency about data sets and their uses both within and outside the agencies. Most impactful would be creating communities around connecting and creating meaningful exchanges between the users of the data and those producing and maintaining the data.
One important barrier to use is the agencies’ cultural resistance to change. Other barriers include lack of current workforce skills and competing priorities for resources within agencies.
Building greater visibility and engagement would require a significantly expanded awareness outreach effort that could include boards, special purpose groups, councils, and civic tech organizations.
The discussions about near-term use and next steps coalesced into a common point: identification of specific use cases within federal agencies to sponsor application of the approach and tools, followed by analysis and capture of learnings from each step along the process (priority setting, workforce, barriers, and engagement model).
It was suggested by the participants that the October session include some dialogue about potential use cases so that there might be collective sponsorship and support for the next steps.
The Publisher workshop was held in conjunction with three other workshops (Chief Data Officers, Researchers, and Academic Institutions) to answer questions and gather input to feed into the Coleridge Initiative “Show US the Data” conference on October 20, 2021.
The workshop participants were asked structured questions to get feedback on what the publisher stakeholder community thinks about the potential of the data search and discovery project and its machine learning and natural language processing components. The questions related to: (1) concerns about the machine learning/natural language processing (ML/NLP) approach to capturing data use; (2) additional functionality that would be useful; (3) the value proposition for publishers to participate; (4) how publishers could participate; and (5) where the application should reside and be managed.
The participants raised several points, including:
This initiative needs to be a sustainable infrastructure where there is funding for work that is produced, and there is value in producing a high-quality curated corpus. There should be transparency in any pricing model. The small-to-medium publishers have valuable content and contributions but have a lower level of sophistication, which may impact the rate of adoption.
There should be a central place, such as data.gov, where this information can be accessed. In addition, a publisher dashboard maintained for smaller publishers could be very helpful, so that those publishers could also see how data are being used and cited and identify new services that they could provide.
One of the biggest challenges with reusing and understanding the ongoing value of data sets is how much metadata and context exist around those data. Researchers are also often funded in a way that does not give them access to those government repositories and are left with fewer choices of where to put their data, so the data end up in general repositories like Figshare, Dryad, and so on. General repositories are not very helpful for building on research unless they are able to pull in the required metadata. There is a need for greater incentives for authors to comply with open data policies. If publishers make this more findable and prominent and enable credit as a first-class object, incentives, quality, services, and compliance will increase.
Other discussion points raised included:
Value proposition: Many publishers are investigating services that they might provide in relation to the identification and analysis of data use. Is this a free substitute for something that they would like to provide as a service as part of a publisher’s offerings? What is the value proposition for publishers?
Bias: Having machine learning draw conclusions about how data are being used may not lead to the most accurate insights. How can human interaction be added to the model to improve the results so that accuracy will continue to grow?
Relative importance of two main use cases: (1) a compliance-driven use case—for agencies to show that they are tracking reuse per the mandate; and (2) providing a means to discover data. To what extent has the relative importance of these use cases been established with users?
Risk of using NLP to capture data: in making a publisher’s entire full-text XML corpus available to do the work, how can one ensure the content is used only for this purpose, by a controlled group, and deleted afterwards? This does not relate to concerns about the job itself (publishers do make content available to third parties for indexing, abstracting, etc.).
A link back to the publisher is critical, ultimately building an informal citation network. Can the community develop different visualizations to suit their needs?
Publishers can participate in multiple ways, including allowing indexing services to use their content for this purpose; or running the ML/NLP algorithms internally on their content. Harvesting is currently allowed by some publishers.
A public–private partnership that would be friendly to international users and resilient to changes in U.S. administrations could be considered to run this function, with central access available.
A broad general value proposition is enhancing a publisher’s value (both quantitatively and qualitatively) to the community it is trying to serve. There are also nonfinancial benefits for publishers, such as increased usage and citations. Publishers want to comply with funder goals; a long-term solution is more around formal citation (as mentioned earlier) and a ‘win-win’ value proposition for content and data discovery, with links between the two (more consumers of government data and more consumers of published articles). What are the technical and business challenges in creating a long-term solution? Could a combined effort be established?
There is a need for equity across publishers, and additional enablement may be needed for smaller publishers. Most publishers are doing things in different ways, so thinking in terms of using a broker to provide a degree of standardization may be a good idea.
The participants agreed that it may make sense to start off with a pilot on one or two specific topics.
©2024 Julia Lane, Alfred Spector, and Michael Stebbins. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.