Column Editor’s Note: In this Effective Policy Learning piece, the author, Nancy Potok, a Visiting Fellow at RTI International, describes how the Democratizing Data Search and Discovery Platform is making great strides in helping meet the mandates of the Foundations for Evidence-Based Policymaking Act of 2018 (Evidence Act) and demonstrating the value of federal data. The act encourages federal agencies to improve the way they use data for evidence-based decision-making. It emphasizes the importance of data sharing, data transparency, and data accessibility across government agencies. To discover who is using federal data, the search and discovery platform allows agencies to track metrics through machine learning and language models to gain a deeper understanding of use. This column is an introduction to a deeper dive exploring the collection and value of usage data that will be published in a special HDSR issue, “Democratizing Data: Data as a Public Asset,” planned for spring 2024.
Keywords: evidence, data access, usage, metrics, federal, data set
The Foundations for Evidence-Based Policymaking Act of 2018, also known as the Evidence Act, is a significant piece of legislation passed in the United States and signed into law on January 14, 2019. The Act aims to improve the federal government's use of evidence and data in decision-making processes to enhance the effectiveness and efficiency of government programs and services (Potok, 2019). It also seeks to improve public access to both open and protected government data.
Key provisions and goals of the Evidence Act include:
Establishment of Chief Data Officers (CDOs): Requires each federal agency to designate a Chief Data Officer responsible for managing and utilizing data to support evidence-based policymaking.
Creation of Evidence-Building Plans: Requires agencies to develop and implement evidence-building plans to identify data and evidence needs, assess available data, and use evidence to inform policymaking.
Open Data Initiatives: Promotes the use of open data standards and encourages agencies to make their data more accessible and transparent to the public.
Data Governance Frameworks: Establishes standards for data governance, including data inventory, quality, security, and privacy, to ensure that data is used effectively and responsibly.
Data Access and Use Agreements: Encourages federal agencies to enter into data-sharing agreements, both internally and externally, to facilitate the exchange of data for evidence-building purposes.
Statistical and Evaluation Activities: Supports building the capacity of federal agencies to conduct rigorous program evaluations and statistical analyses that generate evidence for policymaking.
Privacy and Security Protections: Emphasizes the importance of protecting sensitive and personally identifiable information while promoting data sharing and evidence-based policymaking.
Federal Data Centers: Enables the establishment of federal data centers to provide expertise and resources for data analysis and evidence-building activities.
Evaluation and Oversight: Mandates periodic evaluations of evidence-building activities and requires agencies to report their progress to Congress.
The Evidence Act and its subsequent implementation directives issued by the U.S. Office of Management and Budget represent a significant step toward modernizing the federal government's approach to data and evidence, with the goal of improving decision-making, reducing inefficiencies, and increasing transparency. Together, they reflect a broader trend in government toward evidence-based policymaking and the use of data-driven insights to inform policy and program design and evaluation.
In addition, the Federal Data Strategy (Office of Management and Budget [OMB], 2020) and OMB Circular A-130 (Office of Management and Budget [OMB], 2016) include the goals of improving the use of open data produced by the government and making it more easily accessible for both open science and commercial uses. This is also reflected in Title II of the Evidence Act, known as the Open, Public, Electronic and Necessary (OPEN) Government Data Act. Beyond mandating several actions on the part of federal government agencies to advance the open data goals, such as data inventories, interoperability, and other data management actions, Title II, Section 202 (c) mandates that federal agencies, “facilitate collaboration with non-Governmental entities (including businesses), researchers, and the public for the purpose of understanding how data users value and use government data” (2018). It requires that agencies engage the public in using public data assets of the agency and encourage collaboration by publishing on the website of the agency on a regular basis (but not less than annually) information on the usage of such assets by non-Government users. Agencies must also assist the public in expanding the use of public data assets.
There are many reasons why information on data usage is important to agencies (Lane et al., 2022):
Public Engagement and Transparency: Transparency is a key principle of the Evidence Act and open science. Knowing who is using federal data contributes to transparency by allowing the public to understand how government data is being leveraged to inform decision-making.
Improving Data Access and Usability: Agencies can use data usage information to tailor data access and sharing policies to better meet the needs of specific user groups. Agencies can also improve the usability of their data by getting feedback from the user community and providing tools, resources, and training that enable people to work with data effectively. This includes creating user-friendly interfaces, offering data visualization tools, and providing educational resources on data analysis.
Inclusivity: Better policies can be developed when there is an opportunity for input from a diverse range of users, regardless of their technical expertise or background. Ensuring that data is accessible and useful to people from different disciplines, industries, and communities is an area that could benefit from a proactive stance from agencies. With usage data, agencies can identify groups who may be underrepresented in accessing federal data assets, such as researchers from smaller colleges and universities and minority-serving institutions, and can conduct outreach activities to expand access. This also helps agencies connect with new groups of researchers who may be interested in studying topics of particular interest to the agencies. Agencies can also do more to connect with those outside the researcher community, such as state, tribal, and local governments.
Value of Data: Understanding who is using federal data and what topics they are researching helps in assessing the value of the data being collected and disseminated by the government.
Resource Allocation: Data usage information helps in making informed decisions about resource allocation. Agencies can allocate resources more effectively by identifying which data sets are in high demand and ensuring that the most valuable data resources are adequately maintained and made available.
Supporting Research and Innovation: Information on how data are being used and what topics are being studied can help support expanded research and innovation. Agencies and researchers can form collaborative communities around topics of interest and expand participation to broader interest groups, such as community organizations (Potok, 2022).
Accountability and Oversight: Knowing who is using federal data allows for greater accountability and oversight of the implementation of the Evidence Act. It enables agencies, Congress, and the public to track how data is being accessed and used.
Overall, tracking who is using federal data is essential for ensuring that the goals of enhancing evidence-based policymaking, improving government efficiency, and increasing transparency while safeguarding privacy and security can be met. Unfortunately, most government agencies cannot track and understand how their data are being used beyond basic means, such as counting how many data asset downloads occur from their websites, or using basic citation search algorithms, which are woefully incomplete.
However, the Democratizing Data project has developed a key advance: creating metadata that describes who is using the data, what topics they are studying, what institutions they are at, when the research was conducted, and other key information. This new approach uses a search and discovery platform that describes how data assets identified by federal agencies have been used in scientific research. It uses machine learning algorithms to search over 90 million documents and find how data assets are cited, in what publications, and what topics data assets are being used to study. The resulting data is put in a metadata format that can be accessed through interactive dashboards, through an application programming interface (API), and in Jupyter Notebooks.
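To make concrete why basic citation searches undercount use, the following is a minimal sketch that contrasts verbatim title matching with matching on a short list of name variants. It is not the project's machine learning method. The alias list, the sample sentences, and the function names are all invented for illustration; only the data asset title is real.

```python
import re

# Illustrative only: the title is a real NASS data asset, but the alias list
# and the sample sentences below are invented for this sketch.
TITLE = "Census of Agriculture"
ALIASES = ["Census of Agriculture", "Agricultural Census", "Ag Census"]

DOCS = [
    "We use the Census of Agriculture to estimate farm size.",
    "County yields come from the 2017 agricultural census.",
    "Estimates rely on Ag Census microdata.",
    "This paper studies urban transit.",
]


def mentions_exact(text, title=TITLE):
    """True if the official title appears verbatim (case-sensitive)."""
    return title in text


def mentions_alias(text, aliases=ALIASES):
    """True if any alias appears as a whole phrase, ignoring case."""
    return any(
        re.search(r"\b" + re.escape(alias) + r"\b", text, flags=re.IGNORECASE)
        for alias in aliases
    )


# Number of sample documents flagged by each approach.
exact_count = sum(mentions_exact(doc) for doc in DOCS)
alias_count = sum(mentions_alias(doc) for doc in DOCS)
print(exact_count, alias_count)
```

On these four made-up sentences, verbatim matching flags one document while alias matching flags three. Real text varies far more than a hand-built alias list can anticipate, which is one reason a learned model is a plausible fit for this task.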
This exciting project has been piloted by a collaborative partnership including academics, the private sector, and federal agencies. New York University (NYU) provides the overall coordination and management of the effort to develop and implement the platform. (Disclosure: I am on the NYU team.) The Institute for Data Intensive Engineering and Science (IDIES) at Johns Hopkins University ingests and processes the metadata output from the machine learning algorithms into a database that can be validated using a validation tool, then feeds the validated output to the API. IDIES also developed SciServer, a collaborative, web-based science platform originally used for citizen science in other scientific disciplines that has been made available for collaborative work on this project. The Texas Advanced Computing Center (TACC) at the University of Texas at Austin designed, developed, and is implementing the API to disseminate the validated metadata. TACC also supported the front-end implementation by enabling the web connector that allows commercial software from Tableau to visualize the metadata in a dashboard. The University of Maryland, College Park developed a website that describes the methodology, approach, and relevant materials for the user community. Elsevier, the international publishing and information company, provides the information infrastructure, identifies the curated corpus of documents, optimizes the search space, applies the algorithms, and produces metadata. On the government side, the National Center for Science and Engineering Statistics at the National Science Foundation, National Center for Education Statistics at the Department of Education, Economic Research Service and National Agricultural Statistics Service at the U.S. Department of Agriculture, and National Oceanic and Atmospheric Administration at the Department of Commerce have led the pilot project efforts.
This cross-sector collaboration has yielded remarkable results and can be viewed at the Democratizing Data website. Other federal agencies have been closely following developments and may join the effort.
One example of how agencies are using these data can be found at the 5 Ws of NASS Data Usage Dashboard. The National Agricultural Statistics Service (NASS) downloaded metadata from the Democratizing Data API and created a dashboard that describes who, what, when, where, and why its data are being used. NASS also has used the metadata in the dashboard to run an experiment to determine the value of the information to survey respondents in order to understand whether targeted information given to respondents would increase response rates. Declining response rates are a significant problem for statistical agencies that rely heavily on surveys for data collection.
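As a rough illustration of the kind of summary such a dashboard rests on, the sketch below tallies a handful of usage records by author, topic, year, and institution. The records are made up, and the field names are assumptions rather than the actual schema returned by the Democratizing Data API. The "why" dimension is left out because it does not reduce to a single field in this toy example.

```python
from collections import Counter

# Made-up usage records. The field names are assumptions for this sketch,
# not the schema of the Democratizing Data API.
records = [
    {"author": "Author A", "institution": "University X",
     "topic": "crop yields", "year": 2019},
    {"author": "Author B", "institution": "University X",
     "topic": "crop yields", "year": 2021},
    {"author": "Author C", "institution": "College Y",
     "topic": "farm income", "year": 2021},
    {"author": "Author A", "institution": "University X",
     "topic": "farm labor", "year": 2021},
]


def summarize(rows, field):
    """Return (value, count) pairs for one field, most frequent first."""
    return Counter(row[field] for row in rows).most_common()


# Map four of the five Ws onto the assumed fields.
five_ws = {
    "who": summarize(records, "author"),
    "what": summarize(records, "topic"),
    "when": summarize(records, "year"),
    "where": summarize(records, "institution"),
}

for label, counts in five_ws.items():
    print(label, counts)
```

An agency could feed tallies like these into a chart or table. The same counts would also show which institutions or topics are thinly represented, which bears on the outreach goals described earlier.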
It is important for federal agencies to continually assess and improve their tools and processes for monitoring data usage to ensure compliance with the Evidence Act and OMB directives and to enhance the effectiveness of evidence-based policymaking across the government. Much remains to be done to better understand how data are being used and then to increase access to both open and protected data collected and disseminated by the U.S. agencies. The next step underway for the Democratizing Data project includes expanding beyond the corpus of scientific publications to a variety of other publications. For example, the U.S. Cooperative Extension System organizations publish a variety of reports read by farmers and others. Tracking how these reports incorporate data assets from the U.S. Department of Agriculture is challenging, because the data assets are not labeled in a way that makes their use easily cited in a standard way in the extension reports. In another instance, many conference papers are never published in peer-reviewed journals but are of significant interest to agencies because the work relies on federal data assets. Policy documents such as white papers or reports from the U.S. Congressional Research Service also may rely heavily on government-produced data but do not always cite those data in ways that are easily searchable. These challenges exist among all government agencies and may be particularly acute for agencies with significant amounts of open data.
By building on the historical context of evidence-based policy initiatives and emphasizing collaboration between agencies and data users, a comprehensive approach, such as the Democratizing Data Search and Discovery Platform, holds the promise of fostering more informed, efficient, and accountable governance. As the landscape of data management and analytics continues to evolve, the future of evidence-based policymaking is poised for even greater advancements. Data scientists can make major contributions to the math-based tools that continue to be developed, and the information science community can contribute to efforts both to standardize citations and to encourage their use.
As with many worthwhile efforts involving U.S. government agencies, many of the mandates arising from both statutes and OMB directives do not come with allocated resources for development and implementation. Rather, agencies have been expected to reallocate existing resources to put these initiatives in place. One exception to this has been the funding allocated to the National Center for Science and Engineering Statistics, which was authorized significant funding in the CHIPS Act (2022) to run a pilot project(s) to inform creation of a National Secure Data Service. In the case of the Democratizing Data platform, several philanthropic foundations were essential in providing the early funding for the research until agencies could contribute. If developing tools and a workforce capable of using and developing additional tools is to be a sustainable effort, then the resource questions must be addressed.
I have been privileged to be part of the Democratizing Data effort and am also excited to be an editor for an upcoming special edition of HDSR, “Democratizing Data: Data as a Public Asset,” that will take an in-depth look at search and discovery tools, how they fit into the broader context of evidence building, some interesting use cases, how the tools can be expanded in the future to encompass additional publications, and what future innovations may be in store. The special edition should be published in early spring 2024 with the intent of stimulating additional discussion, research, and action in this important area.
The Democratizing Data project has been supported by Schmidt Futures, the Patrick J. McGovern Foundation, and the Alfred P. Sloan Foundation during its pilot stage. Agency partners who have provided funding are the National Center for Education Statistics, National Center for Science and Engineering Statistics, National Agricultural Statistics Service, and the Economic Research Service.
CHIPS Act of 2022, Public Law No. 117-167, 136 Stat. 1366 (2022). https://www.congress.gov/117/plaws/publ167/PLAW-117publ167.pdf
Foundations for Evidence-Based Policymaking Act of 2018, Public Law No. 115-435, 132 Stat. 5529 (2018). https://uscode.house.gov/statutes/pl/115/435.pdf
Lane, J., Gimeno, E., Zhang, Z., & Zigoni, A. (2022). Data inventories for the modern age? Using data science to open government data. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.8a3f2336
Office of Management and Budget. (2016). Managing information as a strategic resource. OMB circular A-130. Retrieved from https://www.whitehouse.gov/wp-content/uploads/legacy_drupal_files/omb/circulars/A130/a130revised.pdf
Office of Management and Budget. (2020). Leveraging data as a strategic asset. Retrieved from https://strategy.data.gov/
Potok, N. A. (2019). Deep policy learning: Opportunities and challenges from the Evidence Act. Harvard Data Science Review, 1(2). https://doi.org/10.1162/99608f92.77e63f8f
Potok, N. A. (2022). Show US the data. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.9d13ba15
©2023 Nancy Potok. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.