Our world is awash in data. Much of it is collected and held by private sector companies whose employees transform it into valuable assets for their own purposes. But much data is also produced by researchers or government agencies and could potentially be transformed into valuable assets for the public good. That potential is now beginning to be realized thanks to important technological advances and policy interventions. Our goal in editing the Democratizing Data special issue is to draw attention to concrete successes, document some key lessons learned from a high-profile pilot, and sketch a future agenda to build on those successes.
There is an urgent need to produce better data for society’s future. Decision makers dealing with today’s complex issues, including climate change, economic, technological, and labor market changes, as well as social disparities in health, income, and well-being, need local, timely, and actionable evidence developed from high-quality data to make good policy decisions. And data increasingly are available from multiple sources including, but not limited to, statistical agencies. Data access and use are no longer the purview of a small advantaged group of scientists and researchers; new infrastructures need to be built that are community driven, that can produce information that is valued and used, and where data are democratized (Lane, 2021; Potok, 2022).
These infrastructures must be designed to make sense of a messy sea of data. Research data are often fragmented, highly idiosyncratic in quality, badly documented and curated, and come in many nontraditional forms—video, voice, text, image, and sensor (Office of Science and Technology Policy, 2022); the same is true for government administrative records (Commission on Evidence-Based Policymaking, 2017). There are vast amounts of data, so curating, documenting, and disseminating data for use and reuse will require the engagement of many incentivized expert communities. A key part of the needed infrastructure is to build such communities of both users and producers around specific data sets and topics. That begins with identifying who uses which data. Such knowledge helps scientists, including junior researchers and graduate students, find other experts in their scientific community, and it helps government agencies and research funders identify usage of their data investment portfolios. It also requires identifying how data are used and which topics they are used to study, so that government agencies can assess how well their data investments are supporting their mission, and scientists can find other work that is complementary to their own.
Fortunately, the timing is excellent to build such a democratized data infrastructure. Congress recognized the value of government data in the Foundations for Evidence-Based Policymaking Act of 2018 (2019; hereafter Evidence Act). The Evidence Act created mandates around both open and protected data. For open data, the chief data officers at federal agencies have been charged with creating publicly available inventories and data access; for protected statistical data, federal statistical agencies are required to provide secure access for researchers. The CHIPS and Science Act (2022) established a National Secure Data Service (NSDS) demonstration project at the National Science Foundation for increasing safe and secure access to protected data. Executive Order 14110 of October 30, 2023: Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence mandated that the National Science Foundation lead a consortium of agencies and nongovernmental partners to create the National Artificial Intelligence Research Resource (NAIRR) pilot.
This special issue highlights a number of important contributions to the democratizing data (DD) movement. One of these efforts is the Democratizing Data Search and Discovery Platform project, which is examined in detail in this special issue because that eight-year journey highlights a number of lessons learned.
The platform is the result of a congressional imperative, early philanthropic investments, tangible use cases, engagement by statistical agencies, and a dedicated research team.
The congressional imperative began in 2016 when New York University was asked by the U.S. Census Bureau to build a secure environment to host confidential microdata to inform the decision-making of the Commission on Evidence-Based Policymaking. That request uncovered a major infrastructure problem. It quickly became apparent that simply hosting data was insufficient. Information about data use, quality, structure, and idiosyncratic issues was only available by word of mouth. Effectively, unless a new or junior researcher was able to tap into a research network of cognoscenti, they would either not know about the data or not know how to use them—with obvious consequences for marginalized researchers. A mechanism was needed to search for, find, and publicly report on data set use.
Philanthropic funding and a tangible use case were the key to developing that initial mechanism. A joint project between the developers of the open source software Jupyter—Brian Granger and Fernando Perez—and Julia Lane of NYU was proposed to and funded by the Alfred P. Sloan Foundation and Schmidt Futures. The Rich Context project researched the potential to use machine learning and natural language processing tools to automate the discovery of data set mentions in scientific documents. The Overdeck Family Foundation also provided substantial additional funding (Lane et al., 2024). Mark Denbaly from the U.S. Department of Agriculture (USDA) and Eugene Burger and Tyler Christensen from the National Oceanic and Atmospheric Administration provided initial use cases.
That initial funding, combined with the use cases, enabled the NYU team to develop and host the first Rich Context competition in 2018; the results were reported in Rich Search and Discovery for Research Datasets (Lane et al., 2020). A national conference was subsequently hosted at the National Press Club in November 2019—also funded by Schmidt Futures and the Alfred P. Sloan Foundation—which was designed to produce a roadmap identifying the opportunities, gaps, and necessary investments; to develop an interdisciplinary community of computer scientists, life scientists, and social scientists who could work together to address the problems; and to engage key stakeholders, notably funding agencies and government agencies.
The output from the workshop fed into continuing work, and during 2021 the effort resulted in a Kaggle competition, known as Show US the Data, to develop open algorithms that would improve on the previous efforts (Lane et al., 2022). More than 1,600 data science teams worldwide competed. The winning algorithms were unveiled at a conference in October 2021 hosted by the Coleridge Initiative and the open access consortium CHORUS.
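The competition task—finding mentions of named data sets in the text of scientific papers—can be illustrated with a minimal, purely hypothetical baseline. The data set names, function names, and matching logic below are illustrative assumptions, not the competition's winning approach, which relied on machine learning and natural language processing rather than exact string matching:

```python
import re

# Hypothetical inventory of agency data set names (illustrative only).
KNOWN_DATASETS = [
    "Survey of Earned Doctorates",
    "Agricultural Resource Management Survey",
]

def find_dataset_mentions(text):
    """Return (data set name, character offset) pairs found in a document.

    A naive exact-match baseline: real entries in the Show US the Data
    competition used ML/NLP models to generalize beyond string matching,
    catching abbreviations, variant spellings, and unlisted data sets.
    """
    mentions = []
    for name in KNOWN_DATASETS:
        for match in re.finditer(re.escape(name), text):
            mentions.append((name, match.start()))
    return mentions

doc = ("We analyze the Survey of Earned Doctorates to study "
       "career outcomes of PhD recipients.")
print(find_dataset_mentions(doc))
```

The gap between this kind of dictionary lookup and robust discovery—handling paraphrases, acronyms, and data sets absent from any inventory—is precisely what motivated an open algorithmic competition.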
The next step was achieved with a pilot DD Search and Discovery platform that was initiated with four statistical agencies (National Center for Science and Engineering Statistics, Economic Research Service, National Center for Education Statistics, and National Agricultural Statistics Service) in partnership with one of the CHORUS board members, Elsevier, and with the support of the Patrick J. McGovern Foundation. The goal was to test the possibility of connecting agency data sets with Elsevier’s fully curated corpus of publications, Scopus, and demonstrate the feasibility of responding to Title II of the Evidence Act, also known as the OPEN Government Data Act.
The results of the pilot have been eye-opening in terms of the potential applications for usage data. In just two years, junior researchers at minority-serving institutions have found how much they can learn by examining what topics have been studied. Often, data about their lived experiences were either not collected or not researched. Hitherto unknown uses of data by the agricultural extension service have been identified to inform USDA’s understanding of how their data investments have been used. Community workshops have shown how the platform can be used to find other data users—who were at the same university but unknown to each other. The American Statistical Association’s joint project with George Mason University to assess the health of Federal Statistical System agencies has incorporated the usage of agency data assets as a metric.
This special issue of HDSR suggests that the future is bright. The articles in this issue show that providing evidence about how data sets are used can spur the development of that community of agencies, researchers, and other stakeholders that is necessary to produce useful data and evidence to address society’s problems. The articles that point to the future suggest that the democratizing data movement could be a crucial turning point in opening access to high-quality, objective data to a wider community. The DD movement can make information available to all, whether they are in agencies, minority-serving or emerging research academic institutions, or the philanthropic community. The establishment of the National Secure Data Service and the National Data Platform provides the opportunity to build such an infrastructure.
At least part of an initial research agenda has been identified, although we expect that the evident need and interest will create new uses and applications if the ideas in this special issue become institutionalized. For example, more work can be done to expand search and discovery to open access publications, government publications, and, with the right incentives, incorporate other types of writings from data users who would like their work discovered. The machine learning approaches can be improved—both as AI tools become more efficient and as the community provides more input and corrections to improve the training and test data. There is a vast agenda for communities to develop context-specific search and discovery taxonomies. The sociology of search may become a research field in its own right.
We hope that this special issue is not only a look at what has been done over the last eight years but is also a way to peer into the future and galvanize the democratizing data community to expand the work. Our future depends on it.
The work described here draws on the contributions of many people. A partial list includes our coeditor, Atti Emecz, as well as the many people who are not mentioned elsewhere in this special issue but contributed to the platform: Rafael Ladislau Alves, Vipin Arora, Kevin Barnes, Eugene Burger, Tyler Christensen, Josua De La Rosa, Mark Denbaly, M’hamed el Aisati, Ernesto Gimeno, Clayton Hunter, Juliana Jaime, Suzette Kent, Andy Kerns, Ekaterina Levitskaya, Kelly Maguire, Arik Mitschang, Paco Nathan, and Zheyuan Zhang.
The authors acknowledge the support of the Patrick J. McGovern Foundation, the National Center for Science and Engineering Statistics (NCSES) of the U.S. National Science Foundation (NSF), and the Economic Research Service (ERS) and the National Agricultural Statistics Service (NASS) of the U.S. Department of Agriculture (USDA).
CHIPS and Science Act, Pub. L. No. 117–167, 136 Stat. 1366 (2022). https://www.congress.gov/bill/117th-congress/house-bill/4346
Commission on Evidence-Based Policymaking. (2017). The promise of evidence-based policymaking. https://www2.census.gov/adrm/fesac/2017-12-15/Abraham-CEP-final-report.pdf
Exec. Order No. 14110, 3 C.F.R. 75191 (2023). https://www.federalregister.gov/documents/2023/11/01/2023-24283/safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence
Foundations for Evidence-Based Policymaking Act of 2018, Pub. L. No. 115-435, 132 Stat. 5529 (2019). https://www.congress.gov/bill/115th-congress/house-bill/4174
Lane, J. (2021). Democratizing our data: A manifesto. MIT Press.
Lane, J., Feldman, S., Greenberg, J., Sotsky, J., & Dhar, V. (2024). The importance of philanthropic foundations in democratizing data. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.3f34436c
Lane, J., Gimeno, E., Levitskaya, E., Zhang, Z., & Zigoni, A. (2022). Data inventories for the modern age? Using data science to open government data. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.8a3f2336
Lane, J. I., Mulvany, I., & Nathan, P. (Eds.). (2020). Rich search and discovery for research datasets: Building the next generation of scholarly infrastructure. Sage. https://study.sagepub.com/richcontext
Office of Science and Technology Policy. (2022). Envisioning a National Artificial Intelligence Research Resource (NAIRR): Preliminary findings and recommendations. https://ai.gov/wp-content/uploads/2023/12/NAIRR-TF-Interim-Report-2022.pdf
Potok, N. (2022). Show US the data. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.9d13ba15
©2024 Julia Lane and Nancy Potok. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.