
Show US the Data

Published on Apr 28, 2022

Editor’s Note: In this Diving Into Data column piece, Nancy Potok, a Senior Fellow at the George Washington University, CEO of NAPx Consulting, Former Chief Statistician of the United States, and Former Deputy Director and Chief Operating Officer of the U.S. Census Bureau, showcases the work of Show US the Data, a multi-agency, public-private collaborative pilot project with a diverse leadership team and diverse backing. Show US the Data demonstrates how AI methods can dramatically change the ways in which public data are used (including building communities around these data) and can also change what we know about how public data are used. Of note is the creative use of podcasts to bring researchers together.

Keywords: research publications, rich text analysis, discovery

Data search and discoverability are high on the wish list of scientists and U.S. government agencies alike. Scientists want to learn what’s already been done, connect with other researchers, and reuse code when possible. Research agencies, for their part, are under legislative pressure to connect with data users: the Foundations for Evidence-Based Policymaking Act of 2018 requires agencies to solicit feedback from users on the utility and accessibility of their data. The problem, however, is that most U.S. agencies have no idea who is using their data or for what purpose. Even for important public policy problems that need statistical data or data gathered as part of administering major programs such as Medicaid, income taxes, the Supplemental Nutrition Assistance Program (SNAP), or veterans’ health, agencies can only track how many hits they receive on the websites where they have made their public data sets available.

We can do much better. What we need are automated search tools that help agencies see who is using their data in research and help researchers discover how others are using these data for research similar to their own. In the course of making these community connections, such tools can also help spark new research and better information to tackle priority social and economic issues.

Fortunately, there is good news. Development of these tools is well underway. One ongoing effort, Show US the Data, has progressed to a major, multi-agency pilot project and is already showing early results. It is the brainchild of a group of collaborators led by New York University Professor Julia Lane, former Federal Chief Information Officer Suzette Kent, and me, and also includes the Texas Advanced Computing Center at the University of Texas at Austin, the global publishing and analytics company Elsevier, and the Institute for Data Intensive Engineering and Science at Johns Hopkins University. Because most scientific research is reported in journal articles, Lane and her partners have set out to use machine learning and natural language processing to conduct rich text analysis on a corpus of over 84 million publications managed by Elsevier to find citations of specific data sets. With backing from the Overdeck Family Foundation, Schmidt Futures, and CHORUS, Show US the Data ran a Kaggle competition to develop algorithms that would be up to the task. More than 1,600 data science teams worldwide competed. The winning algorithms were unveiled at a conference in October 2021 and have been incorporated into a pilot project in which six federal agencies are participating: the Economic Research Service and National Agricultural Statistics Service at the U.S. Department of Agriculture, the National Center for Science and Engineering Statistics at the National Science Foundation, the National Center for Education Statistics at the U.S. Department of Education, the National Oceanic and Atmospheric Administration at the U.S. Department of Commerce, and NASA.
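To make the task concrete, here is a minimal sketch of the simplest possible baseline for spotting data set mentions in article text: matching against a list of known names and acronyms. This is not the method used by the competition winners or the pilot, which rely on machine learning and natural language processing; the alias table, function names, and sample sentence are all assumptions made up for illustration.

```python
import re

# Hypothetical alias table (an assumption for illustration):
# canonical data set name -> names and acronyms that may refer to it.
DATASET_ALIASES = {
    "Rural-Urban Continuum Codes": ["Rural-Urban Continuum Codes", "RUCC"],
    "Survey of Earned Doctorates": ["Survey of Earned Doctorates", "SED"],
    "Agricultural Resource Management Survey": [
        "Agricultural Resource Management Survey",
        "ARMS",
    ],
}


def build_patterns(aliases):
    """Compile one case-sensitive regex per data set from its aliases."""
    patterns = {}
    for dataset, names in aliases.items():
        # Try longer aliases first so a full name wins over a shorter one.
        ordered = sorted(names, key=len, reverse=True)
        alternation = "|".join(re.escape(name) for name in ordered)
        # The lookarounds require a non-word character (or the text edge) on
        # both sides, so "ARMS" does not match inside a longer token.
        patterns[dataset] = re.compile(r"(?<!\w)(?:" + alternation + r")(?!\w)")
    return patterns


def find_mentions(text, patterns):
    """Return {data set name: number of mentions} for data sets found in text."""
    hits = {}
    for dataset, pattern in patterns.items():
        count = len(pattern.findall(text))
        if count:
            hits[dataset] = count
    return hits


if __name__ == "__main__":
    sample = (
        "We classify counties with the Rural-Urban Continuum Codes (RUCC) "
        "and link them to the Survey of Earned Doctorates. "
        "She folded her arms."
    )
    print(find_mentions(sample, build_patterns(DATASET_ALIASES)))
```

Matching is case-sensitive so that an acronym such as ARMS is not confused with the ordinary word "arms". A list-based approach like this misses paraphrased or previously unknown data set names, which is one reason a learned model is attractive for a corpus of this size.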

The pilot provides value for three primary participants in the research data ecosystem. First, the agencies creating the data will get a dashboard and other visualizations so they can see how their data are being used. The results can be surprising. For example, the Economic Research Service at the Department of Agriculture maintains a data set named Rural-Urban Continuum Codes. This data set provides a classification scheme that distinguishes metropolitan counties by the population size of their metro area, and nonmetropolitan counties by degree of urbanization and adjacency to a metro area. A search of scientific journals conducted as part of the pilot project showed that the codes are heavily used in health research. That opens the door to a whole new group of stakeholders beyond the previously identified group of agricultural economists.

A second output is a community dashboard that shows the top datasets being used for research on different topics. That is, if you are interested in a particular topic and want to know which datasets published researchers have used to study it, you will be able to see that at a glance.

Finally, the pilot project is intended to develop a researcher leaderboard that shows the top researchers for different datasets based on the number of their publications. Of course, simply publishing a lot doesn’t guarantee that the research is influential, but it is a good place to start understanding who is using the data—identifying experts and helping build collaborative communities. The ancillary benefit to the researchers is a place where their publications can be widely seen and cited in future research. Who doesn’t want to be at the top of the leaderboard, after all?
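Once data set mentions have been linked to publications, both the community dashboard and the researcher leaderboard described above come down to counting. The following is a minimal sketch under that assumption; the record layout, function names, and sample publications are hypothetical and are not drawn from the pilot's actual data or code.

```python
from collections import Counter, defaultdict

# Made-up publication records, for illustration only.
PUBLICATIONS = [
    {
        "authors": ["Researcher A", "Researcher B"],
        "topic": "rural health",
        "datasets": ["Rural-Urban Continuum Codes"],
    },
    {
        "authors": ["Researcher A"],
        "topic": "rural health",
        "datasets": ["Rural-Urban Continuum Codes", "National Vital Statistics"],
    },
    {
        "authors": ["Researcher C"],
        "topic": "doctoral education",
        "datasets": ["Survey of Earned Doctorates"],
    },
]


def researcher_leaderboard(publications, top_n=3):
    """For each data set, rank authors by number of publications using it."""
    counts = defaultdict(Counter)
    for pub in publications:
        # set() guards against counting a publication twice for one author.
        for dataset in set(pub["datasets"]):
            for author in set(pub["authors"]):
                counts[dataset][author] += 1
    return {ds: ranked.most_common(top_n) for ds, ranked in counts.items()}


def top_datasets_by_topic(publications, top_n=3):
    """For each topic, rank data sets by number of publications using them."""
    counts = defaultdict(Counter)
    for pub in publications:
        for dataset in set(pub["datasets"]):
            counts[pub["topic"]][dataset] += 1
    return {topic: ranked.most_common(top_n) for topic, ranked in counts.items()}


if __name__ == "__main__":
    print(researcher_leaderboard(PUBLICATIONS))
    print(top_datasets_by_topic(PUBLICATIONS))
```

Ranking by raw publication count mirrors the caveat above: it shows who is using the data, not how influential the work is. A weighting such as citation counts could be swapped in where that matters.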

As the pilot continues, the machine learning methods will be continuously improved, encouraging other agencies and collaborators to join in making this important discovery tool part of the sustainable data infrastructure.

One fun aspect of the Show US the Data pilot project is the creation of five-minute podcasts with researchers. In each podcast, I discover more about the important research they have been conducting with the identified datasets, how the research benefits the American public, advice to other researchers who want to access and use the data, and thoughts about how the data could be improved.

Below are six of the podcasts, discussing a variety of data sets from the pilot agencies. Each podcast has some brief show notes and a link to the transcript if you prefer to read rather than listen. I hope you enjoy them. Comments and suggestions are encouraged; they will help us improve the content of future podcasts, which we intend to be a regular feature of this column.


Season 1 Episode 1: Dr. Ray Hart: National Assessment of Educational Progress

Do large urban schools get more bang for their buck than other school districts around the country? What happens when a large portion of your student body is not simply on the free or reduced-cost lunch program but comes from a community that is in abject poverty? What are their environmental barriers to learning? Dr. Ray Hart, Executive Director of the Council of the Great City Schools, discusses his research using National Assessment of Educational Progress (or NAEP) data collected by the National Center for Education Statistics at the U.S. Department of Education. His report, “Mirrors or Windows: How Well Do Large City Public Schools Overcome the Effects of Poverty and Other Barriers,” shatters myths and helps us think about urban education in new ways.

Season 1 Episode 2: Dr. Becca Jablonski: Agricultural Resource Management Survey (ARMS)

Artisan and locally produced foods have garnered much attention and many fans over the last several years. They are often produced by beginning farmers and ranchers and distributed through local outlets such as farmers markets. Yet most farm programs are geared to support large producers. What do we need to know about this group of small and mid-size food producers in order to craft public policy that supports the next generation of farmers and encourages rural economic development? Dr. Becca Jablonski, Associate Professor and Food Systems Extension Economist at Colorado State University, discusses her research using Agricultural Resource Management Survey (ARMS) data collected by the National Agricultural Statistics Service at the U.S. Department of Agriculture. In this conversation, she passes along valuable insights to other researchers interested in this underserved yet vital group of food producers.

Season 1 Episode 3: Dr. Chen Zhen: Scanner Data-Based Panel Price Indexes

Inflation is on the minds of many consumers as they carefully assess the price of food, gas, and other everyday purchases. Are prices influencing our shopping choices? With us today is Dr. Chen Zhen, Georgia Athletic Association Professor in Food Choice, Obesity, and Health at the University of Georgia. His research has made use of the retail scanner data made available by the Economic Research Service at the U.S. Department of Agriculture, constructing scanner data-based panel price indexes. These data tell an important story about consumer buying choices and public policy.

Season 1 Episode 4: Dr. Tiffany Oliver: Survey of Earned Doctorates

Did you know that the highest number of Black women who earn PhDs in STEM fields each year are graduates of Spelman College? Have you ever wondered what their journey to a PhD looks like? How does their experience compare to that of men and women of other racial and ethnic backgrounds? Today’s episode discusses the Survey of Earned Doctorates (SED), an annual census conducted since 1957 of all individuals receiving a research doctorate from an accredited U.S. institution in a given academic year. The SED is sponsored by the National Center for Science and Engineering Statistics (NCSES) within the National Science Foundation (NSF). The SED collects information on doctoral recipients’ educational history, demographic characteristics, and postgraduation plans. Results are used to assess characteristics of the doctoral population and trends in doctoral education and degrees. With us to discuss her work with these data is Dr. Tiffany Oliver, Associate Professor and Chair of the Biology Department at Spelman College.

Season 1 Episode 5: Dr. Janet Currie: Vital Statistics and the Supplemental Nutrition Program for Women, Infants, and Children (WIC)

Everyone can probably agree that we want American children to be healthy and well nourished. Several programs are in place to support children and their mothers even before those children are born. But are those programs effective? Do they actually make a difference in life outcomes for the young beneficiaries? Does maternal participation in government programs make a difference for their children? Our guest is Dr. Janet Currie, the Henry Putnam Professor of Economics and Public Affairs at Princeton University and the Co-Director of Princeton's Center for Health and Wellbeing. She also co-directs the Program on Families and Children at the National Bureau of Economic Research. Dr. Currie’s research has used National Vital Statistics data collected by the National Center for Health Statistics at the Centers for Disease Control and Prevention to provide insights into how programs such as USDA's Special Supplemental Nutrition Program for Women, Infants, and Children (WIC) and Medicaid have influenced children’s health outcomes.

Season 1 Episode 6: Dr. Julia Lane: UMETRICS and the Survey of Earned Doctorates

Have you ever wondered how research experience shapes the decisions students make as they choose a career path? How do gender and race affect those research experiences and subsequent career decisions during graduate studies and early career pathways within STEM fields? Does federally funded research yield safer and more secure food systems? Can we measure how federally funded university researchers produce additional innovations and contribute to regional growth? Today’s episode discusses the Survey of Earned Doctorates (SED) and the Institute for Research on Innovation and Science (IRIS) UMETRICS research program at the University of Michigan. Sponsored by the National Center for Science and Engineering Statistics (NCSES) within the National Science Foundation (NSF), the SED is an annual census conducted since 1957 of all individuals receiving a research doctorate from an accredited U.S. institution in a given academic year. The SED collects information on doctoral recipients’ educational history, demographic characteristics, and postgraduation plans. Results are used to assess characteristics of the doctoral population and trends in doctoral education and degrees. The UMETRICS dataset includes information on research spending, vendor contracts, and employees from research institutions representing 41% of total U.S. university R&D spending. IRIS data have been used by researchers in higher education, economics, sociology, and many other fields. Hear Dr. Julia Lane, Professor at the NYU Wagner Graduate School of Public Service and at the NYU Center for Urban Science and Progress, Provostial Fellow for Innovation Analytics, cofounder of the Coleridge Initiative, and leader of Show US the Data, discuss some of the interesting findings from exploring UMETRICS data and linking them to the Survey of Earned Doctorates.


Podcasts are produced by XBit Technology

Disclosure Statement

Nancy Potok has no financial or non-financial disclosures to share for this article.

©2022 Nancy Potok. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
