Date: 2024-02-05

Abstract

Data access, use, and reuse are crucial for empirical science and evidence-based policymaking, and rely on metadata to facilitate data discovery and utilization by users and producers alike. Metadata quality is pivotal for the federal government to understand data available for evidence building, enabling agencies to identify data production gaps and redundancies, and enhancing evidence quality through reproducible research. The alignment of federal data agency incentives with the private sector, alongside technological advancements, now supports feedback-driven data classification, leveraging machine learning for improved data discoverability and categorization. This paper outlines the multiple classification needs of one statistical agency, the National Center for Science and Engineering Statistics (NCSES), and proposes a machine learning approach for classifying datasets based on usage in research, aligning with legislative and policy frameworks to enhance data governance, interoperability, and utility for evidence-based decision-making.

Keywords: data set search and discovery, metadata for data sets, labels for data sets, topic identification, research fields



03/05/2024: This is the Just Accepted version of the article. This peer-reviewed version has been accepted for its content and is currently being copyedited to conform with HDSR’s style and formatting requirements.

©2024 Christina Zdawczyk, Julia Lane, Emilda Rivers, and May Aydin. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.