The Democratizing Data Initiative seeks to demonstrate the impact of the data assets produced by U.S. federal agencies. In this paper we outline the process flows and cyberinfrastructure that support that objective. We describe how the focus on data assets evolved from the initial search for datasets to the challenge of discovering how those data assets have been used by the research community. In this regard, we explain both why we need a search corpus of full-text records and the process by which that corpus is created. We then describe how machine learning algorithms, sourced from a Kaggle competition, are applied to the full-text search corpus, along with the further steps we take to refine the outputs those algorithms generate, including manual validation by subject matter experts to minimize false-positive dataset identifications in the final results. Once validation is complete, the outputs are made available to diverse stakeholders - agency staff, funding partners, and the general public - through a variety of channels, including REST APIs, dashboards, SciServer, Jupyter notebooks, and direct database access. We conclude by considering potential future enhancements to the process and possible research directions.
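The pipeline summarized above - detect candidate dataset mentions in a full-text corpus, then retain only those confirmed by subject matter experts - can be sketched as follows. This is a minimal, illustrative sketch only: the record structure, the regex-based stand-in for the Kaggle-competition models, and all names are hypothetical, not the initiative's actual implementation.

```python
import re

# Hypothetical full-text search corpus: one record per publication.
corpus = [
    {"id": "pub-1", "text": "We analyzed the Survey of Doctorate Recipients microdata."},
    {"id": "pub-2", "text": "Results are based on simulations only."},
]

# Naive stand-in for the machine learning models: a pattern matcher
# for dataset-like names (illustrative only).
DATASET_PATTERN = re.compile(r"Survey of [A-Z][a-z]+ [A-Z][a-z]+")

def detect_candidates(corpus):
    """Yield (publication id, candidate dataset mention) pairs."""
    for record in corpus:
        for match in DATASET_PATTERN.finditer(record["text"]):
            yield record["id"], match.group()

def validate(candidates, approved):
    """Keep only candidates confirmed by subject matter expert review,
    discarding false-positive identifications."""
    return [(pid, name) for pid, name in candidates if name in approved]

candidates = list(detect_candidates(corpus))
validated = validate(candidates, approved={"Survey of Doctorate Recipients"})
```

The validated pairs are what a downstream channel (a REST API or dashboard, say) would expose; the key design point is that the expert-review step sits between model output and publication.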
Keywords: data assets, impact, full text search, machine learning, Kaggle
02/08/2024: This is the Just Accepted version of the article. This peer-reviewed version has been accepted for its content and is currently being copyedited to conform with HDSR’s style and formatting requirements.
©2024 Attila Emecz, Arik Mitschang, Christina Zdawczyk, and Maytal Dahan. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.