
Discovering Datasets in Unstructured Corpora: Discovering Use and Identifying New Opportunities

Forthcoming. Now Available: Just Accepted Version.
Published on Feb 12, 2024

Abstract

Federal statistical agencies are keenly aware that scientific research is not the only way in which their data assets are used for evidence building. Entities around the world have been building services on top of curated corpora of scientific research to provide insight into the importance of the works in each collection. Such services also provide a much-needed framework for statistical agencies and other dataset creators to find the uses and impacts of their datasets. While the other papers in this special issue largely apply machine learning models to find datasets in an extensively curated corpus, this paper starts with a much less structured framework and examines the potential to discover how datasets are used in writing targeted at a broader base of users than scientific researchers. It also describes the challenges encountered and the lessons learned from the exercise.

Keywords: Cooperative Extension Service, Beautiful Soup, National Agricultural Statistics Service, web scraping, Evidence Act, dataset impact



02/12/2024: To preview this content, click below for the Just Accepted version of the article. This peer-reviewed version has been accepted for its content and is currently being copyedited to conform with HDSR’s style and formatting requirements.


©2024 Nick Pallotta, John M Locklear, Xiangyu Ren, Victor Robila, and Adel Alaeddini. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
