Date: 2024-01-15

Abstract

How widely government-funded datasets are used has yet to be comprehensively tracked or understood. The lack of a standardized citation methodology has so far prevented the government from understanding dataset usage in a transparent, accessible way. In this work, we build on recent advances in natural language processing and on a recent Kaggle competition to develop an extensible framework for extracting government dataset usage from scientific publications. We then apply the resulting techniques to over 50,000 scientific articles from Elsevier’s ScienceDirect collection. Finally, we show that improving the algorithms submitted to the competition and combining them in an ensemble raises overall performance on an evaluation dataset.

Keywords: datasets, machine learning, artificial intelligence, natural language processing



©2024 Ryan Hausen and Hosein Azarbonyad. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.