The prevalence of government-funded data set usage has yet to be comprehensively tracked and understood. The lack of a standardized citation methodology has thus far prevented the government from understanding data set usage in a transparent, accessible way. In this work, we seek to build on recent successes in natural language processing techniques and a recent Kaggle competition to develop an extensible framework for extracting government data set usage from scientific publications. Further, we apply the developed techniques to over 50,000 scientific articles from Elsevier’s Scopus collection. Finally, we show that improvements to the submitted algorithms, along with ensembling, improve overall performance on an evaluation data set.
Keywords: data sets, machine learning, artificial intelligence, natural language processing
Understanding how government-funded data is used in science is integral to deciding how to most effectively allocate government resources. The reporting of data set usage has not yet been standardized or widely adopted in the scientific literature. Therefore, uncovering data set usage requires manual review and best-guess estimation. Lane et al. (2022) ran a Kaggle Competition1 (both referred to as COL22 hereafter) to develop automated approaches to discovering government data sets in texts. In this work, we seek to analyze and improve the submitted methods and describe the application of these methods at the scale of tens of thousands of documents. The code and model parameters are freely available at https://purl.archive.org/democratizing-data/code.
At a high level, the participants in the COL22 competition were tasked with providing a Jupyter Notebook2 that, when run, iterates over a list of documents and outputs a comma-separated value (CSV) file containing a column for the document IDs and a column containing a pipe-delimited (‘|’) list of the data set labels detected in each document.
The remainder of the work is laid out as follows: Section 2 discusses related work, Section 3 describes the data, Section 4 details the models used and a framework for future development and evaluation of models, Section 5 reports the results, Section 6 discusses the applications of the models on tens of thousands of documents, and Section 7 reviews the work done and proposes future research directions.
There has been a growing interest in methods to detect data set references in scientific texts. Previous work has identified informally cited references, such as data sets and software, in the scientific literature. These works differ from ours in either scale or focus. Duck et al. (2013) developed a technique for detecting bioinformatics software and database resources in scientific texts, using a data set of 60 full-text articles: 25 for training, 5 for development, and 25 for evaluation. Luan et al. (2018) examined 500 abstracts and built a knowledge graph that identifies scientific entities. Zhao et al. (2019) developed methods for identifying the role and function of hyperlinked resources in scientific papers. Otto et al. (2023) developed a technique and data set for entity extraction of machine learning models and data sets. Ghavimi, Mayr, Lange, et al. (2016) and Ghavimi, Mayr, Vahdati, and Lange (2016) describe methods for detecting data sets in social science papers and evaluate the efficacy of their models using 15 hand-curated documents.
Several studies have treated the detection of data set mentions in natural language processing (NLP) as a named entity recognition (NER) task. Early work mostly focused on architectures that combined word representation vectors with recurrent neural networks (RNN), such as long short-term memory networks (LSTM) and bidirectional LSTMs (BiLSTM) (Hou et al., 2021; Li et al., 2021). More recent studies have experimented with data set mention detection as a downstream task for pretrained language models such as BERT (Bidirectional Encoder Representations from Transformers), outperforming the previously mentioned models (Heddes et al., 2021). Some recent work has focused on improving the scale of training and test data along with model development. Hou et al. (2019) developed an automated approach for generating leaderboards for data sets and tasks from NLP-focused papers. Prasad et al. (2019) develop convolutional neural network– and recurrent neural network–based approaches to the Coleridge Initiative’s Rich Context Competition data set. Heddes et al. (2021) developed an approach to detect data sets in artificial intelligence (AI) papers using a corpus of human-annotated sentences drawn from AI papers.
For model training and evaluation, we leveraged the COL22 Kaggle competition data, consisting of scientific publications, each paired with the labels of the data sets mentioned in the text.
All data set labels were preprocessed according to the requirements of the competition. Specifically, all data set labels are converted to lowercase characters and any non-alphanumeric characters are replaced with single-space characters. An example of the preprocessing is shown below:

Aging Integrated Database (AGID) → aging integrated database agid
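For reference, a minimal sketch of this preprocessing in Python, consistent with the example above (runs of non-alphanumeric characters collapse to a single space; the function name is ours, not part of the competition code):

```python
import re

def clean_label(label: str) -> str:
    # Lowercase and replace runs of non-alphanumeric characters with a single space.
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()

print(clean_label("Aging Integrated Database (AGID)"))
# -> aging integrated database agid
```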
Though cased versions of the labels are available and some of the implemented models reproduce the mentions in their cased form, evaluations are performed on lowercase versions to fairly compare between models submitted to the competition and further developed methods.
The training labels from the COL22 competition consist of the raw string representation of the target data set associated with the text (i.e., “our world in data,” “beginning postsecondary students,” and “baccalaureate and beyond longitudinal study”). The training labels do not include information about their location in the text or any categorical information about them.
The methods developed in this work build upon the top three submissions to the COL22 competition by extracting the core methods from the submissions, analyzing and improving the core methods, and building an ensemble that benefits from the strengths of each method. The submissions were in the form of a Jupyter Notebook3 that is executed in a Python environment. The algorithms for each submission are refactored into a single code base with a unified interface for each submitted and newly developed model. The algorithmic approaches are organized into four categories: string matching, entity classification, token classification, and ensemble.
Each of the submitted methods, as noted above, was provided in the form of a Jupyter Notebook. Submissions were not restricted to a simple function consisting of the model alone; with the exception of the simple string-matching method, each submission wrapped additional heuristics around the core model to improve the final competition score. Leveraging additional heuristics creates an implicit ensemble of the core model and the surrounding heuristics. Equation 4.1 gives an example of what such an ensemble might look like.
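As a minimal sketch in notation of our own choosing, such an implicit ensemble composes pre- and postprocessing heuristics around a core model:

$$\hat{Y}(d) = h_{\text{post}}\bigl(f_{\theta}\bigl(h_{\text{pre}}(d)\bigr)\bigr),$$

where $d$ is a document, $h_{\text{pre}}$ denotes preprocessing heuristics applied to the text, $f_{\theta}$ is the core model, and $h_{\text{post}}$ denotes postprocessing heuristics (e.g., filtering or string cleanup) applied to the model output.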
The additional heuristics submitted alongside the core methodology, though helpful in improving the competition score, obfuscate the efficacy of the core method. Consider the ensemble described in Equation 4.1: the measured performance reflects the combination of the core model and its surrounding heuristics, so it is unclear how much of the competition score is attributable to the core method itself.
To evaluate the core component of the submitted and proposed methods, we developed an extensible framework for developing and evaluating methods on the COL22 data. The submission criteria for the COL22 competition included that the format be a Jupyter Notebook and that the notebook finish running on the test data within 9 hours. Analyzing the core components of the submitted algorithms directly from the submitted Jupyter Notebooks was difficult due to differing implementations and coding practices. We refactor and reimplement the core methods such that they follow a unified interface. Further, we implement functions to evaluate methods on the COL22 data, including ensembles of multiple methods. To promote openness and contributions from the community, we publicly release our code and model parameters, which are available at https://purl.archive.org/democratizing-data/code.
The methods examined in this work are organized into three categories: String Matching (Section 4.3), Entity Classification (Section 4.4), and Token Classification (Section 4.5). In Section 4.6 we discuss an ensemble approach leveraging models from each of the categories.
The third-place submission to the COL22 competition was a simple string-matching algorithm (Mikhail, 2021) where a given list of data sets4 is compared to the text of documents with some preprocessing (see Algorithm 1). Though this algorithm is fast, it is fragile in two important ways. First, the algorithm makes all data sets and text lowercase before matching. This causes the algorithm to generate false positives when an acronym shares a spelling with another word. An example of this is the Higher Education Research and Development Survey5, abbreviated HERD, which can also, and more commonly, refer to a group of animals. Homographs like these result in a large number of false positives. Second, the submitted algorithm splits the input text into sentences on every occurrence of a period. Each ‘sentence’ is then searched for all the data sets. However, periods do not always indicate the end of a sentence and breaking up the text this way can cause the algorithm to miss data sets that were improperly separated.
To improve this algorithm, we replace simple string matching with a regular expression-based matching algorithm (see Algorithm 2). Using a regular expression allows the algorithm to selectively enforce casing. By selectively enforcing casing we eliminate the false positives generated by acronyms that share a spelling with another more common word. Further, we run the regular expression on the entirety of the text, which will avoid missing data sets from improperly splitting the text into sentences. Finally, using regular expressions offers more flexibility beyond casing for variations on data set names. Some data sets may have numerical variations such as year or version that need special rules beyond casing that can be better handled by a regular expression.
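A minimal sketch of regular expression–based matching with selectively enforced casing; the label list and helper function here are illustrative, not the paper’s Algorithm 2:

```python
import re

def find_label_mentions(text: str, labels: list[str]) -> set[str]:
    """Return the labels from `labels` that appear in `text`."""
    found = set()
    for label in labels:
        # Enforce casing for acronyms (all-caps labels) so that "HERD" does not
        # match the common word "herd"; allow case-insensitive matches otherwise.
        flags = 0 if label.isupper() else re.IGNORECASE
        pattern = re.compile(r"\b" + re.escape(label) + r"\b", flags)
        if pattern.search(text):
            found.add(label)
    return found

labels = ["Higher Education Research and Development Survey", "HERD"]
text = "A herd of cattle grazed nearby; the HERD survey tracks R&D spending."
print(find_label_mentions(text, labels))  # {'HERD'}
```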
The second-place submission to the COL22 competition was an entity classification–based method (Lee, 2021). Entity classification is an approach comprised of two steps. The first step is to extract all entities from a target document (extraction). The second step is to classify an entity as being a data set or not (classification). Improving the performance of the algorithm can be done by improving the extraction and classification algorithms.
The submitted extraction algorithm is an implementation of the Schwartz and Hearst (2003) (SH03) algorithm for detecting abbreviation definitions. The SH03 algorithm seeks to find abbreviations in the form: Abbreviated Entity Definition (AED). Though generally effective, not all data sets are referenced using that form. Some data sets, like the USDA Census of Agriculture, do not have a common acronym. Other data sets, like the OECD’s PISA (Programme for International Student Assessment) data set, are so commonly used that the acronym may not be defined in the text at all. Additionally, the submitted implementation was not robust: for example, nested parentheses and abbreviations that include year references, e.g., Abbreviated Entity Definition (AED;95), are not handled correctly.
In our approach, we opt to replace SH03 with a regular expression for entity extraction, shown below (split into multiple lines for readability),
(((([A-Z][a-z]{2,}|[A-Z]{3,})('s)?)\ )
(([A-Z][a-z]{2,}|[A-Z]{3,}|and\ |for\ |in\ |of\ |the\ |on\ |to\ )('s)?(\ )?){2,}
(?<!and\ |for\ |in\ |of\ |the\ |on\ |to\ )(\([A-Z]{3,}\))?)
The above regular expression can be translated semantically as follows:

1. Match a sequence of characters that starts with a capital letter followed by at least two lowercase letters, or consists of at least three capital letters, followed by an optional apostrophe-s [‘s] and a space.
2. Match pattern (1) or one of the words “and,” “for,” “in,” “of,” “the,” “on,” or “to,” followed by an optional apostrophe-s [‘s] and an optional space; this pattern is repeated two or more times.
3. Match a sequence of at least three capital letters between parentheses, so long as it is not preceded by one of: “and,” “for,” “in,” “of,” “the,” “on,” or “to.”
In designing a regular expression, we sought to capture as many entities as possible. Our proposed regular expression matches strings that would be detected by SH03 as well as additional entities that would otherwise have been missed, most obviously those that do not end with an acronym wrapped in parentheses. The regular expression also captures many more non–data set entities than SH03 (e.g., “United States of America (USA)” and “Jenny and John”). However, the emphasis for the extraction algorithm is to improve recall, as it is paired with a classifier that is responsible for the precision of the output data sets.
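A minimal sketch of applying the entity-extraction regular expression above; because Python’s built-in re module does not accept the variable-width lookbehind, this sketch uses the third-party regex package:

```python
import regex  # pip install regex; supports variable-width lookbehind

ENTITY_PATTERN = regex.compile(
    r"(((([A-Z][a-z]{2,}|[A-Z]{3,})('s)?)\ )"
    r"(([A-Z][a-z]{2,}|[A-Z]{3,}|and\ |for\ |in\ |of\ |the\ |on\ |to\ )('s)?(\ )?){2,}"
    r"(?<!and\ |for\ |in\ |of\ |the\ |on\ |to\ )(\([A-Z]{3,}\))?)"
)

text = (
    "We use the Higher Education Research and Development Survey (HERD) "
    "and data collected across the United States of America (USA)."
)
for match in ENTITY_PATTERN.finditer(text):
    print(match.group(0))
# Both phrases are extracted; the downstream classifier is responsible for
# filtering out non-data set entities such as "United States of America (USA)".
```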
The submitted classifier was fine-tuned for sentence classification from a RoBERTa-based (Liu et al., 2019) model. The training data for the classifier, also submitted by the participant,6 consists of entities with their labels. In training our model, we also use a RoBERTa-based model but leverage an additional 29 known agency data sets, provided by agency stakeholders,7 as additional training samples. Leveraging additional training samples produces more reliable deep learning models that better capture the distribution of data set entities.
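A minimal sketch of the classification step using the Hugging Face transformers library; the checkpoint path and the label-index convention are placeholders, not the submitted model:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "path/to/roberta-entity-classifier"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model.eval()

entities = ["Survey of Earned Doctorates (SED)", "United States of America (USA)"]
inputs = tokenizer(entities, padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Assume label index 1 means "data set" in this sketch.
for entity, pred in zip(entities, logits.argmax(dim=-1).tolist()):
    print(entity, "->", "data set" if pred == 1 else "not a data set")
```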
The first-place submission to the COL22 competition was a token classification–based method (Nguyen & Nguyen, 2021). Token classification is an approach in which each token within a text is classified; tokens can be words or subword pieces, depending on the tokenizer. Applied in this setting, token classification indicates whether or not a token refers to a data set.

Specifically, the first-place submission used an ensemble of two models built on a custom token classification approach based on masked language models (Nguyen and Nguyen [2021] and Figure 1). The model is trained to learn the context of sentences that typically describe data sets; the specific training details can be found in Nguyen and Nguyen (2021). At inference time, the model converts the tokens into an embedding space that is compared to the average embedding of 100 randomly sampled data set–like tokens and 100 randomly sampled non–data set–like tokens. The strength of this approach is that the model is trained to recognize the tokens that surround data sets as part of the classification process. Data set names are not completely obscured from the model during training, so the model does not rely entirely on context. The drawback of this approach is that the model requires data set and non–data set embedding prototypes that can be compared to the model’s output embeddings at inference time. The submitted algorithm randomly samples from a collection containing 71,627 embeddings for both the data set and non–data set prototypes, and the collections are resampled for each inference batch.
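A minimal sketch of the prototype-comparison step described above; the use of cosine similarity, the tensor shapes, and the random embedding banks are our assumptions for illustration, not the submitted implementation:

```python
import torch
import torch.nn.functional as F

def classify_tokens(token_embeddings, dataset_bank, non_dataset_bank, n_samples=100):
    """token_embeddings: (seq_len, dim); banks: (N, dim) precomputed embeddings."""
    ds_idx = torch.randint(0, dataset_bank.shape[0], (n_samples,))
    nd_idx = torch.randint(0, non_dataset_bank.shape[0], (n_samples,))
    ds_prototype = dataset_bank[ds_idx].mean(dim=0)      # mean data set embedding
    nd_prototype = non_dataset_bank[nd_idx].mean(dim=0)  # mean non-data set embedding
    ds_sim = F.cosine_similarity(token_embeddings, ds_prototype.unsqueeze(0), dim=-1)
    nd_sim = F.cosine_similarity(token_embeddings, nd_prototype.unsqueeze(0), dim=-1)
    return ds_sim > nd_sim  # True where a token looks more like a data set token

# Random tensors stand in for real model-produced embeddings.
tokens = torch.randn(12, 768)
print(classify_tokens(tokens, torch.randn(71_627, 768), torch.randn(71_627, 768)))
```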
In our experiments, we fine-tune a SciBERT-based (Beltagy et al., 2019) transformer for named entity recognition (NER) on the COL22 competition data. Using an NER model removes the dependency on generating data set and non–data set embeddings. The original COL22 data does not include any part-of-speech or data set tagging. Therefore, to train an NER model, we generate part-of-speech (POS) tags using spaCy (Montani et al., 2021) and convert the POS tags to data set label tags by searching the text for the given data set names, labeling the matching tokens with the “B-DAT” and “I-DAT” tags and all other tokens with the generic “O” tag.
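A minimal sketch of generating the NER training tags by matching known data set names against spaCy-tokenized text; the matching and POS handling in our pipeline are more involved than this illustration:

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any spaCy pipeline with a tokenizer

def bio_tags(text: str, dataset_names: list[str]) -> list[tuple[str, str]]:
    """Tag tokens with 'B-DAT'/'I-DAT' where a known data set name occurs, else 'O'."""
    doc = nlp(text)
    tags = ["O"] * len(doc)
    for name in dataset_names:
        name_tokens = [t.text for t in nlp(name)]
        n = len(name_tokens)
        for i in range(len(doc) - n + 1):
            if [t.text for t in doc[i : i + n]] == name_tokens:
                tags[i] = "B-DAT"
                for j in range(i + 1, i + n):
                    tags[j] = "I-DAT"
    return list(zip([t.text for t in doc], tags))

example = "We analyze responses from the Survey of Earned Doctorates each year."
print(bio_tags(example, ["Survey of Earned Doctorates"]))
```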
We additionally train another model of the same type as Nguyen and Nguyen (2021), with a RoBERTa-based transformer, using a refactored version of the Nguyen and Nguyen (2021) submission. Our implementation serves as a prototype should other contributors wish to further develop and train models in that family.
In practice, we use an ensemble classifier (Polikar, 2012) composed of models from the different categories. For the calculation of metrics, we ensemble multiple models by merging the evaluation statistics of each ensembled model on each of the test documents from the COL22 competition, taking the set union of the ensembled models’ predictions. Any time two models share an entry where one model evaluates the data set as a false negative and the other evaluates it as a true positive, the entry is considered a true positive. All false positives remaining after the set union is performed are kept in the ensemble’s calculation. Ensemble methods have been demonstrated to generally perform better than any single classifier on its own. As shown in Table 1, the ensemble method has a better F1 score than any of the individual models.
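A minimal sketch of the set-union merging used for the ensemble; the data structures here are illustrative:

```python
from collections import defaultdict

def ensemble_predictions(per_model_preds: list[dict[str, set[str]]]) -> dict[str, set[str]]:
    """Each element maps document ID -> set of predicted labels for one model."""
    merged: dict[str, set[str]] = defaultdict(set)
    for preds in per_model_preds:
        for doc_id, labels in preds.items():
            # A label missed by one model (false negative) counts as detected
            # for the ensemble if any other model predicts it.
            merged[doc_id] |= labels
    return dict(merged)

model_a = {"doc1": {"survey of earned doctorates"}, "doc2": set()}
model_b = {"doc1": {"herd"}, "doc2": {"survey of doctorate recipients"}}
print(ensemble_predictions([model_a, model_b]))
```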
The methods were evaluated using the private validation data set from the COL22 competition. Predictions are validated by comparing the labels from the human labelers against the labels produced by a model using the Jaccard index (Jaccard, 1912),

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$

where $A$ is the set of words in a human-provided label and $B$ is the set of words in a predicted label.
Using the Jaccard index is computationally efficient and easy to understand; however, it is both too rigid and too flexible in ways that are not helpful for our purposes. With respect to its rigidity, strict string matching has no tolerance for spelling differences, whether introduced by the human labelers, the model, or some other external source (e.g., a mistake by an optical character recognition program that digitized the document). With respect to its flexibility, the Jaccard index operates on sets, which by definition are unordered, discarding an important facet of language.
To more robustly compare the predicted and human-provided labels, we use a normalized Levenshtein distance (Levenshtein, 1966), as implemented by the partial_ratio function from the thefuzz Python package. When two strings have different lengths, the partial_ratio function finds the substring of the longer string with the shortest distance to the shorter string and normalizes that distance by the length of the shorter string, yielding a value between 0 and 100. We divide this value by 100, yielding a value between 0 and 1. This addresses both shortcomings of the Jaccard index: first, any single-character variants, however introduced, cost a single character of distance rather than the entire word; second, order is enforced in the normalized Levenshtein distance, which reduces false positives for sequences that use common words. In our experiments, we consider two strings to be referring to the same data set if the partial_ratio score exceeds a fixed threshold.
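A minimal sketch of the fuzzy comparison using thefuzz; the 0.9 threshold is an illustrative placeholder, not necessarily the value used in our experiments:

```python
from thefuzz import fuzz

def same_dataset(predicted: str, label: str, threshold: float = 0.9) -> bool:
    """Compare two label strings with a normalized partial-ratio score in [0, 1]."""
    score = fuzz.partial_ratio(predicted, label) / 100.0
    return score >= threshold

print(same_dataset("survey of earned doctorates sed", "survey of earned doctorates"))  # True
print(same_dataset("census of agriculture", "american community survey"))              # False
```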
Model | Precision | Recall | F1 |
---|---|---|---|
String Matching | |||
(1) Submitted Simple String Match | 0.90 | 0.07 | 0.13 |
(2) Regex Enhanced Match | 1.0 | 0.06 | 0.11 |
Entity Classification | |||
(3) Submitted Model | 0.70 | 0.08 | 0.15 |
(4) Retrained Model | 0.45 | 0.51 | 0.48 |
Token Classification | |||
(5) NER (SciBERT) | 0.86 | 0.13 | 0.22 |
Ensemble | |||
(2) + (4) | 0.45 | 0.52 | 0.48 |
(2) + (5) | 0.87 | 0.14 | 0.25 |
(4) + (5) | 0.45 | 0.52 | 0.48 |
(2) + (4) + (5) | 0.45 | 0.53 | 0.49 |
Note. Three of the token classification approaches, the Submitted Model (RoBERTa), the Submitted Model (SciBERT) (Nguyen & Nguyen, 2021), and the Retrained Model (RoBERTa) (this work), described in Section 4.5 and Figure 1, randomly sample from precomputed vector embeddings to compute classification scores for each token. The results of these approaches can be highly skewed, and the distribution was not readily amenable to tabular summarization. The performance of these approaches was inferior to the others, and they are not included in Section 6. For completeness, we graphically display the results of running these approaches on the validation set in Figure 2, which shows that the performance of the submitted models heavily skews toward zero for both precision and recall.
In this section, we describe the application of our proposed models (2, 4, and 5 from Table 1) to a large subset of articles from Elsevier’s Scopus database, hereafter called the Scopus Search Corpus. The process for creating the Scopus Search Corpus is described in detail in a companion paper in this special issue (Emecz et al., 2024). In summary, the process involves identifying a subset of records from the Scopus database where licenses from the publishers permit a full-text search and where the records are consistent with the date range and subject areas of specific interest.
In our analysis, we focus on a subset of these articles relevant to data sets of interest to the National Center for Science and Engineering Statistics (NCSES). The Scopus Search Corpus was filtered to this subset using the token_sort_ratio function from the fuzzywuzzy9 package with a threshold of 85/100, and the resulting document set is used for the analysis reported here.
Model | Average Unique Detections per Document | Average Detection Length (characters) |
---|---|---|
Regex Enhanced Match | ||
Retrained Model | ||
NER (SciBERT) | ||
Ensemble |
Note. For each proposed model, the average number of unique detections per document over all documents where that model detects at least one data set, and the average length of detections in characters. For each model and each measurement, a single standard deviation is also reported.
We examine the model outputs in four ways. First, in Table 2 we show, for all documents where a model has at least one detection, the average number of unique detections and average length of the detections in characters. Interestingly, our Retrained Model
Second, we look at the most common detections by each model in Table 3. The most common detections for the Regex Enhanced Match model are almost all acronyms. This seems intuitive because it is common practice to introduce an entity with its full name and an acronym and then refer to it by the acronym alone for the remainder of the text. The sixth entry for the Regex Enhanced Match model is also interesting: though the entity “fields of study” is indicated as a data set to look for, it is likely to be associated with many false positives. The results for the Retrained Model
Rank | Regex Enhanced Match | Retrained Model | NER (SciBERT) |
---|---|---|---|
(1) | R&D intensity | Science and Engineering Indicators | Higher Education |
(2) | SED | Web of Science | Survey of Earned Doctorates |
(3) | FAC | Survey of Earned Doctorates | Survey of Doctorate Recipients |
(4) | SDR | National Survey of College Graduates | in Science and Engineering |
(5) | FSS | Survey of Doctorate Recipients | Survey of College Graduates |
(6) | fields of study | American Community Survey | National Survey of College Graduates |
(7) | GSS | National Survey of College Graduates (NSCG) | Survey of Graduate Students and Postdoctorates in Science and Engineering |
(8) | STEM education | American Community Survey (ACS) | All |
(9) | FFS | Journal of Economic Surveys | Survey |
(10) | BERD | Survey of Earned Doctorates (SED) | Survey of Research and Development |
Note: For each of the proposed models, the top 10 detected data sets are reported. The output classifications are not modified; any spelling errors and premature text truncations are preserved. See (4) and (9) in the NER column for examples.
Third, in our analysis, we examine the least commonly detected data sets in Table 4. As expected, the Regex Enhanced Match model produces reasonable results due to its high precision. The Retrained Model
Rank | Regex Enhanced Match | Retrained Model | NER (SciBERT) |
---|---|---|---|
(1) | S&E indicators | British Labour Force Survey (BLFS) | and |
(2) | fall enrollment survey | Turkish Academic Career Survey | Higher Education Data |
(3) | Science and Engineering Labor Force | Summer Undergraduate Research Experiences (SURE) | National Postdoctoral Association |
(4) | Academic Research and Development | Survey on Science and Technology Research | International STEM Graduate Student in the |
(5) | The State of U.S. Science and Engineering | IIP Patent Database | Longitudinal |
(6) | Business Enterprise Research and Development Survey | NISTEP Corporate Name Dictionary | Higher Education R |
(7) | Survey of Federal Science and Engineering Support | The Survey of Research and Development | Funds for Research and Development |
(8) | IPEDS Enrollment | Large Research Infrastructures OECD Global Science Forum Homepage | Higher Education Research Institute |
(9) | Women, minorities, and Persons with Disabilities | Trends in Citation | Higher Education and |
(10) | National center for science and engineering statistics | Library Genesis Project | Trends in Education |
Note: For each of the proposed models, the bottom 10 detected data sets are reported. The model outputs are not modified; any spelling errors and premature text truncations are preserved. See (1) and (6) in the NER column for examples.
Finally, we quantitatively examine the correctness of the proposed methods’ outputs by calculating an approximate precision value comparing the predicted data sets with previously identified data sets on a subset of the Scopus Search Corpus. The previously identified data sets were found using the methods submitted to the Kaggle competition and validated by a group of experts who viewed each prediction with some surrounding text. A total of
The Regex Enhanced Match model had a notable decrease in precision, from
The Retrained Model
The NER (SciBERT)
Model | Approximate Precision | # of Detections |
---|---|---|
Regex Enhanced Match | ||
Retrained Model | ||
NER (SciBERT) | ||
Ensemble |
Note: The number of detections and approximate precision for the proposed models on the subset of the Scopus Search Corpus for which validated detections exist. The precision is approximate because some detections recorded as false positives may in fact be correct, previously unidentified data sets. See Section 6 for more details on the subset of the Scopus Search Corpus documents used for analysis. See Table 6 for examples of false positives.
Rank | Regex Enhanced Match | Retrained Model | NER (SciBERT) |
---|---|---|---|
(1) | fields of study [57] | Web of Science [23] | Higher Education [9] |
(2) | STEM education [45] | American Community Survey [17] | Current [4] |
(3) | R&D intensity [30] | American Community Survey (ACS) [11] | Current Population Survey [4] |
(4) | SED [25] | Current Population Survey [8] | in Science and Engineering [4] |
(5) | FAC [19] | World Development Indicators [6] | in [4] |
(6) | STEM Education [16] | National Longitudinal Survey of Youth [5] | in Science [3] |
(7) | BERD [11] | Current Population Survey (CPS) [4] | Study [3] |
(8) | GSS [6] | Science Citation Index Expanded [4] | Education Longitudinal Study [2] |
(9) | higher education R&D [6] | Integrated Postsecondary Education Data System [4] | National Study of [2] |
(10) | R&D employment [6] | Web of Science Core Collection [4] | Baccalaureate and Beyond Longitudinal Study [2] |
Note: Top 10 false positives for each of the proposed methods on the subset of the Scopus Search Corpus for which some data sets have previously been identified, recorded as “Name of Dataset [Count]”. Some of the false positives appear to be previously unidentified data sets; for consistency, these are all recorded as false positives in the approximate precision calculation in Table 5. See Section 6 for more information about the subset of documents used for analysis.
In this work, we extracted and improved upon the core methodologies submitted to the COL22 competition. We have developed an open source framework for developing and evaluating new models on the same data. We further described how the methods are applied at scale to uncover the usage of government data sets in the scientific community. Model evaluation results are described in detail in Table 1. Interestingly, the strongest recall values, outside the Ensemble category, are found in the Entity Classification category. Unsurprisingly, the strongest precision values are found in the String Matching category. An ensemble approach composed of String Matching, Entity Classification, and Token Classification models yields the best overall F1 score.
In future work, we propose improving the methodologies presented here in a few areas. The regular expression used in Section 4.3 can be further tested and improved for unknown corner cases. Other approaches to entity extraction (Section 4.4), for example, sentence classification for extraction combined with NER-based classification as discussed in Section 2, could be a fruitful avenue to explore. For token classification–based models (Section 4.5), a closer and more thorough evaluation of all of the generated tags and data set locations would improve the quality of the training data.
Next, a closer look at the COL22 competition inputs and labels suggests that the labels might benefit from a more unified set of metadata that connects common data set acronyms back to a ‘parent’ data set definition, for example, analogous to Istrate et al. (2022). Additionally, a rich taxonomy of labels would likely help improve model analysis by revealing which fields of study have data set labels that are more or less likely to be discovered by an automated algorithm. Future improvements in supervised approaches, like the ones explored and proposed here, would necessitate the development of a true gold standard data set. Though some work on a gold standard has been done (see Section 2), a comprehensive and agreed upon benchmark has yet to be defined.
Further, large language models like ChatGPT (OpenAI, 2023), Falcon LLM (Penedo et al., 2023), and Llama 2 (Touvron et al., 2023) could be applied to this data set and may further yield improved results. Finally, this work focused on supervised classification of known data sets, and future work could explore unsupervised data set discovery.
Ryan Hausen (RH) and Hosein Azarbonyad (HA) acknowledge funding from the National Center for Science and Engineering Statistics grant number 49100422C0028 and the Patrick J. McGovern Foundation. RH acknowledges funding from Schmidt Futures. The authors acknowledge that this research makes use of SciServer, a resource developed and operated by the Institute for Data Intensive Engineering and Science at The Johns Hopkins University. RH acknowledges that he is employed by Johns Hopkins University. HA acknowledges that he is employed by Elsevier.
Bassinet, A., Bracco, L., L’Hôte, A., Jeangirard, E., Lopez, P., & Romary, L. (2023). Large-scale machine-learning analysis [Preprint]. https://hal.science/hal-04121339
Beltagy, I., Lo, K., & Cohan, A. (2019, November). SciBERT: A pretrained language model for scientific text. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (pp. 3615–3620). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1371
Duck, G., Nenadic, G., Brass, A., Robertson, D. L., & Stevens, R. (2013). Bionerds: Exploring bioinformatics’ database and software use through literature mining. BMC Bioinformatics, 14(1), Article 194. https://doi.org/10.1186/1471-2105-14-194
Emecz, A., Mitschang, A., Zdawczyk, C., Dahan, M., Baas, J., & Lemson, G. (2024). Turning visions into reality: Lessons learned from building a search and discovery platform. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.d8a3742f
Fan, L., Lafia, S., Wofford, M., Thomer, A., Yakel, E., & Hemphill, L. (2023). Mining semantic relations in data references to understand the roles of research data in academic literature. In 2023 ACM/IEEE Joint Conference on Digital Libraries (JCDL) (pp. 215–227). IEEE. https://doi.org/10.1109/JCDL57899.2023.00039
Ghavimi, B., Mayr, P., Lange, C., Vahdati, S., & Auer, S. (2016). A semi-automatic approach for detecting dataset references in social science texts. Information Services & Use, 36(3-4), 171–187. https://doi.org/10.3233/ISU-160816
Ghavimi, B., Mayr, P., Vahdati, S., & Lange, C. (2016). Identifying and improving dataset references in social sciences full texts. ArXiv. https://doi.org/10.48550/arXiv.1603.01774
Heddes, J., Meerdink, P., Pieters, M., & Marx, M. (2021). The automatic detection of dataset names in scientific articles. Data, 6(8), Article 84. https://doi.org/10.3390/data6080084
Hou, Y., Jochim, C., Gleize, M., Bonin, F., & Ganguly, D. (2019). Identification of tasks, datasets, evaluation metrics, and numeric scores for scientific leaderboards construction. In A. Korhonen, D. Traum, & L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 5203–5213). Association for Computational Linguistics. https://doi.org/10.18653/v1/P19-1513
Hou, Y., Jochim, C., Gleize, M., Bonin, F., & Ganguly, D. (2021). TDMSci: A specialized corpus for scientific literature entity tagging of tasks datasets and metrics. In P. Merlo, J. Tiedemann, & R. Tsarfaty (Eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume (pp. 707–714). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.eacl-main.59
Istrate, A.-M., Li, D., Taraborelli, D., Torkar, M., Veytsman, B., & Williams, I. (2022). A large dataset of software mentions in the biomedical literature. ArXiv. https://doi.org/10.48550/arXiv.2209.00693
Jaccard, P. (1912). The distribution of the flora in the alpine zone. New Phytologist, 11(2), 37–50. https://doi.org/10.1111/j.1469-8137.1912.tb05611.x
Lane, J., Gimeno, E., Levitskaya, E., Zhang, Z., & Zigoni, A. (2022). Data inventories for the modern age? Using data science to open government data. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.8a3f2336
Lee, C. M. (2021). Model summary. GitHub. https://github.com/Coleridge-Initiative/rc-kaggle-models/blob/main/2nd%20Chun%20Ming%20Lee/Model%20Summary.pdf
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics-Doklady, 10(8), 707–710.
Li, P., Liu, Q., Cheng, Q., & Lu, W. (2021). Data set entity recognition based on distant supervision. The Electronic Library, 39(3), 435–449. https://doi.org/10.1108/EL-10-2020-0301
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. ArXiv. https://doi.org/10.48550/arXiv.1907.11692
Luan, Y., He, L., Ostendorf, M., & Hajishirzi, H. (2018). Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In E. Riloff, D. Chiang, J. Hockenmaier, & J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 3219–3232). Association for Computational Linguistics. https://doi.org/10.18653/v1/D18-1360
Mikhail, A. (2021). Strong, careful, simple: Solution report. GitHub. https://github.com/Coleridge-Initiative/rc-kaggle-models/blob/main/3rd%20Mikhail%20Arkhipov/Coleridge_Report_Mikhail_Arkhipov_3d_place.pdf
Montani, I., Honnibal, M., Honnibal, M., Landeghem, S. V., Boyd, A., Peters, H., Samsonov, M., McCann, P. O., Geovedi, J., Regan, J., Orosz, G., Altinok, D., Kristiansen, S. L., Roman, Bot, E., Fiedler, L., Howard, G., Phatthiyaphaibun, W., Tamura, Y., . . . Henry, W. (2021). explosion/spaCy: v3.1.2: Improved spancat component and various bugfixes (Version v3.1.2). Zenodo. https://doi.org/10.5281/zenodo.5226955
Nguyen, K., & Nguyen, M. (2021). Model summary. GitHub. https://github.com/Coleridge-Initiative/rc-kaggle-models/blob/main/1st%20ZALO%20FTW/MODEL_SUMMARY.pdf
OpenAI. (2023). GPT-4 Technical Report. ArXiv. https://doi.org/10.48550/arXiv.2303.08774
Otto, W., Zloch, M., Gan, L., Karmakar, S., & Dietze, S. (2023). GSAP-NER: A novel task, corpus, and baseline for scholarly entity extraction focused on machine learning models and datasets. ArXiv. https://doi.org/10.48550/arXiv.2311.09860
Pan, H., Zhang, Q., Dragut, E., Caragea, C., & Latecki, L. J. (2023). DMDD: A large-scale dataset for dataset mentions detection. Transactions of the Association for Computational Linguistics, 11, 1132–1146. https://doi.org/10.1162/tacl_a_00592
Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., & Launay, J. (2023). The RefinedWeb dataset for Falcon LLM: Outperforming curated corpora with web data, and web data only. ArXiv. https://doi.org/10.48550/arXiv.2306.01116
Polikar, R. (2012). Ensemble learning. In C. Zhang & Y. Ma (Eds.), Ensemble machine learning: Methods and applications (pp. 1–34). Springer New York. https://doi.org/10.1007/978-1-4419-9326-7_1
Prasad, A., Si, C., & Kan, M.-Y. (2019). Dataset mention extraction and classification. In V. Nastase, B. Roth, L. Dietz, & A. McCallum (Eds.), Proceedings of the Workshop on Extracting Structured Knowledge (pp. 31–36). Association for Computational Linguistics. https://doi.org/10.18653/v1/W19-2604
Schwartz, A. S., & Hearst, M. A. (2003). A simple algorithm for identifying abbreviation definitions in biomedical text. In R. B. Altman, A. K. Dunker, L. Hunter, T. A. Jung, & T. E. Klein (Eds.), Pacific Symposium on Biocomputing 2003 (pp. 451–462). IEEE. https://doi.org/10.1142/9789812776303_0042
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., . . . Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. ArXiv. https://doi.org/10.48550/arXiv.2307.09288
Yao, R., Hou, L., Ye, Y., Zhang, J., & Wu, J. (2019). Method and dataset mining in scientific papers. In 2019 IEEE International Conference on Big Data (Big Data) (pp. 6260–6262). https://doi.org/10.1109/BigData47090.2019.9006262
Younes, Y., & Scherp, A. (2023). Question answering versus named entity recognition for extracting unknown datasets. IEEE Access, 11, 92775–92787. https://doi.org/10.1109/ACCESS.2023.3309148
Zhao, H., Luo, Z., Feng, C., Zheng, A., & Liu, X. (2019). A context-based framework for modeling the role and function of on-line resource citations in scientific literature. In K. Inui, J. Jiang, V. Ng, & X. Wan (Eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language (pp. 5206–5215). Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1524
The Levenshtein distance equation, adapted from https://en.wikipedia.org/wiki/Levenshtein_distance.
May 17, 2024
This article contains changes to a previous version of the article. There are three notable changes to the article: the described metric, the values in Table 1, and the values in Table 5. Each change is detailed below:
The Described Metric. The metric, as described in the current version of the text, is the Levenshtein distance normalized by the length of the shorter string. However, in a previous version of the text, the metric was described as the Levenshtein distance normalized by the length of the longer string. The previous description was paired with an equation which has been since removed and with a table that showed a comparison of the proposed normalized Levenshtein distance and the Jaccard index. That table was removed because the comparison that was made between the two metrics was not relevant.
The Values in Table 1. The values reported are different from those previously reported. The source of the differences is twofold. The first is that the environment in which the values were calculated was also being used for active development and feature engineering; the iterative development and experimentation process changed the underlying raw data. The second source is the role random sampling plays in the masked language–based models. As described in more detail in Section 4.5 and Figure 1, these models classify each token in the input text. As a part of the classification process, the models generate a vector embedding for each input token. The input vector embeddings are compared with the mean vector of random samples from two precomputed collections of token-vector embeddings: one collection contains vector embeddings for tokens that are known to not be associated with data sets, and the other contains vector embeddings for tokens that are known to be associated with data sets. The classification scores for the input tokens are based on their similarity to the mean vectors from the data set and non–data set collections. In the case of the masked language–based models submitted to the COL22 competition (Nguyen & Nguyen, 2021), the data set and non–data set collections of vector embeddings each contain over 70,000 embeddings, of which the model samples 100 for each batch of tokens. In our evaluation, this sample size seems too small to capture the variance in the underlying distribution of vector embeddings. This issue manifests itself as a large variance in model performance with respect to both precision and recall. As can be seen in Figure 2, the model performance is heavily skewed toward 0 for both precision and recall on the validation data set. Occasionally, the models have high precision or recall scores (never both), which seems to indicate that there are portions of the vector-embedding distribution that are useful for inference, but that a substantial portion of the distribution is not. In contrast, our implementation of the same approach, shown in the right-most panel of Figure 2, samples 99% of its vector-embedding collections and has a small variance in performance with respect to precision and recall.
The Values in Table 5. The reported values are different from the previously reported values in that the number of detections for the Regex Enhanced model increased by 7, and therefore, the number of detections in the ensemble also increased by 7. The precision values are also slightly higher (≤0.02) for the individual models and 0.06 for the ensemble.
To promote open and reproducible science we have published a repository with the required model data files via Zenodo with the following DOI 10.5281/zenodo.10993568. Further, there is a Jupyter Notebook in the code repository that can be run to reproduce the values in Table 1.
©2024 Ryan Hausen and Hosein Azarbonyad. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.