
Discovering Data Sets in Unstructured Corpora: Discovering Use and Identifying New Opportunities

Published on Apr 02, 2024

Abstract

Federal statistical agencies are keenly aware that scientific research is not the only way in which their data assets are used for evidence building. Entities around the world have been building services on top of curated corpora of scientific research to help provide insight into the importance of the works in the collection. This also provides a much-needed framework for statistical agencies and other data set creators to search for and find the usages and impacts of those data sets. While the other articles in this special issue largely apply machine learning models to find data sets in an extensively curated corpus, this article starts with a much less structured framework and examines the potential to discover how data sets are used in writing that is targeted at a broader base of users than scientific researchers. It describes the challenges and lessons learned from the exercise and highlights an ever-growing value proposition for organizing collections of work.

Keywords: Cooperative Extension Service, Beautiful Soup, National Agricultural Statistics Service, web scraping, Evidence Act, data set impact


1. Introduction

Federal statistical agencies are keenly aware that scientific research is not the only way in which their data assets are used for evidence building. Their data assets, both data sets and reports (Dukes, 2015), can be used in many other ways, including in writings targeted at a broader base of users than scientific researchers. The U.S. Department of Agriculture’s National Agricultural Statistics Service (USDA’s NASS) provides a particularly salient use case. It is charged with producing data intended for use by a broad constituency, including staff, researchers, survey respondents (mostly American farmers), policymakers, and other government agencies. Good-quality information on how NASS data are used, and on which data assets are most valuable to farmers, is critical. NASS can use this information to brief field interviewers with the aim of improving survey responses, and to invest in the assets that have the highest utilization and impact.

This article examines the potential to provide NASS with that evidence. It reports on a pilot approach to find how USDA’s NASS data assets are cited and used in a much less structured body of writings than the highly curated corpus described in other articles in this special issue. It also shows how missed opportunities to credit NASS for data production can be identified and potentially addressed.

USDA, like many other agencies, has an institutional structure designed to translate research results to the broader community, its National Institute of Food and Agriculture (NIFA). NIFA funds and manages the state and local Cooperative Extension (hereafter Extension) system1 (Alston et al., 2010; Alston et al., 1999; Lane, 2020) with participation from more than 100 land grant institutions, each of which produces nonacademic reports designed to inform the public. These Extension reports are heavily used by the same customer base that NASS targets as its primary source for collecting data to create its data sets: American farmers.

However, using automated machine learning tools to trace how NASS data sets have been used in the Extension reports presented a number of significant challenges that are likely to also be encountered by other federal agencies embarking on similar efforts. It also presented an opportunity to identify citation gaps and remediate them.

A major challenge is the lack of structure in the ways the data can be identified and accessed. There is a lack of structure in nomenclature: many NASS data assets, like Hog Reports or Crop Production Reports, do not have consistent and direct citation identifiers, or are referenced using URLs that are unique to a specific iteration of the release. There is also a lack of structure in consolidating Extension report output. In particular, the Extension Service is decentralized, with each institution hosting its output on its own website. That means that more than 100 sites and their underlying pages needed to be web-scraped to pull in all the relevant documents in a standard format. There is a lack of structure in the reporting outputs themselves, which are not available in a structured, machine-readable format. Finally, there is a lack of attribution to NASS data sets. Possibly because of the lack of systematic identifiers, many nonacademic research outputs do not cite the data set that was used, either in the text or in the references. The opportunity is to identify patterns in these attribution gaps, referred to here as ‘missed citation opportunities,’ so that remediation strategies can be developed.

Our contribution is to document how each of these challenges was addressed and to illustrate a way of addressing the missed opportunities. The article also provides code that can be reused or improved by other agencies that need to organize a web-based, heterogeneous corpus of articles to find how data are mentioned.

The first step was to create a machine-readable corpus. That required a systematic approach to finding the ways in which Extension reports are made available to the public (for example, as PDF reports or database releases) and converting the plethora of Extension websites2 and website links that we targeted for discovery of those data assets into a set of machine-readable documents.

The second step was to use several different approaches, both string searches and more advanced machine learning approaches based on an earlier Kaggle competition (Kaggle, 2021; Lane et al., 2022), to find the ways in which data assets are mentioned in those machine-readable documents.

The final step was to use a simple bag-of-words (BoW) approach, which measures word frequency within a document, to generate input features for training a classifier that helps identify nonacademic research output that could have cited NASS data but did not. The missed citation opportunities for NASS visibility have two possible causes. One is that the research used NASS data but did not cite it. The other is that the research could have used NASS data but did not.

The article concludes with lessons learned and identifies a set of practical steps that agencies can take to ensure that more uses of their data can be readily discovered and identified.

2. The Broader Context

The broader context has been fully addressed in another article in this special issue (Potok, 2024). Briefly, the Federal Data Strategy, the Foundations for Evidence-Based Policymaking Act (2019), and the Information Quality Act (2000) all provide agencies with principles and practices to improve the quality and availability of their assets, both to the public and to other agencies. The Office of Management and Budget continues to issue guidance to implement these statutes and to direct agencies to make better and more efficient use of their data assets (Vought, 2019). In 2022, the Advisory Committee on Data for Evidence Building, which was mandated in the Evidence Act, delivered a report with recommendations to the Office of Management and Budget, including a path to establishing a National Secure Data Service (NSDS) (Advisory Committee on Data for Evidence Building, 2022). While a number of recommendations (3.6, 3.7, 5.3, and 5.7) are germane, two are particularly relevant for this article. Recommendation 5.2 recommends that the NSDS “provide a technological process to support access to searchable and discoverable data” to support the discovery of data assets for evidence building. In addition, recommendation 5.4 calls for the NSDS to collect and house a searchable inventory of projects that highlights what data sets are being used and for what purposes. This article shows how a searchable inventory can fulfill the requirements of both recommendations, and especially how critical it is for the latter one. It also documents the challenges of using other approaches.

State and local entities are particularly important producers and users of statistical data. The Bureau of Labor Statistics, for example, has a well-established state network of labor market information agencies; the Census Bureau has State Data Centers; the Bureau of Justice Statistics has Statistical Analysis Centers. USDA is no exception with its Extension Service mentioned previously.

A major challenge for most agencies seeking to connect with local communities is the distributed nature of the state system. In the case of USDA, however, the Extension directors and administrators created the Extension Foundation in 2006 to help increase capacity and scale programs and innovation. As part of that mission, the foundation has been developing a chat service (ExtensionBot) that will require services to be built around Extension’s decentralized content to make its information machine readable (Appendix).3 This service is being developed incrementally and, while each Extension entity retains its own authority to participate, some information is now being scraped from publicly available websites.

Furthermore, when Congress passed the Consolidated Appropriations Act of 2021, it requested that USDA create a blue-ribbon panel tasked with investigating ways to strengthen and broaden the impact of the land-grant system through increased collaboration. To address this request, NIFA sought assistance from the National Academies of Sciences, Engineering, and Medicine (the National Academies), which established the Committee on Enhancing Collaboration Between Land-Grant Universities and Colleges. The committee’s objective was to explore how collaborative efforts can be enhanced to boost knowledge generation, problem-solving, and the creation of opportunities within the food and agricultural knowledge system. The committee’s 2022 report includes a suggestion for NIFA to encourage collaboration across the decentralized land-grant system by convening information-exchange or sandbox workshops. A centralized repository or service to colocate the critical work of the Extension Service could provide just that, as linkages between experts on similar topics or usages of the same data sets could be a seamless feature of the central platform. Similarly, “uniform, shared data management systems that enable seamless access to emerging information” is one of the principles for success mentioned in the report (National Academies of Sciences, Engineering, and Medicine, 2022).

In principle, activities like these that provide centralized services could make data usage more discoverable—and thus help make it easier for agencies like USDA to inform their user community and appropriators about past uses or the potential to use their data sets for evidence building.

3. Finding Data Assets in Nonscientific Publications on the Web

Finding how data assets are used in nonscientific publications is challenging, as noted in the introduction, due to a lack of structure. The following section describes how we addressed each of these issues.

Lack of standard nomenclature: In common with many agencies, NASS has multiple ways in which data assets can be identified by researchers and others in their publications. The first is simply to use the official names of the data sets: the Census of Agriculture and the Agricultural Resource Management Survey, or their abbreviated acronyms, for example. The second is to refer to USDA official publications that are in the form of PDFs and provide some narrative around the underlying data set. NASS provided the team with a list of 22 such publications as well as links to the publications. It turns out that each official publication also has a number of alternative links, typically associated with the different releases or versions of the publication. These aliases had to be extracted from the non-editable HTML source code of the original publication homepage. For instance, the “Agricultural Prices” official publication had 11 different releases from June 30, 2022, until March 31, 2023, which were labeled in the HTML as “agpr0622.pdf” (June 30, 2022, release) through “agpr0323.pdf” (March 31, 2023, release), while the original link was “https://usda.library.cornell.edu/concern/publications/c821gj76b.” When factoring in all the aliases for the 22 publications, a total of 233 search strings (representing the data assets to discover) were extracted.
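To make the alias-extraction step concrete, the following is a minimal sketch, assuming release files are exposed as anchor links whose filenames end in a short stem plus MMYY digits (as with “agpr0622.pdf”); the actual structure of the landing pages may differ, so the pattern would need adjusting.

import re
import requests
from bs4 import BeautifulSoup

def release_aliases(publication_url: str) -> set:
    """Collect release-file aliases (e.g., 'agpr0622.pdf') linked from a
    publication landing page."""
    html = requests.get(publication_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    aliases = set()
    for a in soup.find_all("a", href=True):
        # Assumed filename pattern: short stem plus MMYY digits, e.g., agpr0622.pdf.
        match = re.search(r"([a-z]+\d{4}\.(?:pdf|txt|zip))$", a["href"])
        if match:
            aliases.add(match.group(1))
    return aliases

print(release_aliases(
    "https://usda.library.cornell.edu/concern/publications/c821gj76b"))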

Lack of consolidation: The second issue is that the publications possibly using the data assets are hosted on a multitude of different sites. The Extension Foundation provides a search service, which is confined to 158 Extension-focused websites. These 158 websites served as the corpus for our discovery of NASS data set usages. But the websites are all structured differently. In particular, they take different approaches to site hierarchy, they use a variety of modalities for disseminating content (PDF, HTML, video, etc.), and each has different security measures in place to prevent nefarious acts by external entities.

The structural challenges are substantial. The root sites (first level) of the 158 websites varied widely in the number of one-level-down sublinks (second level). To quantify, the leading university’s Extension page had 574 sublinks one level down from the root site, while the Extension page with the fewest had only three. Each of these sites then had its own unknown number of levels further down the hierarchy where data set usage may have occurred.

As illustrated in Figure 1, the combination of 7,886 second-level Extension Service URLs (content possibly using the data assets) and 233 aliases of official publications provided by USDA (data assets) resulted in 1,837,438 possible citation opportunities to look for. This volume of citation opportunities grew very rapidly and posed one of the study’s most significant challenges.

1a. Structure of the Corpus

1b. Listing of 22 Official USDA Publications

Official Report Name | Link to Report on Cornell
Agricultural Prices | https://usda.library.cornell.edu/concern/publications/c821gj76b
Farm Labor | https://usda.library.cornell.edu/concern/publications/x920fw89s
Agricultural Land Values | https://usda.library.cornell.edu/concern/publications/pn89d6567
Farms and Land in Farms | https://usda.library.cornell.edu/concern/publications/5712m6524
Farm Production Expenditures | https://usda.library.cornell.edu/concern/publications/qz20ss48r
Farm Computer Usage | https://usda.library.cornell.edu/concern/publications/h128nd689
Census of Agriculture | https://www.nass.usda.gov/Publications/AgCensus
Crop Production | https://usda.library.cornell.edu/concern/publications/tm70mv177
Grain Stocks | https://usda.library.cornell.edu/concern/publications/xg94hp534
Acreage | https://usda.library.cornell.edu/concern/publications/j098zb09z
Crop Production Annual Summary | https://usda.library.cornell.edu/concern/publications/k3569432s
Crop Progress | https://usda.library.cornell.edu/concern/publications/8336h188j
Hogs and Pigs | https://usda.library.cornell.edu/concern/publications/rj430453j
Cattle on Feed | https://usda.library.cornell.edu/concern/publications/m326m174z
Cold Storage | https://usda.library.cornell.edu/concern/publications/pg15bd892
Milk Production | https://usda.library.cornell.edu/concern/publications/h989r321c
Broiler Hatchery | https://usda.library.cornell.edu/concern/publications/gm80hv35d
Cattle | https://usda.library.cornell.edu/concern/publications/h702q636h
Livestock Slaughter | https://usda.library.cornell.edu/concern/publications/rx913p88g
Dairy Products | https://usda.library.cornell.edu/concern/publications/m326m1757
Chickens and Eggs | https://usda.library.cornell.edu/concern/publications/fb494842n
Honey | https://usda.library.cornell.edu/concern/publications/hd76s004z

Figure 1. (a) The structure of the corpus; (b) list of U.S. Department of Agriculture (USDA) official publications considered.

Lack of machine-readable output: Once these citation opportunities were identified, the next step was to turn the web content of the Extension Service into machine-readable text. Web scraping4 was used to automatically collect the textual data from the 7,886 second-level URLs in a structured format and to search for possible citations of USDA NASS official publications. The required steps were: (1) identify the target websites (the 7,886 second-level URLs of the Extension Service); (2) inspect the HTML code of the target web page(s) to identify the elements to extract (for this work we considered sublinks and PDFs); (3) leverage appropriate library packages (Beautiful Soup, Scrapy, or Selenium) to extract the desired information; (4) run the scraper; and (5) store the data in a spreadsheet or database for further analysis or use.

The tool chosen was Beautiful Soup, a widely used Python library that parses HTML and XML documents. It works by parsing the document and creating a parse tree, a hierarchical structure that represents the document’s structure. Beautiful Soup then allowed us to navigate and search through this tree to extract data and information from the document. When parsing an HTML or XML document, Beautiful Soup first breaks the document down into a series of nested data structures called tags. These tags represent the different elements in the document, such as headers, paragraphs, and links. We then used Beautiful Soup to navigate this tree of tags using methods such as find_all() and select(). These methods allowed us to search for specific tags or attributes, extract data from those tags, and navigate to other tags and their attributes. One of the key advantages of Beautiful Soup is its ability to work with imperfect or incomplete HTML documents and still extract relevant information with ease.
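A minimal sketch of steps 2 through 5 follows, under the assumption that sublinks and PDFs can be identified from anchor tags alone; the root URL and the output filename are illustrative, not the study’s actual configuration.

import csv
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def scrape_links(root_url: str) -> list:
    """Collect every anchor on a root page, flagging links to PDFs."""
    html = requests.get(root_url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for a in soup.find_all("a", href=True):
        url = urljoin(root_url, a["href"])  # resolve relative links
        rows.append({"source": root_url,
                     "url": url,
                     "is_pdf": url.lower().endswith(".pdf"),
                     "text": a.get_text(strip=True)})
    return rows

rows = scrape_links("https://extension.oregonstate.edu/")  # illustrative root site
with open("sublinks.csv", "w", newline="") as f:  # step 5: store for analysis
    writer = csv.DictWriter(f, fieldnames=["source", "url", "is_pdf", "text"])
    writer.writeheader()
    writer.writerows(rows)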

The security policies of the websites presented another hurdle. When running the parser against certain Extension sites, several had denial-of-service protections that blocked repeated automated calls. This required a more manual and time-consuming approach to parsing the HTML code.

Once the target data set references were identified and the HTML text converted to XML, simple string matching was applied to find whether the identified data assets were mentioned in any of the documents. However, this effort revealed no solid reference to the NASS data assets (or any of the 233 aliases) in the 158 Extension Service (first-level) URLs or the 7,886 second-level sublinks, regardless of modality. Table 1 summarizes these results.
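The matching pass itself is simple. The sketch below shows the idea with a toy document and three of the search strings; the full run checked all 233 strings against the text extracted from each scraped document.

def find_mentions(doc_text: str, search_strings: list) -> list:
    """Case-insensitive substring search for each data asset alias."""
    text = doc_text.lower()
    return [s for s in search_strings if s.lower() in text]

search_strings = ["Census of Agriculture", "agpr0622.pdf", "agpr0323.pdf"]  # 3 of 233
doc = "Prices are drawn from the 2022 Census of Agriculture county profiles."
print(find_mentions(doc, search_strings))  # ['Census of Agriculture']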

Drilling down further into the site hierarchies might have uncovered a more appropriate level at which content was making use of the NASS data, but this was not feasible for this case study given time and funding constraints, and it would not have yielded a scalable solution.

Table 1. Summary of web search results.

Item | Level 1 (top level) | Level 2 (under top level)
Total number of links after removing duplicates | 158 links (158 unique) | 7,886 (4,117 unique)
Links not returning results | 3 | 6

Results for the 155 sites with working URLs:

Mean number of sub-URLs | 50.9 | 134.2
Variance | 84.7 | 162.3
Mean number of PDFs | 0.7 | 1.5
Variance | 2.1 | 13.4
National Agricultural Statistics Service data set usages found | 0 | 0

4. Attribution: Finding NASS Data Assets in Nonscientific Publications in a Curated Corpus

The previous section described the challenges associated with finding data in web-scraped publications across all Extension sites.

This section describes a more sophisticated approach that could be applied to a curated corpus to determine whether NASS data sets were attributed in Extension research. This approach was possible because of the Extension Foundation’s ExtensionBot project, mentioned earlier, which develops machine-readable JavaScript Object Notation (JSON) formats for published Extension reports. We were provided access to 1,545 machine-readable Extension official publications from Oklahoma State (OKState) and 3,142 from Oregon State (ORState). Two types of machine learning models, derived from a recent Kaggle competition (Lane et al., 2022), were applied to this corpus. The research focused on identifying usage of three particular USDA data sets: the Census of Agriculture, the Agricultural Resource Management Survey (ARMS), and the Rural-Urban Continuum Codes (RUCC).

The first model was based on a relatively simple recognition approach from the Kaggle competition (Ming, 2021). It started with a set of training JSON files that each contain the text for an article. After training, the solution read unprocessed papers as JSON files. The preprocessing required extracting all the text as a list of strings (the paper’s text was assumed to be the value mapped to the ‘text’ key in the dictionary). This preprocessing involved multiple steps. First, the text was segmented into sentences. Next, potential data set names were recognized using various techniques, including capitalization patterns, associated acronyms, and instances where names frequently appear in proximity to the term “data” or incorporate a concise set of keywords (such as “data set,” “survey,” etc.).
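A simplified sketch of that heuristic pass follows; the actual Kaggle solution is considerably more elaborate, and the regular expression here is only an approximation of the capitalization-plus-keyword idea described above.

import re

KEYWORDS = ("data set", "dataset", "survey", "census", "study")

def candidate_dataset_names(text: str) -> list:
    """Flag capitalized multi-word spans in sentences containing dataset keywords."""
    candidates = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if not any(k in sentence.lower() for k in KEYWORDS):
            continue  # keep only sentences with dataset-like keywords
        # Two or more consecutive capitalized words, e.g., 'Census of Agriculture'.
        for span in re.findall(
                r"(?:[A-Z][a-z]+ )(?:of |the |and )?(?:[A-Z][a-z]+ ?)+", sentence):
            candidates.append(span.strip())
    return candidates

text = ("Estimates come from the Census of Agriculture. "
        "The Agricultural Resource Management Survey provides cost data.")
print(candidate_dataset_names(text))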

The second model used a machine learning approach that trains a transformer classifier to predict whether a string refers to a data set, survey, or report, also based on the results of the Kaggle competition (Nguyen & Nguyen, 2021). The model used the Schwartz-Hearst algorithm to identify LONG-NAME (acronym) pattern candidate strings, applying a pretrained Hugging Face binary classifier and setting a minimum document frequency. The model was modified by further training the pretrained classifier to include a few specific aliases for the three USDA data sets of interest to the researchers.
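A sketch of that pipeline is below, with two stated assumptions: the open source abbreviations package stands in for the competition’s Schwartz-Hearst implementation, and a publicly available Hugging Face model is only a runnable stand-in for the fine-tuned dataset/not-dataset classifier.

from abbreviations import schwartz_hearst  # pip install abbreviations
from transformers import pipeline

text = ("Farm counts are taken from the Census of Agriculture (COA) and the "
        "Agricultural Resource Management Survey (ARMS).")

# 1. Schwartz-Hearst: extract LONG-NAME (ACRONYM) definition pairs.
pairs = schwartz_hearst.extract_abbreviation_definition_pairs(doc_text=text)

# 2. Score each long name with a binary classifier. This sentiment model is
#    only a runnable stand-in; the study used a classifier fine-tuned to
#    label strings as data set / not data set.
classifier = pipeline("text-classification",
                      model="distilbert-base-uncased-finetuned-sst-2-english")
for acronym, long_name in pairs.items():
    print(acronym, "->", long_name, classifier(long_name))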

The first approach found mentions of only 15 data sets in the more than 3,000 Oregon State official publications. Of those, only six were either the Census of Agriculture or ARMS, and no RUCC usages were identified. It found only 20 data sets in the more than 1,500 Oklahoma State articles, surveys, and papers, of which only three were Census of Agriculture or ARMS data.5

The second approach identified 17 data assets in the Oregon State official publications, among which six were either the Census of Agriculture or ARMS. It found 32 data asset mentions in the Oklahoma State official publications, among which six were either the Census of Agriculture or ARMS. For example, in OKState’s publication Agritourism in Oklahoma, the second model identified that the author cited the direct spending amount for agritourism in Oklahoma from the “Census of Agriculture” in the body of the article. Another example, flagged by both models, is OKState’s publication Plants in the Classroom: The Story of Oklahoma Pecans, where the author also cited the “Census of Agriculture.”

Frequency reports of all the potential data assets discovered within the JSON files are listed in Table 2 and Table 3.

It is possible that it is too naïve to expect data sets to be named in Extension Service official publications, or perhaps our results captured the full extent of Extension usage of NASS data sets. It is also possible that the data sets were being referred to via hyperlinks. To capture hyperlink usages, a more sophisticated approach was used: an algorithmic web scraper was applied to both the Oregon State and the Oklahoma State JSON corpora to create two new corpora consisting of all hyperlinks found in the articles of the original corpora (Onan, 2019; Park & Thelwall, 2003; Safder et al., 2020; Tuarob et al., 2016).

Initially, some broken links stalled the program by causing it to time out. Therefore, a restriction was imposed so that each article had a maximum of 10 seconds to be scraped. This resulted in a minimal loss of links.
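A sketch of that safeguard follows, using the requests library’s timeout argument as a stand-in for the study’s 10-second cap; the article URL is illustrative.

import requests
from bs4 import BeautifulSoup

def links_in_article(url: str) -> list:
    """Return all hyperlinks in an article, skipping slow or broken URLs."""
    try:
        resp = requests.get(url, timeout=10)  # 10-second cap per article
        resp.raise_for_status()
    except requests.RequestException:
        return []  # broken or slow link: accept a minimal loss of links
    soup = BeautifulSoup(resp.text, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

hyperlink_corpus = {
    url: links_in_article(url)
    for url in ["https://extension.okstate.edu/fact-sheets/example.html"]  # illustrative
}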

Then the new data set of links was searched against root link sources where data sets are expected to be found (based on sources provided by the research partners).6 In the case of Oregon State University, a total of 72,336 sublinks were found in the 3,142 official publications. Of these, 14 links to data assets were found. Four of them led to general links that do not reference specific data (e.g., https://quickstats.nass.usda.gov/).

We also tried a broader method that selected hyperlinks simply containing “USDA” in the URL and conducted a manual check on all of them. Among the 69,845 Oklahoma State sublinks, we found 91 hyperlinks containing “USDA” in the URL, but only three related to the Rural-Urban Continuum Codes data set, and none to the Census of Agriculture or ARMS data sets. Eight of them referred to the Quick Stats page of the National Agricultural Statistics Service.
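This broader pass reduces to a substring filter over the hyperlink corpus; a self-contained sketch with illustrative links is below.

# The article-to-links mapping built by the scraper sketched earlier;
# the entries here are illustrative.
hyperlink_corpus = {
    "article-1": ["https://quickstats.nass.usda.gov/",
                  "https://extension.okstate.edu/about.html"],
    "article-2": ["https://www.ers.usda.gov/data-products/rural-urban-continuum-codes/"],
}

usda_links = sorted(
    url
    for links in hyperlink_corpus.values()
    for url in links
    if "usda" in url.lower()  # keep any URL mentioning USDA for manual review
)
print(len(usda_links), "candidate USDA links for manual checking")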

This very low hit rate was validated by manually reviewing a small, randomly selected number (25) of official publications from each of the two JSON corpora (50 total) and finding all links contained within. We examined each of the 50 publications and manually identified every link on the corresponding host web page. A relevant link was defined as one that either appeared in the main body of the page text itself or directly followed the article text in a More Information, Related Articles, or Bibliography section. Links to images and navigational links were not included in this analysis. The relevant hyperlinks were then placed in a spreadsheet, where each entry was analyzed and followed to determine the nature of the hyperlink (i.e., another article or a data set).

In the case of Oregon State, a total of 47 links were found in the 25 sampled articles, an average of 1.88 links per official publication. Many of these 47 links (18, or an average of 0.72 links per article) were listed under a More Resources tab below the article’s body. An example is shown in Figure 2. Similar tabs were found in most articles, leading us to believe that the reason for the relatively low number of data sets relative to the number of links is that most links connected to other university-hosted content.

Figure 2. A More Information section after an article (Oregon State University).

To break it down, of the 47 hits for Oregon State University, 15 were subtopics (links to related topics found elsewhere on the site); three were videos (links to YouTube videos referenced in the article); two were collections (collections of resources and articles connected to a topic); nine were articles (other articles on the Oregon State website); five were PDFs; and 11 were other (sources that did not fit within any of the previously mentioned categories7).

In the case of Oklahoma State, a total of 63 links were found in the 25 sampled articles, an average of 2.52 links per article. The average would be significantly lower if not for three articles that had 10 or more links each. All sampled articles had text and were accessible, but nine articles had no links in their text body or anywhere else. This corpus of 63 links can also be categorized: two were data links (links where data can be found); four were broken links (links that do not work and whose purpose cannot be ascertained); 21 were PDFs; seven were organizations or promotional material for organizations (e.g., the homepage for Oklahoma State or social media links); 17 were Oklahoma State Extension surveys or fact sheets; and eight were “other.”

This manual analysis helps confirm the model-generated outputs: most of the links within the JSON corpora are not links to data assets.

5. Potential to Identify and Exploit Missed Opportunities

The limited discovery of direct citations of USDA data and data assets in the USDA Extension Program is unsurprising. Earlier project work (Allen, 2020; Lane et al., 2020) revealed the challenges of working with uncurated and unstructured data. However, the results have helped surface key audiences that could be targeted by agencies in their data dissemination strategies. This was achieved by implementing machine learning (ML) approaches to (1) identify missed opportunities for direct NASS report citations in university Extension Service publications, and (2) discover the topics covered in the publications of two specific Extension Services, namely, the Oklahoma State and Oregon State JSON corpora. Those approaches are described below.

5.1. Using Machine Learning to Identify Missed Opportunities

The limited utilization of NASS reports as citations in the Extension Service raises the possibility of missed opportunities for appropriate referencing. To identify the potential for such missed opportunities, we applied another type of machine learning framework. The framework uses BoW features to train a classifier that estimates the probability of an Extension Service document either requiring a citation to the USDA reports or presenting an opportunity to be strengthened by using an official statistic. Figure 3 shows the general scheme of the proposed framework, followed by text that describes its two major components, the BoW representation and the classification method, in more detail.

Figure 3. General scheme of the proposed machine learning framework for estimation of U.S. Department of Agriculture (USDA) report classification missed opportunities.

Bag-of-words is a versatile and commonly employed technique for text representation in natural language processing tasks. When applied to estimating the probability that an Extension Service document requires a citation to the NASS reports, BoW can serve as a powerful tool for training a classifier. Each document was first tokenized into individual words, and a vocabulary was constructed from the unique words across all documents. The frequency of occurrence of each word in a document was then computed, resulting in a vector representation known as the document’s BoW feature vector. We then trained a simple classifier to learn the relationship between a document’s textual content and the probability of it requiring a citation to the USDA reports by mapping the BoW feature vectors to their corresponding citation labels.8 TF-IDF (term frequency-inverse document frequency) weighting was incorporated; this assigned weights to the words based on their importance in the document and across the entire corpus, capturing the relative significance of each word more accurately. Only the top 35 most frequent words in the training set were used as input features, and to keep the model simple, no ordering (n-gram) information was considered. We designed a simple feed-forward neural network (Glorot & Bengio, 2010) with one hidden layer containing 70 nodes to predict the appropriate citations of a given Extension Service document from its input feature vector. We used the validation set method, with 90% of the data for training and 10% for validation, to train and evaluate the neural network, with performance measured by accuracy. We used the Adam optimizer and 50 training epochs to learn the parameters (Kingma & Ba, 2014). The proposed neural network showed 90% accuracy in predicting the correct citations of NASS reports (Flach, 2019). The success of this simple approach suggests that more sophisticated techniques have potential for identifying missed citations or citation opportunities in documents.
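The sketch below reproduces the shape of this pipeline with scikit-learn rather than the authors’ exact implementation: TF-IDF features restricted to the 35 most frequent terms, a one-hidden-layer network of 70 nodes trained with Adam for up to 50 iterations, and a 90/10 train/validation split. The documents and labels are toy placeholders for the labeled Extension corpus.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy stand-ins for the labeled corpus (1 = should cite a NASS report,
# 0 = no citation warranted).
docs = [
    "corn acreage and yield estimates for the county",
    "hog inventory trends from quarterly reports",
    "recipe for preserving tomatoes at home",
    "4-h youth leadership activities and planning",
    "farm production expenditures and input costs",
    "pruning backyard apple trees in early spring",
    "cattle on feed placements and marketings",
    "managing household budgets for young families",
    "milk production per cow and herd size",
    "lawn care and moss removal on pavement",
]
labels = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]

vectorizer = TfidfVectorizer(max_features=35)  # top 35 terms, TF-IDF weighted
X = vectorizer.fit_transform(docs)

# 90% train / 10% validation, as in the study.
X_train, X_val, y_train, y_val = train_test_split(
    X, labels, test_size=0.10, random_state=0)

# One hidden layer of 70 nodes, Adam optimizer, up to 50 epochs.
model = MLPClassifier(hidden_layer_sizes=(70,), solver="adam",
                      max_iter=50, random_state=0)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))

Restricting the vocabulary to 35 terms keeps the model deliberately small; whether the study’s 90% accuracy is reproducible depends on the labeled corpus, not on this toy data.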

Such a trained model could be used in a number of ways. One is to identify community outreach opportunities: it could provide NASS with information about how many reports are in areas that could use NASS data but either do not use or do not cite it. NASS could target outreach efforts to the relevant Extension programs and either encourage them to cite NASS data directly or develop an incentive system to reward appropriate citation (Lane et al., 2024). Another is to provide NASS with information about how to make its data products more readily citable (Puebla & Lowenberg, 2024). Finally, NASS could use the information to identify areas of opportunity for new data: topics and questions on which reports are being generated and that could be informed by the production of new NASS data.

5.2. Using Machine Learning to Discover the Topics Covered by Publication

A parallel activity, which enabled a deeper dive, was conducted using the Oklahoma State and Oregon State Extension corpora. The latent Dirichlet allocation (LDA) multicore method of the Gensim package in Python was used to train a specific LDA model on each university’s Extension corpus. The topic number limit was arbitrarily set to 20, with 10 words in each topic. This provides some initial insights into which topics have similarities with existing USDA data assets and could be referring to USDA data, where such references either do not exist or are not discoverable. Figures 4a and 4b provide an overview.

Figure 4a. The main topics covered in Oklahoma State University reports.

Figure 4b. The main topics covered in Oregon State University reports.
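For readers who wish to replicate this step, a minimal sketch using Gensim’s LdaMulticore is shown below; the tokenized documents are placeholders for one university’s Extension corpus.

from gensim import corpora
from gensim.models import LdaMulticore

# Placeholder tokenized documents; in the study, each corpus held one
# university's Extension publications.
documents = [
    ["wheat", "yield", "fertilizer", "soil"],
    ["cattle", "feed", "pasture", "grazing"],
    ["irrigation", "water", "soil", "crop"],
]

dictionary = corpora.Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# 20 topics, as in the study; report the top 10 words per topic.
lda = LdaMulticore(corpus=bow_corpus, id2word=dictionary,
                   num_topics=20, passes=10)
for topic_id, words in lda.print_topics(num_topics=20, num_words=10):
    print(topic_id, words)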

6. Future Possibilities and Recommendations

The artificial intelligence (AI) tools used in this work, combined with the rise in popularity of ChatGPT and other AI tools, present a unique opportunity to discover how data sets are used in a wide variety of reports. However, the challenges encountered by this study may still prevent development of operational and scalable solutions. The Cooperative Extension Service provides an extraordinarily useful case study of what might be possible if a coordinated and proactive approach is put forth to address these challenges.

Currently, Extension reports are often scattered among various departments and organizations at their institutions. In line with recommendations from the Committee on Enhancing Coordination Between Land-Grant Universities and Colleges, NIFA could advocate for the construction of consistent data pipelines that bring institution-wide data to a single source and format. Researchers and analysts could then access such a consolidated central site or service, with the results posted on a data dashboard like the Democratizing Data dashboard produced for scientific publications. This central site or service for all Extension content would eliminate the biggest challenge faced by this use case and offer a more streamlined communications channel for the public, the larger Cooperative Extension system, and the institutions as they seek greater collaboration and convergence on critical research needs.9

A key feature of any initiative should be that the design is dynamic, not static, and that all participants benefit. Extension institutions should see value in being involved and in updating their content regularly, so that new content is added and stale content removed, and so that the end users, the farmers and producers, can readily find and use the information.

NASS’s experience is similar to that of other federal statistical agencies. As part of the general effort to build structured data, there are clear operational improvements for NASS and for the federal statistical system. Formal citations make it easier to identify data sets but cost authors time to formulate. That cost would be reduced if the data, both in report format and when accessed via the Quick Stats query tool, had a digital object identifier (DOI), which NASS and other agencies could incorporate into their dissemination workflow.10 Many publications used hyperlinks as citations, which are easier to use but increase the difficulty of identifying and tracking the usage of data sets.11

Some Extension-wide standards for authors could also be helpful. For example, when graphs and figures are included in reports, it would be helpful to include the source data, as per the Nelson memo (Nelson, 2022). In addition, a report workflow could prompt an author to provide a citation, and an agency’s data set website could include an import-citation feature usable with just a few clicks.

Finally, the potential to incentivize report writers to cite data is enormous. Data experts and domain experts can be celebrated and rewarded through the connection that a citation link makes between them and the data (Lane et al., 2024; Spector et al., 2022). The infrastructure now being built through search and discovery platforms developed by USDA (such as the 5Ws Dashboard produced by NASS) and the Democratizing Data project being developed by a consortium of statistical agencies can be designed to formalize the reward structure. For example, a dashboard could be automatically produced by the Extension Foundation, filterable by each state institution, providing a service to the public and giving credit where credit is due to authors, the Extension Service, and USDA.

NASS has been proactively responding to the findings of this work and reaching out to Extension Service staff to work with them as a collaborative team to discuss ways of making this possible. All of this in turn relies on federal staff understanding and possessing the skillsets needed to support and tap into such infrastructures, build and support dashboards, and catalog, classify, and label data sets.

Legislative mandates and other federal policy, like the Evidence Act and the Year 2 Report of the Advisory Committee on Data for Evidence Building (2022), emphasize the importance of documenting the value of data assets for evidence building. The results of this case study make it clear that much work is left to be done to meet these needs. A coordinated effort across federal agencies to identify scalable approaches that can be adopted across the federal statistical system would be in the spirit of a National Secure Data Service.

The importance of the approach is indisputable. Many uses of federal data in general, and of the federal statistical system in particular, go well beyond the scope of scientific publications; this article charts one path to identifying and expanding such use. We hope that this work provides some ideas that the federal government more broadly can follow as it looks to surface the impact of federal data investments across a wide range of stakeholders.


Disclosure Statement

Nick Pallotta, Mark Locklear, Xiangyu Ren, Victor Robila, and Adel Alaeddini have no financial or non-financial disclosures to share for this article.


References

Advisory Committee on Data for Evidence Building. (2022). Advisory Committee on Data for Evidence Building: Year 2 report. Washington, DC. https://www.bea.gov/system/files/2022-10/acdeb-year-2-report.pdf

Allen, R. B. (2020). Metadata for social science datasets. In J. I. Lane, I. Mulvany, & P. Nathan (Eds.), Rich search and discovery for research datasets (pp. 40–52). Sage.

Alston, J. M., James, J. S., Andersen, M. A., & Pardey, P. G. (2010). A brief history of US agriculture. In Persistence pays (pp. 9–21). Springer. https://doi.org/10.1007/978-1-4419-0658-8_2

Alston, J. M., Pardey, P. G., & Smith, V. H. (1999). Paying for agricultural productivity. International Food Policy Research Institute.

Consolidated Appropriations Act, Pub. L. No. 116–260, 134 Stat. 1182 (2021). https://www.congress.gov/116/plaws/publ260/PLAW-116publ260.pdf

Dukes, C. W. (2015). Committee on National Security Systems (CNSS) Glossary, CNSSI No. 4009. Department of Defense.

Flach, P. (2019). Performance evaluation in machine learning: The good, the bad, the ugly, and the way forward. Proceedings of the AAAI Conference on Artificial Intelligence, 33(1), 9808–9814. https://doi.org/10.1609/aaai.v33i01.33019808

Foundations for Evidence-Based Policymaking Act of 2018, Pub. L. No. 115-435, 132 Stat. 5529 (2019). https://www.congress.gov/bill/115th-congress/house-bill/4174

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Y. W. Teh, & M. Titterington (Eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (pp. 249–256). Proceedings of Machine Learning Research. https://proceedings.mlr.press/v9/glorot10a.html

Information Quality Act, Pub. L. No. 106-554, 114 Stat. 2763 (2000). https://www.congress.gov/106/plaws/publ554/PLAW-106publ554.pdf

Kaggle. (2021). Kaggle: Show US the data. Retrieved February 9, 2022, from https://www.kaggle.com/c/coleridgeinitiative-show-us-the-data

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv. https://doi.org/10.48550/arXiv.1412.6980

Lane, J. (2020). Democratizing our data: A manifesto. MIT Press.

Lane, J., Gimeno, E., Levistkaya, E., Zhang, Z., & Zigoni, A. (2022). Data Inventories for the modern age? Using data science to open government data. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.8a3f2336

Lane, J., Mulvany, I., & Nathan, P. (Eds.) (2020). Rich search and discovery for research datasets: Building the next generation of scholarly infrastructure. Sage.

Lane, J., Spector, A. Z., & Stebbins, M. (2024). An Invisible hand for creating public value from data. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.03719804

Ming, L. C. (2021, October 20). Transformer-enhanced heuristic search [Conference presentation]. Coleridge Initiative’s Show US the Data Conference, online. https://youtu.be/H3uOkBzsAFg?si=x99Bq6-wWmvUnFQB&t=1768

National Academies of Sciences, Engineering, and Medicine. (2022). Enhancing coordination and collaboration across the land-grant system. The National Academies Press. https://doi.org/10.17226/26640

Nelson, A. (2022). Ensuring free, immediate, and equitable access to federally funded research [Memo]. https://www.whitehouse.gov/wp-content/uploads/2022/08/08-2022-OSTP-Public-access-Memo.pdf

Nguyen, T. K., & Nguyen, Q. A. M. (2021, October 20). Context similarity via deep metric learning [Conference presentation]. Coleridge Initiative’s Show US the Data Conference, online. https://youtu.be/H3uOkBzsAFg?si=x99Bq6-wWmvUnFQB&t=1768

Onan, A. (2019). Two-stage topic extraction model for bibliometric data analysis based on word embeddings and clustering. IEEE Access, 7, 145614–145633. https://doi.org/10.1109/ACCESS.2019.2945911

Park, H. W., & Thelwall, M. (2003). Hyperlink analyses of the World Wide Web: A review. Journal of Computer-Mediated Communication, 8(4), Article JCMC843. https://doi.org/10.1111/j.1083-6101.2003.tb00223.x

Potok, N. (2024). Data usage information and connecting with data users: U.S. mandates and guidance for government agency evidence building. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.652877ca

Puebla, I., & Lowenberg, D. (2024). Building trust: Data metrics as a focal point for responsible data stewardship. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.e1f349c2

Safder, I., Hassan, S.-U., Visvizi, A., Noraset, T., Nawaz, R., & Tuarob, S. (2020). Deep learning-based extraction of algorithmic metadata in full-text scholarly documents. Information Processing & Management, 57(6), Article 102269. https://doi.org/10.1016/j.ipm.2020.102269

Spector, A. Z., Norvig, P., Wiggins, C., & Wing, J. M. (2022). Data science in context: Foundations, challenges, opportunities. Cambridge University Press.

Tuarob, S., Bhatia, S., Mitra, P., & Giles, C. L. (2016). AlgorithmSeer: A system for extracting and searching for algorithms in scholarly big data. IEEE Transactions on Big Data, 2(1), 3–17. https://doi.org/10.1109/TBDATA.2016.2546302

Vought, R. (2019). Improving implementation of the Information Quality Act [Memo]. https://www.whitehouse.gov/wp-content/uploads/2019/04/M-19-15.pdf


Appendix

ExtensionBot Import JSON Specification v1.0

This JSON object is a description of a resource that you would like to make available through the ExtensionBot chatbot. It will be added to the knowledge registry, and it could point to any format resource you would offer to launch from a web page. We will not be directly copying your resource, only providing links to it. The more information you can provide about your resource, the more accurately it can be suggested to users. You can submit multiple resources by putting an array of these objects in a single JSON file. 

Value | Type | Frequency | Description
title | string | optional | Title of web page or resource
link | string | required | URL to web page or resource
thumbnail | string | optional | URL to thumbnail
institution | string | optional | Name of institution providing the resource
author | array of strings | optional | List of author(s) of web page or resource
publish_date | string | optional | Date web page or resource was published: YY-MM-DD (date of last update by source, not when exported to Eduworks)
content_type | string | optional | Type of web page or resource. Default: HTML
content | array of JSON objects | required | An array of content objects (sections in the resource). Each object includes: header, text, images. Each object should be added as HTML is traversed down a web page or resource, separated by 'header' if possible
content_header | string | optional | Header text, with HTML tags if possible
content_text | string | required | All text content, including HTML tags if possible, of web page or resource. If HTML is not available, please include line breaks between paragraphs.
content_images | array of JSON objects | optional | An array of image objects. Each object includes: image_url, image_text, image_caption
image_url | string | optional | URL to image source
image_text | string | optional | Alternate text of image
image_caption | string | optional | Caption of image
category | array of strings | optional | An array of labels, taxonomy terms, or broad topics that categorize the web page or resource
questions_answered | array of strings | optional | A list of questions that the referenced resource answers or responds to. This will help the chatbot to better understand and select the best resources for questions that it receives

Empty JSON Structure

[{
  "title": "",
  "link": "",
  "thumbnail": "",
  "institution": "",
  "author": "",
  "publish_date": "",
  "content_type": "",
  "content": [
    {
      "content_header": "",
      "content_text": "",
      "content_images": [
        {
          "image_url": "",
          "image_text": "",
          "image_caption": ""
        },
        {
          "image_url": "",
          "image_text": "",
          "image_caption": ""
        }...
      ]
    },
    {
      "content_header": "",
      "content_text": "",
      "content_images": [
        {
          "image_url": "",
          "image_text": "",
          "image_caption": ""
        },
        {
          "image_url": "",
          "image_text": "",
          "image_caption": ""
        }...
      ]
    }...
  ],
  "category": ["", ""],
  "questions_answered": ["", ""]
}...
]

Example JSON Object:

{
  "title": "How can I get rid of the moss on my pavement?",
  "link": "https://extension.oregonstate.edu/ask-expert/featured/how-can-i-get-rid-moss-my-pavement",
  "thumbnail": "https://extension.oregonstate.edu/sites/default/files/images/2022-12/adobestock534169728.jpeg",
  "institution": "Oregon State University Extension Service",
  "author": "Chris Rusch, OSU Extension Douglas County",
  "publish_date": "2023-02-22",
  "content_type": "HTML",
  "content": [
    {
      "content_header": "<h1 class=\"page-header\"><span class=\"notranslate\">How can I get rid of the moss on my pavement?</span></h1>",
      "content_text": "<p>The moss is growing really fast on my driveway with the recent winter weather. I need a method to remove the moss that can be used in the wet season and is pet safe.</p> \n\n<p>Thank you for your question! Moss is a common annoyance amongst homeowners in the Pacific Northwest, especially during periods of higher precipitation.</p>",
      "content_images": []
    },
    {
      "content_header": "<h2>Moss removal</h2>",
      "content_text": "<p>Here are some strategies for killing the moss and removing it from your property.</p> <p>Please note that if you choose to use bleach, you should <strong>never mix bleach with ammonia or other household cleaners.</strong> Mixing them will produce poisonous gas. It is also recommended to wear protective equipment such as rubber gloves and goggles when using bleach.</p> <ul><li><strong>Preparation: </strong>Sweep dirt and debris from the blacktop with a stiff broom. Cover nearby plants with plastic garbage bags to protect them from damage.</li> <li><strong>Bleach treatment:</strong> Combine 1 cup of household bleach with 1 gallon of water in a large bucket, and stir in 1 cup of liquid dish or laundry detergent. Douse small patches of moss with the solution, or apply it liberally to larger areas with a sprayer. Allow the bleach mixture to set for at least 5 minutes to kill the moss. Rinse the pavement thoroughly with the garden hose and flood the area generously. Bleach treatment may remain effective against recurring moss for up to a year.</li> <li><strong>Vinegar treatment:</strong> Alternatively, you can spray vinegar over the moss and algae deposits. Leave it for 15-20 minutes, and repeat the process for a week. Wash your driveway afterward to remove the excess vinegar.</li> <li><strong>Moss Removal:</strong> Spray the moss off the blacktop with the garden hose. Scrub stubborn patches with a stiff-bristle brush or lift them off with a wide, flat scraper, if necessary. These plants have no roots so the green carpets will lift easily.</li>",
      "content_images": [
        {
          "image_url": "https://extension.oregonstate.edu/sites/default/files/styles/full/public/images/2022-12/img3002.JPG?itok=AjKg7Xxc",
          "image_text": "This image of a mossy driveway was submitted by the Ask an Expert client. It also features their personal vehicle and the edge of their garden.",
          "image_caption": "This image of a mossy driveway was submitted by the Ask an Expert client."
        },
        {
          "image_url": "https://extension.oregonstate.edu/sites/default/files/styles/full/public/images/2022-12/adobestock534169728.jpeg?itok=XGHkpXQk",
          "image_text": "Water pouring from a rain gutter onto a mossy residential driveway.",
          "image_caption": "Moss thrives in moist environments, so trying to reduce the amount of water will prevent it from growing. This can be challenging in the rainy Pacific Northwest climate."
        }
      ]
    },
    {
      "content_header": "</ul><h2>Moss prevention</h2>",
      "content_text": "<p>Here are some tips for discouraging the moss from growing on your driveway.</p> <ul><li><strong>Sun exposure:</strong> Moss is a shade-loving plant that does not thrive in sunny conditions. Prune trees, shrubs and ornamentals back to allow as much sunlight as possible to shine on the pavement. Keep grass and plants bordering the area trimmed short to reduce shading along the edges. The short walls near your driveway may be blocking sunlight and creating an ideal environment for moss growth.</li><li><strong>Water restriction:</strong> Moss grows best with plenty of moisture. Adjust lawn sprinklers to keep them from watering the pavement, and try to avoid splashing the pavement when you hand water nearby planting areas.</li>",
      "content_images": []
    }
  ],
  "category": ["pests", "weeds", "diseases", "gardening", "lawn", "landscape"],
  "questions_answered": ["How can I discourage moss from growing in my pavement?", "How do I remove moss from my pavement?"]
}


©2024 Nick Pallotta, John M Locklear, Xiangyu Ren, Victor Robila, and Adel Alaeddini. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
