Data access, use, and reuse are crucial for empirical science and evidence-based policymaking, and rely on metadata to facilitate data discovery and utilization by users and producers alike. Metadata quality is pivotal for the federal government to understand data available for evidence building, enabling agencies to identify data production gaps and redundancies, and enhancing evidence quality through reproducible research. The alignment of federal data agency incentives with the private sector, alongside technological advancements, now supports feedback-driven data classification, leveraging machine learning for improved data discoverability and categorization. This article outlines the multiple classification needs of one statistical agency, the National Center for Science and Engineering Statistics, and proposes a machine learning approach for classifying data sets based on usage in research, aligning with legislative and policy frameworks to enhance data governance, interoperability, and utility for evidence-based decision-making.
Keywords: data set search and discovery, metadata for data sets, labels for data sets, topic identification, research fields
Data access, use, and reuse are foundational to both empirical science and evidence-based policymaking. That foundation depends on data having labels so that two key contributors to the data ecosystem—users and producers—can search for and discover data. Just as labels on food cans provide information to shoppers, stores, and warehouses, data labels (or metadata) can provide information to data users so that they can find data sets useful for their work and to data producers so that they can find which parts of their data portfolios are most useful to serving their missions.
In practical terms, one end result of building a foundation of high-quality metadata is that the federal government can have better information about the data available for evidence building (National Academies of Sciences, Engineering, and Medicine, 2022). Data producers from different agencies can answer questions such as ‘What agencies are producing data about the science and engineering workforce, food security, climate change, advanced manufacturing, synthetic biology, or artificial intelligence?’ so that all agencies have better information on gaps and redundancies in the production of data, research, and evidence. Another end result is higher quality evidence because research is more likely to be reproducible. Because researchers can find answers to such questions as ‘What are the federally funded data sets about the science and engineering workforce, food security, climate change, advanced manufacturing, synthetic biology, or artificial intelligence?’ and ‘Who are the experts on topics using those data sets?’, existing empirical research can be more readily identified, reproduced, and built on to produce high-quality evidence.
Good product metadata is vital in the private sector because providing better information to shoppers helps drive sales; likewise, providing better information to stores helps inform product placement, and providing it to warehouses helps reduce costs. Companies like Google, Facebook, and Amazon spend billions of dollars trying to convey information about their respective products every day and gathering metadata about how well they have done so. As the former chief data scientist at Facebook and cofounder of Cloudera noted, “the best minds of my generation are thinking about how to make people click ads,” quickly following that with the statement, “That sucks” (Vance, 2011).
While federal data agencies have had different incentives and outreach approaches in the past, those have now changed to be more in line with the private sector (Lane et al., 2024). Their historical mandate has been to produce data that are of the highest possible quality, and they have not had the resources to spend on informing the public of their utility. The funding authorizations for data collection are typically not only agency- and context-specific but also quite broad. For example, the National Center for Science and Engineering Statistics (NCSES) was reauthorized in 2010 under Section 505 of the America COMPETES Act to “collect, acquire, analyze, report, and disseminate statistical data related to the science and engineering enterprise in the United States and other nations that is relevant and useful to practitioners, researchers, policymakers, and the public” (America COMPETES Reauthorization Act of 2010, 2011).1 Agency information about data use has most often been indirect in the form of federal advisory committees (Holland & Lane, 2018); there are few other options available. Simply put, there have historically been no government-wide common standards or ontologies to categorize data, since the reasons for agencies to collect data can be as heterogeneous as the data they collect. As a result, many agencies have historically either not created data catalogs or have used highly manual or rule-based approaches (Chief Data Officer Council, 2022a, 2022b).
Now, however, both the incentives for federal agencies and the technologies available to get feedback have changed (Potok, 2024). Agencies have to produce statistics on their product use as part of the Foundations for Evidence-Based Policymaking Act (2019; hereafter Evidence Act). New technologies like the Democratizing Data platform reduce the cost of getting feedback (Emecz et al., 2024) because the extensive existing metadata associated with publications—researchers and their research topics—could be used to provide information to classify how data are used in support of agency missions and research areas.
In this article we outline a possible machine learning approach to that classification task. The conceptual framework is that researchers make use of data to study a particular field. The words they use in articles they publish, and their persistent work in that field, can be analyzed and grouped to signal information about that topic (Klochikhin & Boyd-Graber, 2020). The framework, which is described in more detail below, can be customized for different agencies and across different research groups. In addition, the framework can be used to make data more easily discoverable by researchers—an explicit requirement in the Evidence Act (2019) is that statistical agencies provide a single standard application process (SAP) to researchers.
This article describes the approach with the particular use case of one statistical agency, the National Center for Science and Engineering Statistics, which has been identified as a lead agency to develop approaches that inform the federal data ecosystem (National Academies of Sciences, Engineering, and Medicine, 2022).
A flurry of recent legislation in the United States, including the Evidence Act (2019) and the CHIPS and Science Act (2022; hereafter the CHIPS Act), has highlighted the importance of access to data for evidence building, as well as the importance of metadata so that the data can be found, understood, and used.
Metadata requirements are explicit or implicit in other contexts as well. The recent executive order on the “Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence” (Exec. Order No. 14110, 2023), for example, requires agencies to “develop adequate infrastructure and capacity to sufficiently curate agency data sets for use in training, testing, and operating AI… and enable sound data governance and management practices, particularly as it relates to data curation, labeling, and stewardship.” The CHIPS Act (2022) requires agencies like the National Science Foundation (NSF) to provide evidence about the economic impact of its regional R&D investments. It also charged the NCSES with piloting a National Secure Data Service Demonstration project to “develop, refine, and test models … for a government-wide data linkage and access infrastructure for statistical activities conducted for statistical purposes” (NCSES, 2023a). In addition, both the Inflation Reduction Act (2022) and the Infrastructure Investment and Jobs Act (2021) require evidence about the impacts of their investments. Explicit in all of these is the imperative to enhance workforce education and jobs, which cuts across the missions of the NSF, the Departments of Labor and Education, and state and local government agencies, as well as to promote economic growth, which cuts across the missions of the Department of Commerce and the NSF.
The operational infrastructure has been put in place. The Evidence Act (2019) expanded membership of the Interagency Council on Statistical Policy (ICSP) to include statistical officials across major cabinet agencies. In addition, Title II explicitly requires each federal agency to develop and maintain a comprehensive data inventory and provide a clear and comprehensive understanding of the data assets in the possession of the agency. It requires agencies to create the position of chief data officer, charged with fulfilling the requirement. The operational arm of government charged with supporting the Chief Data Officers Council’s data inventory efforts, data.gov, is designated “to provide access to government open data to the public, achieve agency missions, drive innovation, fuel economic activity, and uphold the ideals of an open and transparent government.” The Evidence Act also required the establishment of an SAP so that researchers could have visibility into restricted-use data assets across the U.S. federal system and a standardized path for applying for access to those same data assets (National Center for Science and Engineering Statistics, 2023b). The initial 2020–2023 pilot of the Democratizing Data Initiative with U.S. federal statistical agencies, which searched research publication text to discover how federal data sets are used in research, identified a set of lessons learned; one of them was the need to enhance data set and citing-publication metadata so that researchers could easily find data that cover a specific topic or are used in specific fields of research (Madray, 2023).
Some guidelines are under development. There has been a great deal of activity on the part of federal agencies—notably the National Institute of Standards and Technology—in establishing a Research Data Framework (Hanisch et al., 2021, 2023). The federal statistical system, which is an important producer of public data, has also been proactive in promoting transparency and discoverability. A recent National Academies report directed at achieving these goals (National Academies of Sciences, Engineering, and Medicine, 2022) noted, however, that many agencies do not have formal guidelines to decide what information to provide to their user communities, and recommended that the ICSP develop and implement a multiagency pilot project to “explore and evaluate employing existing metadata standards and tools to accomplish data sharing, data access, and data reuse,” with NCSES as a participant. The NCSES Democratizing Data pilot, which is described in more detail elsewhere (Lane et al., 2024; Potok, 2024), showed how to find data set mentions in publications and reuse the publication metadata.
The following sections describe in detail how that information could be used to create data set labels.
Classifying anything is hard work. Organizing information from different sources typically relies on a standard taxonomy or set of taxonomies for organizing knowledge. And that requires answering the core classification question ‘What do the categories classify?’, a discussion that goes back as far as Aristotle. A taxonomy shared within a community can serve three useful functions (Lambe, 2007, p. 24):
A descriptive function, in that it standardizes language, which enables coordination and knowledge-building around the entities or concepts described by that language; a classification function, in that it reveals connections and relationships among different areas of knowledge in predictable, useful, and commonly understood ways; and a sense-making function, in that it overlays useful, common structure (or “semantics”) onto different fields of science.
In sum, analysts and policymakers can use taxonomies to make sense of significant patterns and relationships within and across fields, including identifying gaps in knowledge (Lambe, 2012).
Our proposed framework is based on both the second function—classifying data sets by how they are used by researchers—and the first function—their use of common terms to describe the work that they are doing. This approach is consistent with Aristotle’s writings: his first set of classifications was of beings; the second was of objects in the world to which words correspond (Studtmann, 2021). A people-based framework has the advantage of stability; while words and terminologies change quite rapidly, particularly in emerging fields, the work that researchers do tends to be longer term and more persistent (Owen-Smith, 2024).
Statistical agencies have historically used rule-based methods to manually create classifications for entities other than data. Robust taxonomies exist, for example, for industries (Haver, 1997; Yuskavage, 2007) and scientific fields (Galindo-Rueda & López-Bassols, 2022; Organisation for Economic Co-operation and Development [OECD], 2015). Many agencies are moving to machine learning rather than manual approaches to perform classification tasks. In the case of industry classifications, Statistics of Income staff have used predictive supervised models to validate and fill in industry codes as labels (Oehlert et al., 2022). Bureau of Labor Statistics staff have used machine learning to classify occupations and injury types (Measure, 2017, 2023). The U.S. Department of Agriculture has funded research to use machine learning techniques to categorize investment in food safety research (Avery et al., 2022; Klochikhin & Lane, 2017).
Research classifications are typically agency specific. The National Institutes of Health (NIH) has an extensive history of mapping its projects to its Research, Condition, and Disease Categorization (RCDC) codes (Aslan et al., 2023; Talley et al., 2011). The NSF has experimented with text analysis approaches to classify its research portfolio (Klochikhin & Boyd-Graber, 2020). The NCSES, the U.S. statistical agency charged with collecting statistics on science and engineering, while relying on a rule-based system to manually create categories of science, has funded some work with researchers who use award and dissertation text documents to characterize what research has been done (Klochikhin & Boyd-Graber, 2020; Lane & Boyd-Graber, 2014–2017). A White House–led effort to produce cross-agency classifications on awards, Federal RePORTER,2 was discontinued in 2022.
In terms of data classification, the Chief Data Officer Council, in its cross-agency report, made it clear that robust metadata are essential for the discoverability, usability, and governance of federal data (Chief Data Officer Council, 2022b, p. 4). The CDO Council report also notes that much work remains to be done to achieve agency flexibility and interoperability. For example, in the data.gov schema, the guidance suggests that agencies compile values from Theme, Place, Stratum, and Temporal Keywords (such as vegetation, Gulf Coast, Hurricane Katrina) (General Services Administration [GSA], Office of Government and Information Services [OGIS], and Office of Management and Budget [OMB], 2023a). And while the guidance suggests that users may search a data catalog to find data sets related to a specific topic of interest, the approach is largely rule based, suggesting manual tags such as COVID-19, coronavirus, usg-artificial-intelligence, and usg-ai-training-data (GSA, OGIS, & OMB, 2023b). The GSA explicitly states that agencies are encouraged to further include keywords that would improve discoverability, but no specific guidance is provided.
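For concreteness, the snippet below shows what such a keyword-tagged catalog entry might look like; it is a hypothetical, abridged illustration in the spirit of the DCAT-US guidance (the title, tag values, and omission of required fields are assumptions), and it shows why topic discovery built on manually curated tags remains largely rule based.

```python
# Hypothetical, abridged catalog entry in the spirit of DCAT-US v1.1; the
# values are illustrative and many required schema fields are omitted.
dataset_entry = {
    "title": "Survey of Earned Doctorates",
    "description": "Annual census of individuals receiving research doctorates.",
    "keyword": [                      # manually curated tags drive discovery
        "doctorates",
        "science and engineering workforce",
        "usg-artificial-intelligence",
    ],
    "theme": ["education", "workforce"],
}

# Rule-based topic discovery then reduces to exact keyword matching.
print("usg-artificial-intelligence" in dataset_entry["keyword"])  # True
```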
In sum, there is no commonly accepted federal standard for mapping data sets to either agency missions or research fields.
Publishers and analysts of scientific research have developed journal- and publication-level taxonomies and ontologies that serve a different function than labeling data sets. These classification schemes were intended to organize a growing corpus of scientific literature (Garfield, 1963; Gross & Gross, 1927; Price, 2011) but have also been used to analyze scientific relationships among authors.
Three of the most commonly used databases for scientific reference and citation data are Elsevier’s Scopus, Clarivate’s Web of Science, and Digital Science’s Dimensions. Each of these databases includes access to multiple classification schemes, depending on the use case.
Elsevier’s Scopus has several classifications, some of which are proprietary (Singh et al., 2021). Elsevier’s All Science Journal Classification includes four subject areas (Health Sciences, Life Sciences, Physical Sciences, and Social Sciences and Humanities), mapped exclusively to 27 fields and 334 subfields in a three-level hierarchy, with a separate multidisciplinary category.3 Science-Metrix, now part of Elsevier, has its own classification at both the journal and publication levels.4 Elsevier also makes available additional classifications; these include, for example, the Australian and New Zealand Standard Research Classification’s Field of Research classification system;5 the Japanese Society for the Promotion of Science’s grants-in-aid for scientific research program’s category definitions;6 as well as the Fields of Research and Development from the Frascati Manual of the Organisation for Economic Co-operation and Development.7
Clarivate’s Web of Science has a similar classification scheme at the level of journals, in which journals can have multiple classifications.8 Clarivate offers 17 additional research areas that reflect the interests of different countries or organizations;9 these include, for example, the ANVUR Category Scheme,10 China SCADC Subject Categories,11 in addition to Essential Science Indicators12 and citation topics13 (Clarivate, 2023).
In contrast to Scopus and Web of Science, Dimensions does not include its own proprietary classification system. Instead, it offers a “series of in-built categorization systems which are used by funders and researchers around the world, and which were originally defined by subject matter experts outside of Dimensions.” The classifications are built using machine learning techniques (Dimensions, 2022).
In sum, there is a proliferation of publishers’ classification systems, but there is potential for the metadata on authors and subjects to be used to map to agency missions and research areas.
An extensive community has built up around data discovery for the social sciences, driven by the need to document and make survey data discoverable. Iratxe Puebla and Daniella Lowenberg provide an excellent overview in their article in this special issue (Puebla & Lowenberg, 2024), but high-profile examples include the Data Documentation Initiative (Vardigan et al., 2008) and the Dublin Core (International Organization for Standardization, 2019). More broadly, the Research Data Alliance (Berg-Cross et al., 2015; Berman & Crosas, 2020; Parsons, 2013) has developed standards to ensure that data are Findable, Accessible, Interoperable, and Reusable (FAIR) (Wilkinson et al., 2016). And many infrastructures have been built, such as Dataverse (King, 2007) and NIH’s Generalist Repository Ecosystem Initiative (GREI) (Goodey et al., 2022).
In sum, the challenge faced by community-driven systems in describing data is that the metadata documentation standards that have been developed describe data by the way they are produced, not by the purposes for which they can be used. This is akin to labeling the cans in the supermarket by the production process, not by whether they can be used to make spaghetti sauce or soup. In addition, the community has struggled to achieve full-scale adoption and has not developed an approach to creating labels that describe how data are used (Buneman et al., 2020; Hughes et al., 2023).
As a small and potentially agile principal statistical agency located within the NSF, the NCSES was identified by the National Academies (2022) report as “positioned to be nimble and innovative and to share what it learns with the broader statistical community.” NCSES is responsible for producing statistical data and analysis in six areas of interest: The Science and Engineering Workforce; Research and Development; Higher Education Research and Development; Government Funding for Science and Engineering; STEM Education; and Innovation and Global Competitiveness.
In support of its mission, NCSES collects data through some 15 major surveys.14 It also produces a variety of reports, including Infobriefs, Infocharts, and Infobytes, that make use of these data to provide more accessible summary information and to integrate the data into a larger, meaningful context for stakeholders. Some of those, like the National Science Board’s Science and Engineering Indicators, draw on many different data sets (both internal and external) across several different reports. NCSES’s portfolio of data assets is much broader than its surveys; there are over 2,000 analytical reports that convey information about the science and engineering enterprise to the general public.
In common with many federal agencies, the NCSES serves many different communities with its data, each with its own set of needs and search criteria. These include the National Science Board; Congress, which mandates the production of certain reports; the research community (which can include hundreds of different disciplines); and many other data user communities. In the last case, many data assets produced by NCSES are relevant to state and local governments, universities, businesses, and academic institutions, which may not rely on such assets solely for purposes of research. There is great practical interest in understanding progress in new fields such as Artificial Intelligence, High Performance Computing, Quantum Technology, Advanced Manufacturing, Cybersecurity, Biotech, Advanced Energy Efficiency, and Material Science (Office of Science and Technology Policy, 2022). The broader data user community uses NCSES data to inform efforts such as program planning, workforce planning, investments, and other activities (National Academies of Sciences, Engineering, and Medicine, 2022). In many ways, then, NCSES is a canonical use case for an automated approach, for multiple reasons: the variety and number of NCSES data assets, the heterogeneity of the NCSES user base, and NCSES’s limited staffing resources.
As noted in the previous section, a major challenge is that NCSES mission areas are not scientific research fields as they are conceptualized in any current classification system. They also are not industries, at least not in any way we currently conceptualize them.
To generate data labels, the Democratizing Data project used the All Science Journal Classification (ASJC) (Elsevier, 2023) and Science-Metrix Classification (Rivest et al., 2021) categories (Figure 1a, Figure 1b). These have the advantage of being understandable and grounded in a well-understood framework; the disadvantage, again, is that they do not map to NCSES’s mission categories. A third approach was to use NCSES’s Taxonomy of Disciplines (ToD), which, while not mapping to NCSES’s mission categories, contextualized the research publications using NCSES data in a meaningful way, using the same definitions of research fields that NCSES uses to categorize scientific research areas (Figure 1c). By using the ToD, the data producers (NCSES) could understand which scientific disciplines are using their data in research.
Figure 1. The National Center for Science and Engineering Statistics (NCSES) dashboard from the Democratizing Data pilot, illustrating three different approaches to showing topics for users: (1a) ASJC fields; (1b) Science-Metrix subfields; (1c) NCSES Taxonomy of Disciplines.
But the goal of data labels is to characterize both mission areas and new technologies. Take the broad example of Higher Education R&D, which is a critical NCSES mission area but also reflects new technologies, like AI, quantum computing, and the like. AI in particular has enormous importance for science and engineering policymaking, yet it is extraordinarily difficult to categorize (Lane, 2023). Research and development areas like AI are institutional fields: they are recognizable arenas for collaborative and competitive work done by diverse sets of people and organizations in and through evolving networks (Owen-Smith, 2024). Conceptualizing technology areas as research fields and emphasizing points of similarity provides a solid, general basis for designing, training, validating, and explaining a classification model to generate labels using machine learning.
So, one approach to classifying data sets into different mission areas or research fields would be to use as features both the terms used in the publications in which the data sets are found and the authors who do research in the specific areas. Such an approach in the case of AI would be consistent with the operational definition proposed by the AI 100 Year Study Panel, which was that “AI can also be defined by what AI researchers do” (Stone et al., 2016, p. 13), mirroring the discussion of the National AI Research Resource Task Force, which also struggled with defining AI (Office of Science and Technology Policy, 2023).
Such a framework would be operationalized as a classification model that includes as features both the terms that are closely related to AI and the people working in AI. A supervised model might be written as in Equation 1.
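A minimal sketch of such a model, under the assumption that the label is expressed as a function of term and people features, is

$$\Pr(C_{d} = 1) = f(T_{d}, P_{d}),$$

where, for a given data set $d$, $C_{d}$ is the binary label, $T_{d}$ is the vector of term features drawn from the publications in which $d$ is used, $P_{d}$ is the vector of people (author) features, and $f$ is a supervised learner such as a logistic regression.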
Here the training data might consist of a set of well-known AI data sets, as well as non-AI data sets. In such a case, C would be a binary measure of whether or not a particular data set is known to be used for AI research (ImageNet would be an example of such a data set), T would be the set of terms frequently used in publications that use ImageNet, and P would be the AI researchers (like Fei-Fei Li) who have published those articles (Qi et al., 2020).
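To make the shape of such a model concrete, the sketch below is a hypothetical illustration rather than the pilot’s implementation; the data set texts, author tokens, and labels are invented placeholders. It combines term features (T) and people features (P) into a single supervised classifier for the binary label C.

```python
# Hypothetical sketch: classify data sets as AI / non-AI from (a) the terms in
# publications that use them and (b) the authors of those publications.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# One row per data set: concatenated publication text (T), author tokens (P),
# and a binary label (C) for data sets whose use is already known.
train = pd.DataFrame({
    "terms":   ["deep learning image classification convolutional network",
                "household food insecurity survey weights imputation"],
    "authors": ["author_a author_b", "author_c"],
    "is_ai":   [1, 0],
})

features = ColumnTransformer([
    ("terms",   TfidfVectorizer(), "terms"),    # term features T
    ("authors", TfidfVectorizer(), "authors"),  # people features P
])
model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(train[["terms", "authors"]], train["is_ai"])

# Score an unlabeled data set by the publications and authors that use it.
new = pd.DataFrame({"terms": ["transformer language model benchmark"],
                    "authors": ["author_a"]})
print(model.predict_proba(new[["terms", "authors"]])[:, 1])
```

In practice, the term features would be drawn from the full corpus of publications linked to each data asset, and the people features from disambiguated author identifiers rather than raw name tokens.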
The advantage of such a classification approach is that the most common terms that AI researchers use can change rapidly—for example, from neural networks, to deep learning, to transformers, to large language models—while the people doing the work are much more persistent (Owen-Smith, 2024). Indeed, the term ‘data set’ itself may become archaic, because it is becoming more common for researchers to find more utility in combining or linking data sets produced from different sources, rather than relying on one individual data set to answer a particular research question (Abowd et al., 2004; Chang et al., 2022; Jones et al., 2022). Data producers and data users can, of course, be the same person in such a context, and the distinction is only a function of the way in which they engage with a particular data set during its life cycle.
A key lesson learned from the pilot has been the recognition that the development of taxonomies, nomenclature, and topic categorizations will be an iterative and ever-changing process. The people who use data sets to address particular research questions will change. The terms that they use to describe the data assets of different federal agencies will also change, as will their relationships to each other, both as science progresses and as the audience for these data assets grows and shifts. In other words, the people and areas of research will change over time, and the taxonomies need to be flexible. Emerging research also suggests that a search approach that uses people in addition to terms to label fields may be less volatile than an approach based solely on terms (Hausen et al., 2023). Just as Aristotle’s first set of classifications was based on classifying beings into groups, it may be that a new framework can be based on the notion that a mission or technology is something that researchers do and is described by the words that they use.
The pilot certainly made it clear that there is no ‘one-size-fits-all’ approach to developing data set labels. Different communities have different interests and ways of communicating; it will be critical to engage all relevant communities, including those who have historically been excluded from the development of classifications.
A possible research agenda would be to employ natural language processing, semantic analysis, and machine learning for a variety of use cases to develop different sets of metadata topic labels for different needs. In the case of individual agencies, a first step could be to generate a set of semantic characterizations of the mission areas of each agency or group of agencies, as well as to identify the scientists who are doing work in each area. The next steps would be to develop a training data set that matches the features associated with each data asset (from the linked publication corpus) and estimate a set of supervised machine learning models to classify data assets into a variety of different categories depending on evolving user needs.
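As an illustration of what that first step might look like, the sketch below is a hypothetical example (the mission-area descriptions and publication abstracts are invented, and real pipelines would use richer semantic models). It matches publication abstracts to semantic characterizations of mission areas with TF-IDF vectors and cosine similarity, producing candidate labels that could seed the training data for the supervised models described above.

```python
# Hypothetical sketch: seed training labels by matching publications to
# semantic characterizations of agency mission areas.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

mission_areas = {
    "Science and Engineering Workforce":
        "employment, education, and occupations of the science and engineering workforce",
    "Research and Development":
        "expenditures on basic research, applied research, and experimental development",
}
publication_abstracts = [
    "We analyze doctoral degree recipients entering the science and engineering workforce.",
    "Business expenditures on experimental development grew faster than basic research in 2021.",
]

vectorizer = TfidfVectorizer(stop_words="english")
# Fit on both corpora so mission areas and abstracts share one vocabulary.
matrix = vectorizer.fit_transform(list(mission_areas.values()) + publication_abstracts)
mission_vecs = matrix[: len(mission_areas)]
pub_vecs = matrix[len(mission_areas):]

# Similarity of each abstract to each mission area; high-scoring pairs become
# candidate (data asset, mission label) examples for supervised training.
scores = cosine_similarity(pub_vecs, mission_vecs)
for abstract, row in zip(publication_abstracts, scores):
    best = int(row.argmax())
    print(f"{list(mission_areas)[best]:35s} {row[best]:.2f}  {abstract[:45]}...")
```

Community workshops and expert review, discussed next, would then refine both the mission-area characterizations and the resulting labels.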
Communities could be involved at all stages of the classification process. Initial seeds could be generated through user group workshops and interviews to capture the appropriate terminology and phraseology used when describing topic areas. This would enable classification of data sets into areas that are recognizable and useful to the user community and is indeed recommended in the National Academies reports. These workshops and interviews could serve a dual purpose of reviewing agency and group missions, stated priorities, charters, and other descriptive text to ensure a comprehensive analysis. Of course, since there is no consensus about the objective truth of many underlying constructs, it is likely that the process will be one of compromise and consensus among experts who all have potentially different opinions on how to structure and demarcate their fields of research. Human–computer interaction tools could be built to expand and automate the community input process.
Government agencies could use the same approach to categorize data sets that can be used to address crosscutting questions about the issues identified in the introduction: food security, climate change, advanced manufacturing, synthetic biology, or artificial intelligence. Even though agencies do not have the billions of dollars that are at the disposal of the private sector, they can combine forces to effect change. Indeed, the Evidence Act established an organizational structure that could be used to take a leadership role in developing such tools. The Office of Management and Budget, in conjunction with the Interagency Council on Statistical Policy, the CDO Council, and agency statistical officials, could initiate pilot projects to test the feasibility of these and other approaches using the types of technologies and tools now available. Such an approach would be fully consistent with the recommendations of the Advisory Committee on Data for Evidence Building (2022). It would create new foundations for the access, use, and reuse of data for evidence-based policymaking and provide a stronger basis for the reproducible and reusable science envisioned by Congress and the Administration (Holdren, 2013; Nelson, 2022).
This material is based upon work supported by the National Science Foundation under Contract Number 49100422C0028. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Abowd, J. M., Haltiwanger, J., & Lane, J. (2004). Integrated longitudinal employer-employee data for the United States. American Economic Review, 94(2), 224–229. https://doi.org/10.1257/0002828041301812
Advisory Committee on Data for Evidence Building. (2022, October 14). Year 2 report. https://www.bea.gov/system/files/2022-10/acdeb-year-2-report.pdf
America COMPETES Reauthorization Act of 2010, 42 U.S.C. § 1862 (2011). https://www.congress.gov/111/plaws/publ358/PLAW-111publ358.pdf
Aslan, Y., Yaqub, O., Rotolo, D., & Sampat, B. N. (2023). Cross-category spillovers in medical research. SocArXiv. https://doi.org/10.31235/osf.io/hpmxd
Avery, D. R., Ruggs, E. N., Garcia, L. R., Traylor, H. D., & London, N. (2022). Improve your diversity measurement for better outcomes. MIT Sloan Management Review, 64(1), 1–6. https://sloanreview.mit.edu/article/improve-your-diversity-measurement-for-better-outcomes/
Berg-Cross, G., Ritz, R., & Wittenburg, P. (2015). RDA Data Foundation and Terminology-DFT: Results RFC. Research Data Alliance.
Berman, F., & Crosas, M. (2020). The Research Data Alliance: Benefits and challenges of building a community organization. Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.5e126552
Buneman, P., Christie, G., Davies, J. A., Dimitrellou, R., Harding, S. D., Pawson, A. J., Sharman, J. L., & Wu, Y. (2020). Why data citation isn't working, and what to do about it. Database, 2020, Article baaa022. https://doi.org/10.1093/databa/baaa022
Chang, W.-Y., Garner, M., Basner, J., Weinberg, B., & Owen-Smith, J. (2022). A linked data mosaic for policy-relevant research on science and innovation: Value, transparency, rigor, and community. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.1e23fb3f
Chief Data Officer Council. (2022a). Data Sharing Working Group: Findings & recommendations. https://resources.data.gov/assets/documents/2021_DSWG_Recommendations_and_Findings_508.pdf
Chief Data Officer Council. (2022b). Enterprise data inventories. https://resources.data.gov/assets/documents/CDOC_Data_Inventory_Report_Final.pdf
Clarivate. (2023). Web of Science research areas. https://help.prod-incites.com/inCites2Live/filterValuesGroup/researchAreaSchema/wosDetail.html
CHIPS and Science Act, Pub. L. No. 117–167, 136 Stat. 1366 (2022). https://www.govinfo.gov/content/pkg/PLAW-117publ167/html/PLAW-117publ167.htm
Dimensions. (2022). Which research categories and classification schemes are available in Dimensions? https://plus.dimensions.ai/support/solutions/articles/23000018820-which-research-categories-and-classification-schemes-are-available-in-dimensions-
Elsevier. (2023, September 21). What are Scopus subject area categories and ASJC codes? https://service.elsevier.com/app/answers/detail/a_id/12007/supporthub/scopus/
Emecz, A., Mitschang, A., Zdawczyk, C., Dahan, M., Baas, J., & Lemson, G. (2024). Turning visions into reality: Lessons learned from building a search and discovery platform. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.d8a3742f
Exec. Order No. 14110, 3 C.F.R. 75191 (2023). https://www.govinfo.gov/content/pkg/FR-2023-11-01/pdf/2023-24283.pdf
Foundations for Evidence-Based Policymaking Act of 2018, Pub. L. No. 115-435, 132 Stat. 5529 (2019). https://www.congress.gov/bill/115th-congress/house-bill/4174
Galindo-Rueda, F., & López-Bassols, V. (2022). Implementing the OECD Frascati manual (Working Paper No. 2022/03). https://doi.org/10.1787/d686818d-en
Garfield, E. (1963). Citation indexes in sociological and historical research. American Documentation, 14(4), 289–291. https://doi.org/10.1002/asi.5090140405
General Services Administration, Office of Government and Information Services, and Office of Management and Budget. (2023a). DCAT-US Schema v1.1 (Project Open Data Metadata Schema). Resources.data.gov. https://resources.data.gov/resources/dcat-us/
General Services Administration, Office of Government and Information Services, and Office of Management and Budget. (2023b). Field mappings. Resources.data.gov. https://resources.data.gov/resources/podm-field-mapping/#field-mappings
Goodey, G., Hahnel, M., Zhou, Y., Jiang, L., Chandramouliswaran, I., Hafez, A., Paine, T., Gregurick, S., Simango, S., & Peña, J. M. P. (2022). The State of Open Data 2022. Digital Science. https://doi.org/10.6084/m9.figshare.21276984.v5
Gross, P. L., & Gross, E. M. (1927). College libraries and chemical education. Science, 66(1713), 385–389. https://doi.org/10.1126/science.66.1713.385
Hanisch, R., Kaiser, D. L., Yuan, A., Medina-Smith, A., Carroll, B. C., & Campo, E. (2023). NIST Research Data Framework (RDaF): Version 1.5. NIST Special Publication 1500-18r1. National Institute of Standards and Technology.
Hanisch, R. J., Kaiser, D. L., Carroll, B. C., Higgins, C., Killgore, J., Poster, D., & Merritt, M. (2021). Research Data Framework (RDaF): Motivation, development, and a preliminary framework core. NIST Special Publication 1500-18. National Institute of Standards and Technology.
Hausen, R., Lane, J., Lemson, G., & Zdawczyk, C. (2023). A person-based approach to characterizing AI research. Institute for Data Intensive Engineering and Science.
Haver, M. A. (1997). The statistics corner: The NAICS is coming. Will we be ready? Business Economics, 32(4), 63–65.
Holdren, J. P. (2013). Increasing access to the results of federally funded scientific research [Memo]. U.S. Office of Science and Technology Policy. https://obamawhitehouse.archives.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf
Holland, M., & Lane, J. (2018). Policy advisory committees: An operational view. In J. A. Hird (Ed.), Policy analysis in the United States (pp. 173–182). Policy Press.
Hughes, L. D., Tsueng, G., DiGiovanna, J., Horvath, T. D., Rasmussen, L. V., Savidge, T. C., Stoeger, T., Turkarslan, S., Wu, Q., & Wu, C. (2023). Addressing barriers in FAIR data practices for biomedical data. Scientific Data, 10(1), Article 98. https://doi.org/10.1038/s41597-023-01969-8
Inflation Reduction Act, Pub. L. No. 117–169, 136 Stat. 1822 (2022). https://www.congress.gov/117/plaws/publ169/PLAW-117publ169.pdf
Infrastructure Investment and Jobs Act, Pub. L. No. 117–58, 135 Stat. 429 (2021). https://www.congress.gov/117/plaws/publ58/PLAW-117publ58.pdf
International Organization for Standardization. (2019). Information and documentation — The Dublin Core metadata element set — Part 2: DCMI Properties and classes (ISO Standard No. 15836-2:2019). https://www.iso.org/standard/71341.html
Jones, C., McDowell, A., Galvin, V., & Adams, D. (2022). Building on Aotearoa New Zealand’s integrated data infrastructure. Harvard Data Science Review, 4(2). https://doi.org/10.1162/99608f92.d203ae45
King, G. (2007). An introduction to the Dataverse network as an infrastructure for data sharing. Sociological Methods & Research, 36(2), 173–199. https://gking.harvard.edu/files/dvn.pdf
Klochikhin, E., & Boyd-Graber, J. (2020). Text analysis. In I. Foster, R. Ghani, R. S. Jarmin, F. Kreuter, & J. Lane (Eds.), Big data and social science (pp. 193–219). Chapman and Hall/CRC.
Klochikhin, E., & Lane, J. I. (2017). Identifying food safety-related research. In K. Husbands Fealing, J. I. Lane, J. L. King, & S. R. Johnson (Eds.), Measuring the economic value of research: The case of food safety (pp. 69–84). Cambridge University Press.
Lambe, P. (2012). Knowledge organisation systems as enablers to the conduct of science. In A. Gilchrist & J. Vernau (Eds.), Facets of knowledge organization: Proceedings of the ISKO UK Second Biennial Conference 4th-5th July 2011, London (pp. 261–280). Emerald Group Publishing.
Lambe, P. (2007). Organising knowledge: Taxonomies, knowledge and organisational effectiveness. Elsevier.
Lane, J. (2023, June 9). The industry of ideas: Measuring how artificial intelligence changes labor markets. American Enterprise Institute. https://www.aei.org/research-products/report/the-industry-of-ideas-measuring-how-artificial-intelligence-changes-labor-markets/
Lane, J., & Boyd-Graber, J. (Principal Investigators). (2014–2017). Scaling insight into science: Assessing the value and effectiveness of machine assisted classification within a statistical system (Project Nos. 1422492, 1422902, 1557745, & 1423706) [Grant]. National Science Foundation National Center for Science & Engineering Statistics. https://www.nsf.gov/awardsearch/showAward?AWD_ID=1422902, https://www.nsf.gov/awardsearch/showAward?AWD_ID=1557745
Lane, J., Spector, A. Z., & Stebbins, M. (2024). An invisible hand for creating public value from data. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.03719804
Madray, H. (2023). Standard application process pilot lessons learned report. Resources.data.gov. https://resources.data.gov/assets/documents/SAP_Lessons_Learned.pdf
Measure, A. (2017). Deep neural networks for worker injury autocoding. U.S. Bureau of Labor Statistics. https://www.bls.gov/iif/automated-coding/deep-neural-networks.pdf
Measure, A. (2023). Six years of machine learning in the Bureau of Labor Statistics. In G. Snijkers, M. Bavdaž, S. Bender, J. Jones, S. MacFeely, J. W. Sakshaug, K. J. Thompson, & A. van Delden (Eds.), Advances in business statistics, methods and data collection (pp. 561–572). Wiley. https://doi.org/10.1002/9781119672333.ch24
National Academies of Sciences, Engineering, and Medicine. (2022). Transparency in statistical information for the National Center for Science and Engineering Statistics and all federal statistical agencies. https://doi.org/10.17226/26360
National Center for Science and Engineering Statistics. (2023a). The National Secure Data Service Demonstration project. https://ncses.nsf.gov/about/national-secure-data-service-demo
National Center for Science and Engineering Statistics. (2023b). Standard application process. https://ncses.nsf.gov/about/standard-application-process
Nelson, A. (2022). Ensuring free, immediate, and equitable access to federally funded research [Memo]. White House Office of Science and Technology Policy, Executive Office of the President of the United States. https://www.whitehouse.gov/wp-content/uploads/2022/08/08-2022-OSTP-Public-access-Memo.pdf
Oehlert, C., Schulz, E., & Parker, A. (2022). NAICS code prediction using supervised methods. Statistics and Public Policy, 9(1), 58–66. https://doi.org/10.1080/2330443X.2022.2033654
Office of Science and Technology Policy. (2022). Critical and emerging technologies list update. https://www.whitehouse.gov/wp-content/uploads/2022/02/02-2022-Critical-and-Emerging-Technologies-List-Update.pdf
Office of Science and Technology Policy. (2023). Strengthening and democratizing the U.S. artificial intelligence innovation ecosystem. https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf
Organisation for Economic Co-operation and Development. (2015). Frascati manual 2015: Guidelines for collecting and reporting data on research and experimental development. OECD Publishing. https://doi.org/10.1787/9789264239012-en
Owen-Smith, J. (2024, March 18). Will the real AI researcher please stand up? Fields, networks and systems to measure the impact of research investments [Conference session]. Workshop on New Approaches to Characterize Industries: AI as a Framework and a Use Case, Stanford, CA, United States. https://www.digeconevents.com/
Parsons, M. A. (2013). The Research Data Alliance: Implementing the technology, practice and connections of a data infrastructure. Bulletin of the American Society for Information Science and Technology, 39(6), 33–36. https://doi.org/10.1002/bult.2013.1720390611
Potok, N. (2024). Data usage information and connecting with data users: U.S. mandates and guidance for government agency evidence building. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.652877ca
Price, D. d. S. (2011). Networks of scientific papers. In The structure and dynamics of networks (pp. 149–154). Princeton University Press.
Puebla, I., & Lowenberg, D. (2024). Building trust: Data metrics as a focal point for responsible data stewardship. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.e1f349c2
Qi, L., He, Q., Chen, F., Zhang, X., Dou, W., & Ni, Q. (2020). Data-driven web APIs recommendation for building web applications. IEEE Transactions on Big Data, 8(3), 685–698. https://doi.org/10.1109/TBDATA.2020.2975587
Rivest, M., Vignola-Gagné, E., & Archambault, É. (2021). Article-level classification of scientific publications: A comparison of deep learning, direct citation and bibliographic coupling. PLoS ONE, 16(5). https://doi.org/10.1371/journal.pone.0251493
Singh, V. K., Singh, P., Karmakar, M., Leta, J., & Mayr, P. (2021). The journal coverage of Web of Science, Scopus and Dimensions: A comparative analysis. Scientometrics, 126, 5113–5142. https://doi.org/10.1007/s11192-021-03948-5
Stone, P., Brooks, R., Brynjolfsson, E., Calo, R., Etzioni, O., Hager, G., Hirschberg, J., Kalyanakrishnan, S., Kamar, E., Kraus, S., Leyton-Brown, K., Parkes, D., Press, W., Saxenian, A., Shah, J., Tambe, M., & Teller, A. (2016). Artificial intelligence and life in 2030. One Hundred Year Study on Artificial Intelligence: Report of the 2015–2016 Study Panel. http://ai100.stanford.edu/2016-report
Studtmann, P. (2021). Aristotle’s categories. In E. N. Zalta (Ed.), The Stanford encyclopedia of philosophy. https://plato.stanford.edu/archives/spr2021/entries/aristotle-categories/
Talley, E. M., Newman, D., Mimno, D., Herr, B. W., Wallach, H. M., Burns, G. A., Leenders, A. M., & McCallum, A. (2011). Database of NIH grants using machine-learned categories and graphical clustering. Nature Methods, 8(6), 443–444. https://doi.org/10.1038/nmeth.1619
Vance, A. (2011, April 14). This tech bubble is different. Bloomberg.com. https://www.bloomberg.com/news/articles/2011-04-14/this-tech-bubble-is-different#xj4y7vzkg
Vardigan, M., Heus, P., & Thomas, W. (2008). Data documentation initiative: Toward a standard for the social sciences. International Journal of Digital Curation, 3(1), 107–113. https://doi.org/10.2218/ijdc.v3i1.45
Wilkinson, M. D., Dumontier, M., Aalbersberg, Ij. J., Appleton, G., Axton, M., Baak, A., Blomberg, N., Boiten, J.-W., da Silva Santos, L. B., Bourne, P. E., Bouwman, J., Brookes, A. J., Clark, T., Crosas, M., Dillo, I., Dumon, O., Edmunds, S., Evelo, C. T., Finkers, R., … Mons, B. (2016). The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data, 3(1). https://doi.org/10.1038/sdata.2016.18
Yuskavage, R. E. (2007, November 5–7). Converting historical industry time series data from SIC to NAICS [Paper presentation]. Federal Committee on Statistical Methodology 2007 Research Conference, Arlington, VA, United States. https://www.bea.gov/system/files/papers/P2007-7.pdf
©2024 Christina Zdawczyk, Julia Lane, Emilda Rivers, and May Aydin. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.