
Turning Visions Into Reality: Lessons Learned From Building a Search and Discovery Platform

Published on Apr 02, 2024

The Democratizing Data Initiative looks to demonstrate the impact of the data assets produced by U.S. federal agencies. In this article we outline the process flows and cyberinfrastructure that support the achievement of that objective. We describe how the focus on data assets evolved from the initial search for data sets to the challenges of finding how the data assets have been used by the research community. In this regard, we explain both why we need to define a search corpus of full-text records and the process by which that search corpus is created. We then explain the process of applying machine learning (Kaggle) algorithms to that full-text search corpus and the further steps we take to refine the outputs generated by those algorithms, including the manual validation by subject matter experts that we employ to minimize false positive data set identification in the final outputs. Once validation is complete, we explain how the outputs are made available to diverse stakeholders—agency staff, funding partners, general public—through a variety of channels including REST APIs, dashboards, SciServer, Jupyter Notebooks, and direct database access. We conclude by looking at potential future enhancements to the process and possible research directions.

Keywords: data assets, impact, full-text search, machine learning, Kaggle

Media Summary

A challenge for most if not all federal agencies is how to demonstrate the impact of their work and the value they have provided in their use of taxpayers’ money. A specific example of this concerns the usage that is made of the data that such agencies generate in the course of their work, for example, either as a result of surveying their communities or from processes of automated collection of transactions and other data points. This data is made available by agencies in a variety of ways, including as raw data or as reports, working papers, or even newsletters. We describe these collectively as ‘data assets.’

While the impact and usage of such data assets will be distributed across many different stakeholders in society, one important stakeholder group is the research community. In this article, we describe the process we have employed as part of the Democratizing Data Initiative to define data assets and to demonstrate the use of those data assets by that research community. Our focus has been data assets from selected U.S. federal agencies. We explain the data source that forms the base of the search and why it is necessary to search full text in order for us to be sure we are capturing usage. We use the term ‘search corpus’ to define the population of records we will search.

Once we have defined the search corpus, we first apply machine learning algorithms and then undertake further refinements to ensure the accuracy of our reported outcomes. A critical step in this is the use of subject matter experts to validate the results. They look specifically for errors in the outputs, such as records the model has erroneously identified as one of the data assets we are interested in.

As part of the process, we have provided agency staff and their communities with a range of tools by which they can interrogate and explore the output from our models. This helps build trust and confidence in the results but more importantly also ensures that maximum use can be made of the results we have generated.

Finally, we conclude by looking at further improvements that we may be able to make to the process to improve the accuracy and utility of the results we are generating.

1. Introduction

The Democratizing Data Initiative, at its core, concerns the provision of information about the usage of U.S. federal data assets across various users of these assets. The rationale that underlies the interest in the initiative is explored in more detail in other articles in this special edition (see Nagaraj, 2024; also, Lane et al., 2024) but will be briefly summarized here.

Nagaraj describes data as the “lifeblood of decision-making in the digital age” and notes that public data (data collected, funded, or made public by government institutions) represents a key resource for society. He notes the reliance of the research community on publicly available, large-scale, centralized data sets to discover new insights in many research fields.

In generating public data, agencies have to balance the value of the data with the costs associated with generating it. However, as Nagaraj also notes, valuing public data presents a particular challenge for several reasons: it is difficult to trace who has used the data; the ‘chain reaction’ of inputs into a decision can obscure which data actually informed it; and the user base for the data is highly diverse.

Despite such challenges, there are new opportunities to develop better signals about public data and new motivations to do so. Lane et al. (2024) describe the legislative mandates that provide an impetus for agencies to ensure such opportunities are explored and taken. They note specifically that Title II of the Foundations for Evidence-Based Policymaking Act (2019) requires agencies to, among other things, provide information about how their data are used by the public, to produce evidence, and to designate responsible officials. Researchers and publishers face similar legislative mandates. For example, in August 2022, the White House Office of Science and Technology Policy (OSTP) issued the “Nelson memo” providing explicit timelines for federally funded scientific researchers and the publishers of federally funded research to share publicly funded data and research.

In this article, we describe the process that has been developed to identify signals regarding the value or impact of public data within the science and technology research community. In particular, we look to understand the reach of the data assets within the community, both in terms of the number of researchers using the data in their research outputs (e.g., articles, books, conference papers) and the areas of interest of those researchers (i.e., as evidenced by the research Topic within which a relevant research output was classified). Specifically, this article focuses on the technical processes that have been employed in what we consider the first phase of the Democratizing Data Initiative, a phase that is designed to enable us to improve and enhance those technical processes. The article therefore describes the process workflows and cyberinfrastructure—the software, computer systems, and data flow—that form the foundation of the Democratizing Data Initiative. We hope that the work described here generates interest in improving the approach and have ourselves identified possible areas for further research.

In running the process, the technical objective is to find a data asset–publication pair—a dyad—that links a data asset to a research publication in which it is mentioned. Once a dyad is identified, it is possible to then link in the metadata from the research publication in which it is mentioned.

An overview of the process is provided in Figure 1. The subsequent sections are structured to reflect the main process steps, specifically ‘data set definition’ (Section 2), ‘search corpus generation’ (Section 3), ‘search’ (Section 4), ‘validation’ (Section 5), and ‘production API and production database with agency specific results’ (Section 6).

We describe the cyberinfrastructure that has been developed to facilitate these goals in a manner that is computationally efficient, flexible, and accountable as production flows with feedback loops from each partner agency at critical points in the process. Finally, we turn to the current limitations of this process and our vision for next steps.

Figure 1. Process overview diagram showing the flow of information between functional units. The yellow outlined group at the top is work-in-progress at the time of writing. The dotted group below signifies a section that would typically be undertaken periodically in a production environment. The dotted edges represent steps that are desired outcomes, but that at the time of writing have not yet been implemented.

2. Process Description: Data Sets Definition

The data flow starts with a partner agency describing data assets of interest (‘target data assets’). Agencies provide both canonical names of data assets and known alternate names for the same data asset (within this context, referred to as an ‘alias’). We describe the evolution in approach here.

In the early projects of the first phase, our focus concerned searching for data sets (i.e., collections of data) that were specifically selected by the agencies as being of interest. While each of these data sets tended to have a formal ‘canonical’ name, agency staff emphasized that, in practice, a range of alternatives (including acronyms) were often employed by the data set users. We termed these alternatives ‘aliases’ and included them in our search. To take two examples:

  • For the National Agricultural Statistics Service (NASS) of the U.S. Department of Agriculture (USDA), the canonical named data set “NASS Census of Agriculture” was included with aliases “USDA Census of Agriculture,” “USDA Census,” “Census of Agriculture,” “Agricultural Census,” and “AG Census.”

  • For the National Center for Science and Engineering Statistics (NCSES) of the National Science Foundation (NSF), the canonical named data set “Higher Education R&D Survey” was included with aliases “Higher Education R&D,” “Higher Education Research and Development Survey,” “R&D Personnel at Colleges and Universities,” and “HERD.”

Following these first projects, and based on the reviews conducted by agency staff of the search results being obtained, it was noted that in many cases users might not be referencing the actual data set but, instead, could be referencing materials (such as reports, working papers, and news items) that directly derived from the data set and either presented or summarized it. For example, the data set “Higher Education R&D Survey” could be referenced by the community via various NSF ‘infobriefs’ and ‘infocharts,’ via working papers such as “Trends in the Relationship between U.S. Academic Scientific Publication Output and Funding and Personnel Inputs: 1988–2011,” as well as via the survey’s data tables themselves.

This realization, based on the practical experiences associated with search outputs, resulted in our introducing a new concept, that of a ‘data asset,’ and extending the search process to include it. In this regard, the term ‘data asset’ was deemed to include direct references to the underlying data in the data set as well as other published material that users could employ when using that data set.

In addition to the canonical data asset names and aliases, the partner agency also provides any other parameters that may be used to define the later search routines, in particular identifying the analytical period of interest. Typically, searches were conducted for either the period 2012 to 2022 or for the period 2017 to 2022.

3. Process Description: Search Corpus Generation

3.1. Conceptual Overview

Once the data assets and aliases have been defined, we need to identify a set of documents within which to conduct the final search for data asset mentions (i.e., the ‘search space’). The core data sources employed for conducting the search are ScienceDirect and then Scopus (described later in this section).

The first step is to create a seed corpus; the seed corpus is then used to infer the types of publications in which these data assets are most likely to be mentioned, and those inferences define the final search corpus. The search space represents a subset of the records that would theoretically be available to search.

There are several factors that motivate the restriction of the search space. First, the traditional bibliometric approach to measuring research ‘use’ leverages standardized reference lists and citations to understand the impact of previous work on another research project. However, while an increasing number of research publications in certain fields cite data sets in a standard way, this has not been adopted as uniform practice in the research community. Searches of publication metadata (e.g., the publication reference list) alone are therefore not sufficient to assess use, and a much more labor- and compute-intensive full-text search is required to ensure adequate coverage (Khan et al., 2021).

Second, the search space is currently also restricted due to the computational challenge associated with the intensity of full-text search, given the large number of documents for which full text is available. As of its 2022 roadmap, Scopus had indexed more than 87 million items from more than 7,000 publishers (McCullough, 2022). When assessing the subset from the last decade, a large majority have full text available for searching for mentions of data assets. The number of documents with searchable full text is thus too large for current algorithms to process efficiently and in a timely manner, and steps are therefore taken to restrict the search space.

Third, the search space is restricted to balance recall (our aim to find as many occurrences of the data set as possible) and precision (our aim to accurately identify occurrences and generate results amenable to human validation). A specific objective in creating the search corpus is therefore not just to reduce the size of the search space but also to concentrate it on the text most relevant to the given problem. This is important to reduce the rate of false positive identifications, particularly where data asset names or aliases are short or otherwise ambiguous.

Table 1 illustrates the point by contrasting two of the aliases in our search that were associated with the NCSES Higher Education R&D Survey.

Table 1. Illustrative example of a data asset with validation results.

Data Asset / Alias Name | Number Actively Validated | Number Found to Be a True Positive
Higher Education Research and Development Survey | – | –
HERD | 1,308 | 8
The HERD example clearly shows the difficulties associated with searches that include a short acronym or one that is a homograph of a commonly used but different concept. Of the 1,308 matches that were validated by subject matter experts, only 8 (0.6%) were found to be true positives. In contrast, the subject matter experts found all the matches associated with the “Higher Education Research and Development Survey” to be true positives.

In addition to the issue of false positives, the table also demonstrates the level of work required for validation. No matter how attractive achieving high levels of recall may be, that objective rapidly becomes prohibitive in terms of both cost (subject matter expert effort) and time in those cases where validation of large numbers of individual dyads is required.

3.2. Overview of Using Scopus as the Core Data Source

Having illustrated the necessity to build a search corpus that can balance recall, precision, cost, and time, we can turn our attention to describing Scopus—the main data source employed for the search corpus.

Scopus is a large, curated, generalist abstract and citation database of scientific literature including scientific journals, books, and conference proceedings (Aghaei Chadegani et al., 2013; Oliveira et al., 2018). Elsevier reports that, on average, around 11,000 new records are indexed every day (McCullough, 2022). The main alternative to Scopus for those engaging in citation analysis is Clarivate’s Web of Science (WoS), and both Scopus and WoS are widely used for bibliometric analysis. There are, of course, limitations associated with the coverage of these abstract and citation databases (see Archambault & Larivière, 2010; Mongeon & Paul-Hus, 2016; Science-Metrix, 2016). Some key points are highlighted here:

  • Focus on peer-reviewed sources. Both WoS and Scopus primarily index scholarly and research-related documents, most of which are peer reviewed. For the current project within Democratizing Data, with its scope linked specifically to impact as measured by use within the research community, this limitation is less of a concern than would be the case with other communities.

  • Focus on journals. The databases focus predominantly on journals, with less coverage of other means of scientific knowledge diffusion such as books, conference papers, and nontraditional research outputs (reports, live performances, recorded work, events, etc.). This focus means there may be a bias in coverage toward those disciplines (natural sciences, health sciences, engineering) that favor journals as a dissemination channel. The social sciences produce a greater proportion of publications that are not journal articles—especially books. This phenomenon is even more pronounced in the humanities.

  • Regional coverage (and focus on English). Another consideration is the more local orientation of research in the social sciences and arts & humanities (SSH). As a result, the target readership in these subjects is more often limited to a particular country or region. Consequently, SSH scholars tend to publish more frequently in a language other than English—and in journals with a national distribution rather than international distribution.

For the Democratizing Data project, all source types (serial and nonserial) and all document types (Elsevier, 2023b) within Scopus are included in the analysis. In this article, and in the Democratizing Data project generally, these types are often collectively referred to as either publications or research outputs.

In terms of the search, all the data fields in the commercially available Scopus product are available to use, including titles, abstracts, and, importantly, the reference section of the research records. In addition, the Scopus database contains data that is not available in the commercial product but is available (through ownership or license) and can be used for specific use cases, including those of the Democratizing Data Initiative. Scopus draws on Elsevier’s own publishing of research, but also benefits from contributions by over 7,000 publishers, many of which license full text to Elsevier for defined periods of time for searching and indexing important information that is used in research and evaluation of research. There are some licensing restrictions associated with the full-text search, for example publications by Springer Nature, amongst others, are excluded from these full-text search routines.

Table 2 provides a summary of full-text records in Scopus that are available and searchable based on the calendar year of publication during the period 2017 to 2021. While full-text records are available for earlier years, a smaller share of Scopus-indexed documents has full-text records available as one goes back in time. This table illustrates the possibilities in the search space for what have, to date, been found to be the typical periods of interest from the agencies.

Table 2. Full text records available and searchable based on the calendar year of publication during the period 2017 to 2022.

Calendar Year | Scopus Records | Records With Full Text | Full Text Records Licensed and Available for Search | Licensed as % of Full Text Records | Licensed as % of Scopus Records

3.3. Generating the Seed Corpus

The process of creating the search corpus begins with the creation of a seed corpus. The purpose of the seed corpus is to define the parameters that can be used for creating the final search corpus. We create the seed corpus by text matching the canonical name for each data asset of interest and its aliases with full-text records in ScienceDirect that are within the agency-defined analytical period. ScienceDirect is a bibliographic database of scientific and medical publications, hosting over 19 million articles from more than 2,650 peer-reviewed journals and 43,000 ebooks (Elsevier, 2022). ScienceDirect is used to create the seed corpus because it represents a large source of systematically well-structured full-text research publication records and these records are readily available to the project team. The search itself involves running a notebook on Databricks, which distributes the search on a multi-worker Spark cluster. Each worker is a machine on the Amazon Web Services (AWS) cloud.
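As a concrete illustration of this matching step, the following minimal Python sketch mirrors the logic of the seed-corpus search. It is not the project's actual code: the production version runs as a Databricks notebook distributed over a Spark cluster, and the record fields and helper name here are assumptions for illustration.

```python
# Illustrative sketch of the seed-corpus matching step (hypothetical field
# names; the production run distributes this logic over a Spark cluster).

def build_seed_corpus(records, asset_names, year_start, year_end):
    """Return records whose full text mentions any target data asset name
    or alias, restricted to the agency-defined analytical period."""
    lowered = [name.lower() for name in asset_names]
    seed = []
    for rec in records:
        if not (year_start <= rec["year"] <= year_end):
            continue  # outside the analytical period of interest
        text = rec["full_text"].lower()
        if any(name in text for name in lowered):
            seed.append(rec)
    return seed

# Canonical name plus aliases, as supplied by the partner agency.
names = ["NASS Census of Agriculture", "Census of Agriculture", "AG Census"]
records = [
    {"id": 1, "year": 2019, "full_text": "We draw on the Census of Agriculture data ..."},
    {"id": 2, "year": 2019, "full_text": "An unrelated proteomics study ..."},
    {"id": 3, "year": 2009, "full_text": "Using the AG Census from 2007 ..."},  # too early
]
matches = build_seed_corpus(records, names, 2012, 2022)
```

In the production setting the same predicate is simply applied in parallel across partitions of the full-text record set.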

The search through ScienceDirect results in a set of research publication records potentially matched to the target data asset names and aliases. The metadata associated with these publications provides insight into what types of research are leveraging these data assets. Specifically, the ‘Topic’ of the research publication can be informative in understanding which fields are most likely to use these data assets.

Topics are a construct used by Elsevier for creating a collection of research outputs with a common intellectual interest (Klavans & Boyack, 2017). Topics can be large or small, new or old, growing or declining. A Topic is defined such that the direct citation linkages within the Topic are strong and the direct citation linkages outside the Topic are weak (Elsevier, 2023a). Once a Topic publication set is identified based on citation linkages, its text is analyzed for key phrases and the key phrases are ordered by relevance within the publication set. A term frequency–inverse document frequency (tf-idf) method (a natural language processing technique used to quantify a word’s relevance in a set of documents) is then used to extract the top three key phrases. Illustrative examples of Topics that were identified during the establishment of the NCSES data set search are shown in Table 3.
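The tf-idf ranking can be sketched as follows. This is a simplified, word-level illustration of the general technique, not Elsevier's actual key-phrase pipeline (which ranks multi-word key phrases mined from the publication set); the function name and scoring details are assumptions.

```python
import math
from collections import Counter

def top_terms_tfidf(docs, target_index, k=3):
    """Rank the terms of one document by tf-idf against a document set
    and return the top k. Word-level sketch of the general technique."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    # Term frequency within the target document.
    tf = Counter(tokenized[target_index])
    scores = {
        term: count * math.log(n_docs / df[term])
        for term, count in tf.items()
    }
    return [t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

docs = [
    "entrepreneurial university technology transfer programs",
    "scientific collaboration and coauthorship networks",
    "female scientist research productivity outcomes",
]
top = top_terms_tfidf(docs, 0, k=3)
```

Terms that appear in every document receive an idf of zero, so distinctive terms dominate the ranking.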

Table 3. Topics identified in seed corpus for NCSES data assets.

Topic Name | Topic ID
Entrepreneurial University; Academic Entrepreneurship; University Technology Transfer | 712
Co-Authorship; Scientific Collaboration; Scientometric | 2681
Female Scientist; Research Productivity; Woman In Science | –
Career Adaptability; Career Decision Self-Efficacy; Social Cognitive Career Theory | –
Scientific Workforce; Career Outcome; Organized Financing | –
Intellectual Structure; Co-Citation Analyse; Scientometric | –
Patent Holder; Patenting; Intellectual Property Right | –
Over time, new Topics will surface, and as Topics are dynamic, they will evolve. To illustrate this and using the examples in the table above, the Topic with Topic ID 712 has subsequently been renamed as “Entrepreneurial University; Academic Entrepreneurship; Innovation” and the Topic with Topic ID 2681 is now “Co-Authorship; Scientific Collaboration; Bibliometric Analysis.”

In using Topics to restrict the search space in this way, the assumption being made is that each Topic is a proxy for a community of researchers who cite each other and work on a common intellectual interest, and are thus likely to use similar other resources, including data assets.

Once the Topics associated with the seed corpus records have been identified, the list of Topics is restricted to those with counts of five or more records. We have employed this restriction as an ad hoc way of further minimizing the possibility of ‘false positive’ matches being generated for a data asset, on the assumption that false positives will be concentrated in those Topics with only a small number of records found in the text search of ScienceDirect. The candidate Topics are then reviewed by both the project team and the agency (to varying degrees, depending on the agency) to assess their relevance to the target data assets. It is this list of agreed Topics that is then used to define the search corpus in Scopus.
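The count-based Topic restriction described above amounts to a few lines of Python (an illustrative sketch; the field name is assumed):

```python
from collections import Counter

def candidate_topics(seed_records, min_count=5):
    """Keep only Topics appearing on at least `min_count` seed-corpus
    records, as an ad hoc guard against false-positive-driven Topics."""
    counts = Counter(rec["topic_id"] for rec in seed_records)
    return sorted(t for t, n in counts.items() if n >= min_count)

seed = [{"topic_id": 712}] * 6 + [{"topic_id": 2681}] * 2
topics = candidate_topics(seed)  # 2681 has only 2 records, so it is dropped
```

The surviving list would then go to the project team and agency for relevance review.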

The approach we have used so far has been driven by pragmatic considerations based on a number of assumptions. We recognize that it lacks a formal underpinning evidence base. As we move forward, we would propose to undertake additional research to test whether our assumptions were indeed valid. In particular, we would like to explore the following questions:

  • To what extent does using a seed corpus and research topics from ScienceDirect help to reduce the number of false positives in the search corpus?

  • Assuming such a seed corpus is needed, what filter in terms of research records would best balance the false positive rate with the total number of mentions identified?

3.4. Defining the Search Corpus

Using Scopus, two search corpora are employed in the subsequent search routines.

  • Full-text search. Full-text records in the Scopus database that satisfy all of the following conditions: (i) they comply with license conditions, (ii) they are associated with one of the Topics identified in the seed corpus, and (iii) they comply with other agency-defined parameters.

  • Reference search. The full Scopus data set (with no Topic filtering) using the date range specified by the agency and any other parameter that has been predefined (e.g., include only those records with at least one U.S.-affiliated author).
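The two corpus definitions above can be summarized as simple membership predicates. This is an illustrative sketch; the record fields and function names are assumptions, not the project's actual schema.

```python
# Hypothetical membership predicates for the two search corpora.

def in_full_text_corpus(record, allowed_topics, year_start, year_end):
    """Full-text corpus: (i) license permits full-text search, (ii) the
    Topic was agreed from the seed corpus, (iii) agency parameters hold
    (here, only the date range is modeled)."""
    return (
        record["full_text_licensed"]
        and record["topic_id"] in allowed_topics
        and year_start <= record["year"] <= year_end
    )

def in_reference_corpus(record, year_start, year_end, require_us_author=False):
    """Reference corpus: all of Scopus, restricted only by the date range
    and any predefined parameter (e.g., a U.S.-affiliated author)."""
    if require_us_author and not record["has_us_author"]:
        return False
    return year_start <= record["year"] <= year_end

rec = {"full_text_licensed": True, "topic_id": 712, "year": 2019, "has_us_author": True}
ok_full_text = in_full_text_corpus(rec, {712, 2681}, 2017, 2022)
ok_reference = in_reference_corpus(rec, 2015, 2022, require_us_author=True)
```

Note that the reference corpus is strictly broader: a record excluded from the full-text corpus by licensing or Topic can still be reference-searched.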

4. Process Description: Search Routines

The technical process underlying the machine learning algorithms that are employed for conducting the search of the full text is explored in detail in a separate article within this HDSR special edition (Hausen & Azarbonyad, 2024). In this section, a summary of the models will be provided, but the focus here concerns the search process and the outputs that are generated from that process.

4.1. Searching the Full-Text Corpus

The search of the full-text search corpus employs three open source machine learning models, originally developed as part of the Kaggle competition “Coleridge Initiative - Show US the Data” (Coleridge Initiative, 2021a; Lane et al., 2022). The models are all available in the Coleridge GitHub repository.

  • Model 1 (Deep Learning - Sentence Context). This model uses a deep learning approach to learn what kinds of sentences contain references to a data asset. Model developed by Nguyen Tuan Khoi and Nguyen Quan Anh Minh (Coleridge Initiative, 2021b).

  • Model 2 (Deep Learning - Entity names). This model extracts names of entities from the text and uses a deep learning approach to classify an entity as being a data asset or not. Model developed by Chung Ming Lee (Coleridge Initiative, 2021c).

  • Model 3 (Pattern Matching). This model takes a rule-based approach to search for patterns in the document that are similar to a list of existing data assets. Model developed by Mikhail Arkhipov (Coleridge Initiative, 2021d).

The infrastructure used for running the search process is the same as for the seed corpus search; that is, we run a notebook on Databricks, which distributes the search on a multi-worker Spark cluster. Each worker is a machine on the AWS cloud.

It is worth emphasizing that Models 1 and 2 identify generic data assets (i.e., they do not take as their input a list of data assets to search, but rather identify entities consistent with being a data asset). Model 3 identifies both specific target data assets (based on the input) and also generic data assets. More detail is provided in the associated paper by Hausen and Azarbonyad (2024).

The search employs all three of these models. In addition to the identified data asset, each model generates a score that reflects the certainty the model has about the identified mention. The generation of the score is built into each algorithm. We do not apply any thresholds with regard to the scores, but rather ingest the full output of candidate data assets found by the algorithms. 

The specific text identified by the models as being a candidate data asset is extracted and stored. The models do not, themselves, record the location of the candidate data asset within the full-text document. However, using the identified text, and where licenses allow, the team generates a ‘data snippet’ from the full text showing 235 characters immediately before and after the candidate data asset text string. These snippets are used in a later step to manually validate whether a match is a true match or a false positive.
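Snippet generation of this kind can be sketched as follows, assuming the mention can be re-located in the full text by simple string search. The helper is hypothetical, not the project's implementation; the 235-character window follows the description above.

```python
def make_snippet(full_text, mention, window=235):
    """Locate the first occurrence of a candidate mention and return it
    with up to `window` characters of context on each side. The models
    report the matched text, not its position, so we re-find it here."""
    start = full_text.find(mention)
    if start == -1:
        return None  # text differences or licensing may prevent a match
    end = start + len(mention)
    return full_text[max(0, start - window):end + window]

text = "a" * 300 + "Census of Agriculture" + "b" * 300
snippet = make_snippet(text, "Census of Agriculture")
```

A validator then sees the mention in context without needing access to the whole document.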

Given that each full-text record in the search corpus is searched by each algorithm, there is a range of logical possibilities for each full-text record searched, as follows:

  • No data asset found in the record.

  • Single data asset reference found.

  • Multiple references to a single data asset found.

  • Single reference to multiple data assets found.

  • Multiple references to multiple data assets found.

As indicated, given that two of the models focus entirely on finding generic data assets, many of the data assets found in this initial search will refer to nontarget data assets. At this stage, the matches represent only ‘candidate’ matches. The next step, therefore, is to identify which of the references are to the target data assets defined by the partner agency. This is achieved by applying a fuzzy text-matching algorithm using the agreed target data asset names and their aliases.

The fuzzy matching algorithm used is FuzzyWuzzy, an open source Python package (Seatgeek, 2020). The fuzziness allows for syntactic differences between the data asset names and candidate mentions while capturing the intention of the authors, thereby increasing recall. As with the machine learning algorithms, a score is generated for each identified pairing (i.e., between a candidate detection and a target data asset). A candidate is considered a match when its score exceeds a threshold value. That threshold is not fixed and is adjusted depending on the types of aliases that have been provided: for shorter aliases, where more spurious matches will be generated, the threshold is set higher. The threshold employed typically has a value between 0.65 and 1.
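The thresholded matching can be illustrated with the standard library's difflib as a stand-in for FuzzyWuzzy (whose 0–100 ratios correspond to the 0–1 scale used here). The function name and the example threshold are assumptions, not values taken from the project code.

```python
from difflib import SequenceMatcher

def best_alias_match(candidate, aliases, threshold=0.8):
    """Return (alias, score) for the best-scoring alias if its similarity
    clears the threshold, else None. difflib's ratio stands in for
    FuzzyWuzzy's scoring in this sketch."""
    def score(a, b):
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()
    alias, s = max(((al, score(candidate, al)) for al in aliases),
                   key=lambda pair: pair[1])
    return (alias, s) if s >= threshold else None

aliases = ["Higher Education R&D Survey", "HERD"]
# A candidate mention with minor punctuation/spacing differences still matches.
hit = best_alias_match("Higher Education R & D survey", aliases, threshold=0.8)
```

Raising the threshold for short aliases (as the text describes) simply means passing a larger `threshold` when the matched alias is brief.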

The fuzzy text process results in a set of dyads being identified (see the Introduction of this article for dyad definition). The logical possibilities described above for the machine learning algorithms also apply here. For example, following the fuzzy search, a given research output may contain three potential matches to the same target data asset and this would generate three dyads. Likewise, a given research output, with matches to two different target data assets and also to an alias of one of those data assets, will also result in three dyads.

Where data asset aliases are short acronyms or very generic in nature, this fuzzy matching process can result in a high number of false positives. To reduce that possibility, we adjust the threshold value as described above. We can also apply an additional filter: fuzzy-matched short acronyms are only included in the output if they are associated with an additional text ‘flag,’ such as the name of the agency responsible for that data asset. Where it has been decided to include a flag term, the full text of the record is searched again.
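The acronym ‘flag’ filter might look as follows. This is an illustrative sketch only: the length cutoff, field handling, and function name are assumptions, not taken from the project code.

```python
def accept_match(alias, full_text, flag_terms, max_acronym_len=5):
    """Accept a fuzzy match outright unless the alias is a short acronym,
    in which case require a supporting 'flag' term (e.g., the responsible
    agency's name) to also occur somewhere in the full text."""
    is_short_acronym = alias.isupper() and len(alias) <= max_acronym_len
    if not is_short_acronym:
        return True
    text = full_text.lower()
    return any(flag.lower() in text for flag in flag_terms)

flags = ["National Science Foundation", "NCSES"]
# 'HERD' matched against an agricultural paper: no flag term, so rejected.
rejected = accept_match("HERD", "grazing patterns of the cattle herd", flags)
# 'HERD' in a paper that also mentions NCSES: accepted.
accepted = accept_match("HERD", "data from the HERD survey maintained by NCSES", flags)
```

Longer, unambiguous names such as “Census of Agriculture” bypass the flag check entirely.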

4.2. Searching the References

A search through the references list of each Scopus record is undertaken as a separate and distinct step from the full-text search. The search corpus here is broader than for full text, as there are no license conditions restricting the search. In addition, because references contain highly structured data, it becomes feasible to search through all of Scopus, as the computational limitations of full-text search do not apply.

Publication references are searched using a string-matching process to check for individual data asset names and their aliases. As with the full-text search, where the data asset name or alias is a short acronym, a possible match is only registered if the acronym is also accompanied by an additional text ‘flag,’ such as the name of the agency. For reference search matches, the data snippet is the actual citation itself.

Table 4 provides an illustration of the results generated from the reference search for the National Center for Science and Engineering Statistics of the National Science Foundation.

Specifically, this illustrates that the reference search covered more than 27 million Scopus records (i.e., those from 2015 to 2022 inclusive) and identified a further 5,373 publications containing potential matches to the target data assets that had not been found by the Kaggle and fuzzy matching process, together with 679 publications already identified by that process. To put this in context, the Kaggle and fuzzy matching processing itself returned 31,148 records, so adding the reference search increased coverage by some 17%.

Table 4. Results of the reference search for NCSES target data assets (publications from 2015 to 2022):

  • Number of publications for which references were searched: more than 27 million

  • Number of publications found with a target data asset that were also identified in the Kaggle and fuzzy match process: 679

  • Number of publications found with a target data asset that were not also found by the Kaggle/fuzzy match process (i.e., unique to the reference search): 5,373

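The quoted coverage increase can be checked with quick arithmetic on the figures above:

```python
# Quick check of the coverage figures quoted in the text.
kaggle_fuzzy = 31_148   # publications from the Kaggle/fuzzy full-text search
unique_to_refs = 5_373  # publications found only by the reference search

increase = unique_to_refs / kaggle_fuzzy
print(f"{increase:.1%}")  # -> 17.2%, i.e., 'some 17%'
```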

5. Validation

The output of the search routine is a file that links a Scopus record to the data asset mentioned (dyad). Additional data are also included in the output: the algorithm that identified the mention; a Kaggle algorithm score (where relevant); the data ‘snippet’ (a short excerpt of text surrounding a candidate data asset mention); and a fuzzy text score (where relevant).

5.1. Dyad Validation

The Kaggle/fuzzy text search of the full text will have resulted in a set of publications with at least one dyad (candidate match) to a target data asset. Assuming no licensing restrictions apply for the publication, snippets will be generated and displayed for all these dyads. In addition, for these publications, snippets are also generated for any other data assets identified by the Kaggle models, that is, for co-occurring data assets.

The nature of the data assets and aliases employed together with the machine learning approaches used mean that the set of output dyads will inevitably contain a number of false positive detections. To provide quality data, a human process of validation is carried out prior to publishing the results for consumption. The key to the success of this step is to engage suitable subject matter experts. Inevitably, there can be considerable ambiguity in the data set names even where the models have been very accurate in matching terms. For example, a data asset name such as ‘Annual Business Survey’ is likely to be replicated across many non-U.S. agencies. Many cases will not be relevant. Subject matter experts thus need to understand the context associated with the research output within which a dyad has been identified.

The validation additionally provides feedback to the machine learning models by expanding the set of known dyads (and excerpts) to train models against.

The human reviewers are asked to review individual snippets alongside the mention candidate, as follows:

  • Is the mention candidate within the snippet text a mention of a data asset? This addresses the goal of the dyad search process: to find mentions of data assets within full text. The user is shown the full snippet with the mention candidate highlighted.

  • Is the mention candidate referring to the input data asset or alias that was matched? This provides feedback on whether the fuzzy string matching was performed correctly. Snippets that receive a positive response to this question are those of interest to the end user. A positive response here also implies a positive response to the previous question (the candidate cannot be referring to the input alias if it is not itself a mention of a data asset).
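The dependency between the two questions can be sketched as follows (the field names are illustrative, not the actual validation schema):

```python
# Sketch of how the two review questions combine for a single snippet.
from dataclasses import dataclass


@dataclass
class SnippetValidation:
    is_dataset_mention: bool        # Q1: is this a mention of a data asset?
    refers_to_matched_alias: bool   # Q2: does it refer to the matched alias?

    def __post_init__(self):
        # A 'yes' to Q2 implies a 'yes' to Q1; the reverse is inconsistent.
        if self.refers_to_matched_alias and not self.is_dataset_mention:
            raise ValueError("Q2 cannot be positive when Q1 is negative")

    @property
    def valid_dyad(self):
        """Only snippets answering both questions positively count as
        valid dyads for end users."""
        return self.is_dataset_mention and self.refers_to_matched_alias
```

Encoding the implication as an invariant (rather than trusting reviewers to apply it) is one simple way a validation tool could enforce consistency.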

Although the first question on its own is relevant to the machine learning models, validations are only carried out on those snippets for which a matched alias/data asset exists. The principal motivation is to limit the work required, and also the lead time, associated with producing the result of interest to stakeholders. If sufficient resources were available, the validation set could be expanded to cover all identified mention candidates (skipping the second question where it is not relevant).

With sufficient confidence in the machine learning models, we can envision a state where only a randomized portion of the data is submitted to validation, or an adaptive process where the results from such a portion are used to gate decisions on further expanding the validation set.

However, we have not yet reached the level of confidence where randomized samples can be taken for validation and, from an agency perspective, a key question remains: How much validation should we undertake? We intend to explore that question by assessing the volumes of matches that are generated at different model thresholds.

5.2. Result Validation and Feedback

Validation serves two major purposes: to filter the set of automated detections on the results of interest for consumption by stakeholders and other data consumers; and to provide feedback to further tune machine-learned classifiers in order to produce results with lower false-positive rates.

Satisfying the first purpose would seem to follow quite naturally from the results of validation; however, there is a small but significant detail that must be considered: validation operates on individual snippets of text associated with a publication, mention candidate, and alias, while the output available to consumers is the more general pair matching publication to alias. The two principal reasons for this are:

  • Snippets (and full-text in general) are licensed and thus can generally not be shared with the end-user, and

  • The search models (those in use at the time of writing) have been created in such a way to identify the mention of a data set within the text, but not individual locations of such.

These align with the overarching goal of identifying usage of data, and not specifically the mentioning of it or the context in which it was mentioned; the mention simply serves as a proxy for usage in the current implementation. In the current approach, we consider each publication-alias pair where both questions were answered affirmatively to be a valid mention. This enables the end user to count the number of valid mentions within a publication (though this count is not guaranteed to be exhaustive), but it also requires deduplication when constructing particular statistics on the data.
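A minimal sketch of that deduplication step, assuming a per-snippet record layout of (publication_id, alias, Q1, Q2) that is invented for this example:

```python
# Sketch: collapsing per-snippet validations to distinct publication-alias
# pairs before computing statistics (the tuple layout is illustrative).
validated_snippets = [
    # (publication_id, alias, q1_yes, q2_yes)
    (101, "Annual Business Survey", True, True),
    (101, "Annual Business Survey", True, True),   # second snippet, same pair
    (102, "Annual Business Survey", True, False),  # not a valid mention
]

valid_pairs = {(pub, alias)
               for pub, alias, q1, q2 in validated_snippets
               if q1 and q2}
print(len(valid_pairs))  # -> 1: duplicate snippets within a publication collapse
```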

At the time of writing, over 55,000 validations have been performed by members of the team or subject matter experts from the various partner agencies. This in and of itself forms a valuable data set on which to train text-based models in general. Within those, ~20% have been marked as false positives (i.e., the identified mention was determined not to be a mention of a data set) and ~75% as true positives, while the remainder are ambiguous. This indicates that there is also value in using the validation data as feedback into the models used for this project.

Some work has gone into improving the performance of the text search; however, the infrastructure for automating the feedback (the dashed feedback loop containing ‘Retrain models’ in Figure 1) has not yet been constructed. Even for a hypothetically very well-performing model, in a production system there is most likely a case for spot checking using this validation process and periodically incorporating the results back into the models via retraining, and this is a future direction that we plan to explore.

5.3. Machine Learning Model Validation

In addition to engaging the community in validating the dyads, we intend also to engage the community in the validation of the models themselves. The infrastructure and approach for this (and use of a sandbox environment) is described in Section 6.5.

Within this controlled environment, it is possible for selected users to explore the results the team has generated and to run the models on a data extract from ScienceDirect. ScienceDirect is used here because, unlike with Scopus, the intellectual property for the full-text content either already resides with Elsevier or is confirmed as usable in this sandbox environment. We fully recognize the positive contributions in terms of technical sophistication and trust to be made in model development by engaging experts in the community—indeed, the present models were generated via an open community effort hosted on Kaggle. We remain committed to developing these capabilities going forward.

6. Production Process: Production API and Production Database With Agency-Specific Results

The metadata associated with the dyad publications represents the core output of the data asset search. The metadata is generated from Scopus using the Publication ID. The metadata includes details of the publication source (e.g., journal name), the named authors, the organizational affiliations, the research topic, the citations that the publication has received, and the Field Weighted Citation Impact (FWCI). FWCI is an article-level, field-normalized citation metric that is used to compare the academic impact of a paper (Purkayastha et al., 2019). In addition, specific indicators of interest to agencies can be included. Full details of the metadata provided are available in the Democratizing Data technical support material (Baas & Lemson, 2024).

We provide several modes of data access, ranging from high-level views of data usage that address specific questions through to deep exploration via arbitrarily complex querying. The goal is to offer access to information via different tools that cater to different use cases and to users with different technical skillsets.


6.1. REST API

Application programming interfaces (APIs) have emerged as pivotal tools in the scientific community, enabling seamless data sharing, fostering collaboration, promoting transparency, and accelerating innovation (Qi et al., 2022). APIs provide a standardized interface to access, retrieve, and merge data from disparate sources, resulting in comprehensive data assets for analysis. The team provides a REST (representational state transfer) API to access the data generated by the search process. REST is a widely accepted and standardized method for facilitating communication between software applications over the internet. At its core, it adheres to principles of simplicity and statelessness, making it a preferred choice for enabling seamless interaction between clients, such as web browsers or mobile apps, and remote servers. By disseminating Democratizing Data information via an API, we ensure that production-quality data is available to the various stakeholders who need it.

The Democratizing Data API endpoints play a pivotal role in promoting data sharing and integration, thereby empowering stakeholders to derive novel insights and discoveries that may not be attainable using isolated data sets. The API acts like a digital bridge—connecting pieces of information together. On the back end of the API is a relational database. The database contains a large set of tables such as agencies, keywords, and data sets—the API brings together tables into a single machine-readable endpoint by acting as an intermediary that retrieves, processes, and presents data.

The API is built using FastAPI, a Python-based backend framework that provides routes to database objects along with a dynamically generated set of documentation and OpenAPI specifications. The API is containerized, served from Kubernetes, and performs calls to functions in the PostgreSQL database. These PostgreSQL function calls optimize query time by referencing precomputed database entries in order to decouple query runtime from query complexity. Database updates are propagated to the API via a rollover mechanism that ensures the API will always be performant enough to drive user interface (UI) applications with minimal downtime.

For those with the necessary skillsets, our REST API makes it simple to expand capabilities beyond what is available in the prebuilt dashboards. We provide an HTTP-based application programming interface (hereafter web-API) that can be accessed from any internet-connected device. The web-API has endpoints that collectively allow queries for individual components of the data as well as aggregated queries across it, in machine-readable form, for both scripting and visualization purposes. APIs play a pivotal role in promoting data sharing and integration in scientific research (Kumar & Falhi, 2022): by providing a standardized interface, they enable researchers to access, retrieve, and merge data from disparate sources, fostering comprehensive data assets for analysis and empowering stakeholders to derive novel insights that may not be attainable using isolated data sets. This type of use complements the dashboards, enabling customized or integrated layouts for a wide range of use cases.

Expanding the API can be a strategic move to enhance functionality and continue to meet the needs of our stakeholders. To achieve this, the API can be expanded with new endpoints, security enhancements, and improved testing and deployment features. Adding new endpoints exposes additional features and data while ensuring backward compatibility through versioning. Examples include aggregate queries with advanced filtering that combine multiple pieces of information into one coherent query, answering questions such as: What are the most cited publications published in the state of California? Who are the top authors associated with a publication title? Answering such questions with a single query would make the API more effective. Furthermore, the API can expand security and control with various authentication methods and fine-grained authorization. Currently, the API endpoints are all publicly available, but this can be expanded with authentication and access control lists. Lastly, more robust testing and caching mechanisms can be implemented to improve reliability and response times, while comprehensive documentation, developer tools, and error-handling enhancements simplify integration for developers. Regular user feedback and communication with developers can help prioritize expansion efforts effectively.

6.2. Dashboards

Demonstrating the value of data through an online dashboard offers several benefits for organizations, stakeholders, and users. Online dashboards make information accessible to users in a variety of domains (Bach et al., 2022): they provide a visual representation of data, making complex information more understandable and actionable. Our dashboards use an API backend (see Section 6.1) and the Tableau visualization tool. Tableau is a widely used data visualization and business intelligence platform for creating interactive and dynamic dashboards; it allows users to connect, visualize, and share data in an intuitive, user-friendly manner. By leveraging its strengths in transforming complex data into visually appealing and interactive dashboards, we facilitate data-driven decision-making: displaying key information and metrics in an easily digestible manner lets users monitor trends, enabling faster and more accurate decisions. Furthermore, the dashboards allow us to offer customized views in which users select the specific metrics they want to monitor.

At a high level, the web-based online dashboards provide statistics on the authors, publications, topics, and geography of data asset usage. The dashboards support interactive filtering of the data with corresponding visualization updates. They provide a simple preconfigured interface where stakeholders can interact with the rich data products in an ‘out-of-the-box’ manner: knowledge of visualization tools, programming languages, web-based protocols, and so on is not required.

For each agency, dashboards are offered (see Figure 2 for an example) to show information at three different levels—publication level, by geographical location and institution, and usage over time. These offer a filtered view to display data set usage for each agency and allow users to filter and dive deep into data set information.

Figure 2. Dashboard screenshot.

6.3. SciServer

At the most granular level, direct access to query the underlying relational database management system (RDBMS) is provided via the SciServer platform (Taghizadeh-Popp et al., 2020). SciServer provides a unified science platform, exposing petabytes of data archives while providing compute capabilities to make use of them without download and via a low-latency high-bandwidth internal network. SciServer also has collaborative features restricting access to data and resources while enabling sharing. All of this combined with the rich data model and arbitrary SQL (structured query language) query capability means that even the most esoteric questions posed to the data can be explored interactively by those with the necessary coding expertise (typically in Python or R).

SciServer was born out of an NSF Data Infrastructure Building Blocks (DIBBs) grant to create a system for leveled and collaborative access to big (measured in PBs) scientific data and computational facilities. SciServer is both a software stack deployed at several locations globally and a production system at Johns Hopkins University (JHU) serving thousands of users worldwide for free. The JHU system gives file and database access to over 14PB of high-importance scientific data in a variety of fields. These include images and spectra from the Sloan Digital Sky Survey (Szalay et al., 2000), the Johns Hopkins Turbulence Database (Li et al., 2008), the Indra Cosmological Simulations (Falck et al., 2021), the Recount3 data set of uniformly processed RNA-seq data (Wilks et al., 2021), and a copy of the NASA High Energy Astrophysics Science Archive Research Center (HEASARC, 2022). Beyond these, SciServer hosts data and is utilized in wide-ranging disciplines including astronomy, cosmology, materials science, fluid dynamics, earth sciences, genomics, biology, and other life sciences. Many of these data sets are made publicly accessible; however, a cornerstone of SciServer is its flexible access control system, enabling administrators, teams, and users to grant fine-grained access to certain resources, including files, databases, and compute resources. Curated restricted data sets form a portion of the SciServer archive.

Complementing the archival data, SciServer provides users with compute resources to make use of the full scale of the data. Compute resources are in close proximity to the data, connected by high-speed ethernet networks to the storage backbone. Combined with many-core CPU, GPU, and parallel compute systems, this puts analysis and discovery on scales that would otherwise be prohibitive within reach. Several interfaces to compute are available; we describe the primary general-purpose interface, Jupyter, below.

Generally, all these data and compute resources are made available for no cost to the user, making SciServer popular among a diverse group of researchers around the globe for collaborative research, but also for classroom use.

6.4. Jupyter Notebook Environments

Jupyter (Kluyver et al., 2016) has been a boon for the scientific community owing to its format, which enables the mixture of text-based documentation, computer code, and graphical outputs (including plots) within the same document, making iterative analysis and discovery simple. Figure 3 shows a screen grab highlighting these major components of a Jupyter Notebook. Further, the web-based technologies in the user interface are conducive to both local and remote installation, the latter enabling code to run on a variety of hardware while requiring only a web browser on the client system. SciServer (and many of its peers) leverages this feature to provide an interface to computing resources and preconfigured software environments packaged with scientific code, including domain-specific environments such as those for oceanography or high-energy astrophysics.

Combined with the user interface, software environments, and compute hardware, SciServer also enables users to save and share results produced in analysis. So-called user-volumes are mapped to a filesystem directory on the underlying hardware and are accessed in that manner within a notebook environment. Private by default, user-volumes can be shared with other SciServer users, enabling collaborators to share both code and results in a familiar manner, without the requirement to transfer data.

 Figure 3. Screen grab of a portion of a Jupyter Notebook on SciServer, showing markdown-based text documentation, code cells, and graphical output all within the same document.

6.5. Direct Database Access

While the REST APIs described in Section 6.1 provide complete access to the data, there are some limitations to using them for general-purpose work because they were designed primarily to serve the online dashboards (Section 6.2). For example, one API call is not sufficient to extract all the available data, so in some cases several calls must be made and subsequently joined with independent code. For power users otherwise limited by this structure, SciServer provides an alternative.
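As an illustration of the client-side join just described, the sketch below combines two hypothetical endpoint payloads with stdlib code; the payload shapes and field names are invented for this example and do not reflect the actual API schema:

```python
# Sketch of joining two (hypothetical) API payloads with independent code.
import json

# Pretend these strings were returned by two separate endpoint calls.
dyads = json.loads('[{"publication_id": 101, "alias_id": 7}]')
aliases = json.loads('[{"alias_id": 7, "alias": "Annual Business Survey"}]')

# The join logic the client must write itself.
alias_by_id = {a["alias_id"]: a["alias"] for a in aliases}
joined = [{"publication_id": d["publication_id"],
           "alias": alias_by_id[d["alias_id"]]}
          for d in dyads]
print(joined)  # -> [{'publication_id': 101, 'alias': 'Annual Business Survey'}]
```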

SciServer enables access to the underlying relational database and all the tables for which there are no licensing issues in distributing the data (thus excluding, e.g., validation snippets). Arbitrary (read-only) SQL (here specifically T-SQL) queries can be issued against databases containing per-agency results, and the output of such queries can be further analyzed or plotted within the Jupyter Notebook environment. Utilities to perform queries against the database programmatically are available by default in all Python and R environments on SciServer. Figure 3 shows an example of querying one of the Democratizing Data databases using the Python SDK (software development kit) and plotting the results. The full relational data model and corresponding databases and tables are described in the technical support material for Democratizing Data (Baas & Lemson, 2024).

To motivate the utility of the direct database access mode, consider, for example, a situation where we would like to find the top authors by their respective data set usage, as measured by the number of publications they write mentioning a data set. For simplicity, we will consider only the first author and are interested in the top 10 by number of mentions across all publications and aliases. There are several ways to retrieve the necessary data via the API; however, all require more than a single call. The simplest would be to retrieve all records from the “/dyad,” “/publication_author,” and “/author” endpoints and write the join logic in external code. Using SQL, we can express this with a single query:

SELECT TOP 10 da.alias, concat(a.family_name, ', ', a.given_name) author, count(*) N
FROM dyad d
JOIN dataset_alias da ON d.alias_id = da.alias_id
JOIN publication p ON p.publication_id = d.publication_id
JOIN publication_author a ON a.publication_id = p.publication_id
WHERE a.family_name IS NOT NULL AND a.author_position = 1
GROUP BY da.alias, concat(a.family_name, ', ', a.given_name)
ORDER BY N DESC


This might produce output similar to that shown in Figure 4. From the listing above, it is evident that special skills and knowledge are required to make use of this feature; we expect that most consumers of the data will use the higher level offerings such as the dashboards or future products derived from the API, while power users will enjoy the flexibility of expressing queries in SQL along with the simplicity of analyzing results in Jupyter Notebooks on SciServer.
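For readers without SciServer access, the same join logic can be exercised locally against an in-memory SQLite copy of a toy schema. The table and column names follow the query above (the publication table is omitted because the toy dyad table already carries the publication ID), SQLite's LIMIT replaces T-SQL's TOP, and the data are invented:

```python
# Local illustration of the author-by-data-set-usage join on toy data.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dyad (publication_id INT, alias_id INT);
CREATE TABLE dataset_alias (alias_id INT, alias TEXT);
CREATE TABLE publication_author (publication_id INT, author_position INT,
                                 family_name TEXT, given_name TEXT);
INSERT INTO dyad VALUES (1, 7), (2, 7);
INSERT INTO dataset_alias VALUES (7, 'Annual Business Survey');
INSERT INTO publication_author VALUES (1, 1, 'Doe', 'Jane'),
                                      (2, 1, 'Doe', 'Jane');
""")

rows = con.execute("""
SELECT da.alias, a.family_name || ', ' || a.given_name AS author, COUNT(*) AS N
FROM dyad d
JOIN dataset_alias da ON d.alias_id = da.alias_id
JOIN publication_author a ON a.publication_id = d.publication_id
WHERE a.family_name IS NOT NULL AND a.author_position = 1
GROUP BY da.alias, author
ORDER BY N DESC
LIMIT 10
""").fetchall()
print(rows)  # -> [('Annual Business Survey', 'Doe, Jane', 2)]
```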

Figure 4. Showing output from an example SQL query to the relational databases hosted at Johns Hopkins University, accessible via SciServer.

In addition to the SciServer environment, Elsevier provides access for selected researchers to a Sandbox environment hosted by the International Center for the Study of Research (ICSR) Lab. The ICSR Lab is a cloud-based computational platform powered by Databricks on AWS infrastructure. It enables researchers to analyze large structured data sets, including from Scopus and ScienceDirect, by writing queries in ‘notebooks,’ which support the interactive running of code written in one of several programming languages, including Python/PySpark, SQL, R, and Scala.

For the Democratizing Data project, content from ScienceDirect has been provided to allow researchers to further optimize the machine learning models.

7. Future Directions

In this R&D phase, the project team has been ‘flying the plane as it’s being built,’ learning key points to increase efficiency, integrate quality assurance processes, and best demonstrate the use of data assets. Looking to the future, the goal is to operationalize the flow and production of data, to enable entities to examine data asset usage at scale and demonstrate the value of the painstaking results of data asset development to the public. Additionally, these collaborations have brought to the forefront additional next steps to incorporate increased transparency into these processes, improve finding data asset uses, improve understanding of data asset usage, and expand the definition of data asset use. We do not cover the development of the machine learning models in this section as that is covered in another article within this issue (Hausen & Azarbonyad, 2024).

7.1. Increasing Transparency Through the Administrative Dashboard

At the time of writing, agency feedback is done manually in an ad hoc exchange between the agency and the project team. In a future operational model, the intention is for this review to be conducted by the agency staff within an administrative dashboard that will also enable agencies to track data flow and production in real time.

With an administrative dashboard, an agency can directly enter data assets and aliases into a table that, when complete, will be transmitted (or pulled) to the Elsevier team. A possible model for such a dashboard is provided in Figure 5. The table retains information in a uniform manner with timestamps and logging and enables updates to be made and tracked. Data continues to be exchanged and transmitted via the administrative dashboard at the relevant points during the search, as follows:

  • Data Sets. The final list of data assets to be searched, along with aliases.

  • Search Corpus. Data from searching ScienceDirect, including list of topics and counts of associated publications and date of transfer.

  • Model Results. Data from searching the Scopus Search Corpus, including number of records in Scopus matching specified search parameters and topics of those records.

  • Validation. Progress on manual validation, including how many dyads will go through validation, and how many dyads have been validated.

  • Final Results. Validated dyads and metadata on associated Scopus records.
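As a sketch of the ‘uniform manner with timestamps and logging’ envisioned above, a data asset entry in such a dashboard might be modeled as follows (all field names are hypothetical):

```python
# Hypothetical record an administrative dashboard could keep per data asset,
# retaining a timestamp and a simple change log so updates can be tracked.
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DataAssetEntry:
    name: str
    aliases: list
    agency: str
    updated_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    history: list = field(default_factory=list)  # (timestamp, old aliases)

    def update_aliases(self, aliases):
        """Record the previous state before applying an update."""
        self.history.append((self.updated_at, tuple(self.aliases)))
        self.aliases = list(aliases)
        self.updated_at = datetime.now(timezone.utc)
```

Keeping the history inline is just one design choice; a production system would more likely log changes to an audit table.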

Figure 5. Illustration of potential administrative dashboard for agency use.

7.2. Expanding the Search Corpus

The current process looks to assess the impact of data assets by exploring how those assets have been used in research publications, that is, how they have been used by the research community in their endeavors to generate new knowledge. In terms of assessing data asset impact, we can call this the ‘research perspective.’

Other perspectives are, of course, equally important, for example, these data assets may be used in other activities, such as to underpin innovation, inform policy development, support wider economic activity, and to contribute to improvements in health and well-being. We consider each of these to be an important perspective worthy, in time, of being explored.

To support this, we would need access to other comprehensive sources of information. In this respect, we are already looking at other full-text repositories such as patent databases, clinical studies databases, or repositories of policy studies and these are likely to form the immediate next steps in the evolution of our thinking.

In addition, in the longer term, such thinking could be extended to explore how other data types could be used. For example, being able to analyze video content would open up the possibility of better exploring the extent to which news stories have relied upon, or drawn from, such data assets.

The research perspective has demonstrated that the impact of data assets can be successfully analyzed using the Democratizing Data approaches and it will be interesting to explore how readily those approaches can be applied to the other perspectives.

7.3. Incorporating Additional Measures of Use/Metrics

The metadata that is generated for the publications matched with the target data assets includes standard bibliometric measures such as citation counts, which can then be used to generate measures such as the FWCI. Such linked data can then be used to provide comparable information on the impact (scientific impact in the case of FWCI) of publications making use of the target data assets.

However, even if we assume a continued focus on the research perspective as the vehicle for which to assess the impact of the data assets, other options and ideas present themselves. The following provide some illustrative examples, although it is emphasized that these are not meant to be exhaustive:

  • Within the scientometrics community, the use of so-called alternative metrics such as downloads, news media mentions, and policy-related citations is increasingly being explored to capture impact beyond the scientific community, and our approach could readily be expanded to include such measures.

  • Similarly, the prior art cited by patents often includes nonpatent references, among them citations to research publications. Again, the inclusion of patent citations to publications would represent a readily achievable addition to capture knowledge translation into innovation.

  • Another aspect that could be considered is to assess the breadth of the research linked to specific target data assets. For example, and assuming such considerations are important to agency colleagues, bibliometric measures exist that would enable capturing the disciplinary diversity 1) of the diffusion of target data assets in the scientific literature and 2) of the knowledge/team members integrated in the publications associated with a data asset. Both perspectives can be quantified using a range of diversity metrics that are often associated with concepts such as interdisciplinarity and multidisciplinarity (Pinheiro et al., 2021), and such metrics could be applied to other diversity dimensions (e.g., geographic/sectoral diffusion and integration).

  • As a final example of the additional measures that could be considered, it is worth noting that the machine learning models employed to uncover references (formal or not) to target data sets within the scientific literature also identified ‘generic’ data sets, that is, false positives extracted from the machine learning models prior to the validation stage that are closely related to, but different from, the target data sets. If we could develop automated approaches to classify and group those data assets (‘generic’ plus target), it would be possible to assess the ‘market share’ of a target data asset, that is, how its use across publications compares with that of comparable data assets.
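As one concrete (and entirely illustrative) way to quantify the disciplinary diversity mentioned in these examples, normalized Shannon entropy over a data asset's per-field publication counts is a common choice; the field labels and counts here are invented:

```python
# Normalized Shannon entropy as a simple disciplinary diversity measure.
import math


def shannon_diversity(counts):
    """0 = all publications in one field; 1 = spread evenly across fields."""
    total = sum(counts)
    shares = [c / total for c in counts if c > 0]
    if len(shares) <= 1:
        return 0.0
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(shares))  # normalize to [0, 1]


concentrated = {"Economics": 95, "Sociology": 5}
spread = {"Economics": 40, "Sociology": 30, "Public Health": 30}
print(round(shannon_diversity(list(concentrated.values())), 2))  # low
print(round(shannon_diversity(list(spread.values())), 2))        # near 1
```

Many other diversity indices (e.g., those surveyed by Pinheiro et al., 2021) weight field similarity as well as shares; this sketch captures only the spread component.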

Of course, in exploring such possibilities, close collaboration will be needed with the agencies and their communities so that we can determine if such metrics will indeed be of interest and value. During such engagements, our expectation is that other measures of use/impact and their associated characteristics will emerge.

7.4. Incorporating Tailored Classifications

Within the research perspective that has formed the focus of our work, we have used the research topics that are offered with Scopus and its associated systems. This has afforded the team the opportunity to aggregate and analyze search outputs along a number of dimensions, including the All Science Journal Classification and the Sustainable Development Goals. The classification that has typically been employed is the Klavans and Boyack (2017) Topics described earlier in the article.

While this approach has allowed for detailed analysis to be undertaken, one weakness we have observed is a disconnect between the Topic names and the terms and concepts that are most meaningful to the agency teams themselves. In this respect it is worth noting that agencies typically have their own classifications or keywords, often reflecting their priority areas, which they use to both manage their activities and evaluate those activities.

A specific challenge the current team would like to address going forward would be to leverage artificial intelligence approaches to group and analyze the search outputs in a way that can be tailored to the agency classifications and priority areas. Again, close engagement with the agency and their communities will be needed to ensure that such models accurately represent the specific agency’s interests.

7.5. Expanding Dissemination of Data via Dashboards and API

The initial set of dashboards and API endpoints was tailored to specific agency feedback on what would best serve their needs and was of limited scope. Both the API and the dashboards can be expanded further. The API would be more useful to a wider variety of stakeholders if it offered more complex endpoints that aggregate multiple queries. Furthermore, the API codebase could be made publicly available through GitHub repositories, enabling agencies and interested parties to develop and contribute their own custom queries to the codebase. The dashboards, built on Tableau, can also be expanded and customized further to address each agency’s audience needs. Following the pattern of the existing dashboards, each agency could create domain- or audience-specific dashboards. Dashboards can then be used to further enable exploration of the data and to extract useful information depending on the audience, such as funding sources or publication authors.
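To make the idea of an aggregating endpoint concrete, the following sketch shows the kind of handler logic such an endpoint might wrap. The function names, record fields, and the in-memory stand-in for the database are all illustrative assumptions, not the project’s actual API:

```python
# Sketch of an aggregating endpoint: it combines results that would
# otherwise require separate calls (publication counts per year, top
# authors) into a single response. The data store is stubbed with dicts.

def publications_by_year(dataset_id, store):
    return store["pub_years"].get(dataset_id, {})

def top_authors(dataset_id, store, limit=3):
    authors = store["authors"].get(dataset_id, [])
    return sorted(authors, key=lambda a: a["pubs"], reverse=True)[:limit]

def dataset_summary(dataset_id, store):
    """One aggregate call instead of several client round trips."""
    return {
        "dataset": dataset_id,
        "publications_by_year": publications_by_year(dataset_id, store),
        "top_authors": top_authors(dataset_id, store),
    }

# Illustrative in-memory stand-in for the real database.
store = {
    "pub_years": {"nass-census-ag": {2021: 12, 2022: 17}},
    "authors": {"nass-census-ag": [
        {"name": "Author A", "pubs": 4},
        {"name": "Author B", "pubs": 9},
    ]},
}
summary = dataset_summary("nass-census-ag", store)
```

Bundling related results into one response reduces client round trips, which matters most for dashboard-style consumers that render several panels at once.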

8. Concluding Comments

Within this article we have sought to provide a description of the process we have developed to assess the impact of data sets and data assets that have been produced by our U.S. federal agency partners. We have sought to identify the limitations in the current approach and, in Section 7, to identify areas of future development (although we have excluded consideration of the machine learning models, which are covered by Hausen & Azarbonyad, 2024).

The Democratizing Data infrastructure we have described represents an end-to-end system to detect, and provide statistics and metadata about, usage of U.S. government agency-produced data within the scientific community. The system employs community-developed machine learning models that search for mentions of a data set within the industry-leading Scopus full-text archive. The processes and infrastructure we have developed include: telemetry from Elsevier to JHU; ingestion into a managed relational database system, making SQL available as an access mode; a system for validation of results by subject matter experts; and graphical dashboards, a web-hosted application programming interface, and Jupyter Notebook–based modes of data inspection and analysis.
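As an illustration of the SQL access mode, the following self-contained sketch runs an aggregate query against a deliberately simplified, hypothetical schema; the table and column names of the production database differ:

```python
import sqlite3

# Hypothetical, simplified schema linking data sets to publications.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dataset (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE publication (id INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE dataset_publication (dataset_id INTEGER, publication_id INTEGER);
INSERT INTO dataset VALUES (1, 'Census of Agriculture');
INSERT INTO publication VALUES (10, 2021), (11, 2022);
INSERT INTO dataset_publication VALUES (1, 10), (1, 11);
""")

# Publications per data set per year, the kind of aggregate that also
# feeds the dashboards and API endpoints.
rows = con.execute("""
    SELECT d.name, p.year, COUNT(*) AS n_pubs
    FROM dataset d
    JOIN dataset_publication dp ON dp.dataset_id = d.id
    JOIN publication p ON p.id = dp.publication_id
    GROUP BY d.name, p.year
    ORDER BY p.year
""").fetchall()
```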

While wishing to ensure we have a scalable, production-ready process, we recognize Democratizing Data is a journey both for us and for the agencies. In assessing future work priorities, we will therefore take our lead from the feedback provided by our agency partners. At the time of writing, we are aware that partners are particularly interested in how the results are being classified (Section 7.4). During the year ahead (2024), the introduction of alternative classifications is set to be one of the priorities.

The Democratizing Data Initiative has progressed rapidly since its inception and has already proven itself capable of producing an analysis of the impact of data sets that was hitherto unavailable to our agency partners. The collaborative approach we have employed is at the heart of this success. There are nevertheless many improvements yet to be introduced, and we look forward to working with our partners to ensure that the value associated with the public data sets they are responsible for is fully understood, appreciated, and leveraged by all interested communities. The government and the taxpayer expect nothing less.


Acknowledgments

The authors would collectively like to acknowledge the help and support of a number of colleagues who have helped develop and run the processes described in this article. In particular, we acknowledge Rafael Ladislau (Democratizing Data consultant), Manuchehr Taghizadeh-Popp (Institute for Data Intensive Engineering and Science, Johns Hopkins University), and Carrie Arnold, Christian R. Garcia, and Kevin Price (all at Texas Advanced Computing Center, University of Texas at Austin) for their essential contributions to software development. Furthermore, we are grateful for the contributions of Craig Jansen (Texas Advanced Computing Center, University of Texas at Austin) in relation to defining the user experience and creating the wireframes that were developed as part of these projects.

Disclosure Statement

As regards financial support, the authors collectively acknowledge the kind support of the Patrick J. McGovern Foundation, the National Center for Science and Engineering Statistics (NCSES) of the U.S. National Science Foundation (NSF), and both the Economic Research Service (ERS) and the National Agricultural Statistics Service (NASS) of the U.S. Department of Agriculture (USDA). In addition, Arik Mitschang and Gerard Lemson acknowledge support provided through the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.


References

Aghaei Chadegani, A., Salehi, H., Md Yunus, M. M., Farhadi, H., Fooladi, M., Farhadi, M., & Ale Ebrahim, N. (2013). A comparison between two main academic literature collections: Web of Science and Scopus databases. Asian Social Science, 9(5).

Archambault, É., & Larivière, V. (2010). The limits of bibliometrics for the analysis of the social sciences and humanities literature. In World Social Science Report: Knowledge Divides (pp. 251–254). UNESCO Publishing and the International Social Science Council.

Baas, J., & Lemson, G. (2024). The development of a functional data model for democratizing our data applications.

Bach, B., Freeman, E., Abdul-Rahman, A., Turkay, C., Khan, S., Fan, Y., & Chen, M. (2022, October 16–21). Dashboard design patterns [Paper presentation]. IEEE VIS conference, Oklahoma City, OK.

Coleridge Initiative. (2021a). Coleridge Initiative – Show US the data. Kaggle.

Coleridge Initiative. (2021b). rc-kaggle-models / 1st Zalo FTW /. GitHub.

Coleridge Initiative. (2021c). rc-kaggle-models / 2nd Chun Ming Lee /. GitHub.

Coleridge Initiative. (2021d). rc-kaggle-models / 3rd Mikhail Arkhipov /. GitHub.

Elsevier. (2022). Facts about …ScienceDirect. Retrieved February 21, 2024, from

Elsevier. (2023a). SciVal topics of prominence. Retrieved September 19, 2023, from

Elsevier. (2023b). Scopus content coverage guide. Retrieved September 19, 2023, from

Falck, B., Wang, J., Jenkins, A., Lemson, G., Medvedev, D., Neyrinck, M. C., & Szalay, A. S. (2021). Indra: A public computationally accessible suite of cosmological N-body simulations. Monthly Notices of the Royal Astronomical Society, 506(2), 2659–2670.

Foundations for Evidence-Based Policymaking Act of 2018, Pub. L. No. 115-435, 132 Stat. 5529 (2019).

Hausen, R., & Azarbonyad, H. (2024). Discovering datasets through machine learning: An ensemble approach to uncovering the prevalence of government-funded datasets. Harvard Data Science Review, (Special Issue 4).

HEASARC. (2022). HEASARC: NASA’s archive of data on energetic phenomena. Retrieved December 16, 2022, from

Khan, N., Thelwall, M., & Kousha, K. (2021). Measuring the impact of biodiversity datasets: Data reuse, citations and altmetrics. Scientometrics, 126(4), 3621–3639.

Klavans, R., & Boyack, K. W. (2017). Research portfolio analysis and topic prominence. Journal of Informetrics, 11(4), 1158–1174.

Kluyver, T., Ragan-Kelley, B., Pérez, F., Granger, B., Bussonnier, M., Frederic, J., Kelley, K., Hamrick, J., Grout, J., Corlay, S., Ivanov, P., Avila, D., Abdalla, S., & Willing, C., Jupyter Development Team. (2016). Jupyter Notebooks – a publishing format for reproducible computational workflows. In Positioning and power in academic publishing: Players, agents and agendas (pp. 87–90). IOS Press.

Kumar, B., & Al Falhi, O. (2022). Digital transformation through APIs. In 2022 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing, COM-IT-CON (pp. 623–628). IEEE.

Lane, J., Gimeno, G., Levitskaya, E., Zhang, Z., & Zigoni, A. (2022). Data inventories for the modern age? Using data science to open government data. Harvard Data Science Review, 4(2).

Lane, J., Spector, A., & Stebbins, M. (2024). An invisible hand for creating value from data. Harvard Data Science Review, (Special Issue 4).

Li, Y., Perlman, E., Wan, M., Yang, Y., Meneveau, C., Burns, R., Chen, S., Szalay, A., & Eyink, G. (2008). A public turbulence database cluster and applications to study Lagrangian evolution of velocity increments in turbulence. Journal of Turbulence, 9, Article N31.

McCullough, R. (2022, June 30). Scopus Roadmap: What’s new in 2022. Scopus.

Mongeon, P., & Paul-Hus, A. (2016). The journal coverage of Web of Science and Scopus: A comparative analysis. Scientometrics, 106, 213–228.

Nagaraj, A. (2024). A mapping lens for estimating data value. Harvard Data Science Review, (Special Issue 4).

Oliveira, A. S., de Barros, M. D., de Carvalho Pereira, F., Gomes, C. F. S., & da Costa, H. G. (2018). Prospective scenarios: A literature review on the Scopus database. Futures, 100, 20–33.

Pinheiro, H., Vignola-Gagné, E., & Campbell, D. (2021). A large-scale validation of the relationship between cross-disciplinary research and its uptake in policy-related documents, using the novel Overton altmetrics database. Quantitative Science Studies, 2(2), 616–642.

Purkayastha, A., Palmaro, E., Falk-Krzesinski, H. J., & Baas, J. (2019). Comparison of two article-level, field-independent citation metrics: Field-Weighted Citation Impact (FWCI) and Relative Citation Ratio (RCR). Journal of Informetrics, 13(2), 635–642.

Qi, L., He, Q., Chen, F., Zhang, X., Dou, W., & Ni, Q. (2022). Data-driven web APIs recommendation for building web applications. IEEE Transactions on Big Data, 8(3), 685–698.

Science-Metrix. (2016). Bibliometrics and patent indicators for the science and engineering indicators 2016—Comparison of 2016 bibliometric indicators to 2014 indicators.

Seatgeek. (2020). fuzzywuzzy 0.18.0. FuzzyWuzzy. Retrieved September 19, 2023, from

Szalay, A. S., Kunszt, P. Z., Thakar, A. R., Gray, J., & Slutz, D. (2000). The Sloan Digital Sky Survey and its archive. In N. Manset, C. Veillet, & D. Crabtree (Eds.), Astronomical Data Analysis Software and Systems IX (pp. 405–414). Astronomical Society of the Pacific.

Taghizadeh-Popp, M., Kim, J. W., Lemson, G., Medvedev, D., Raddick, M. J., Szalay, A. S., Thakar, A. R., Booker, J., Chhetri, C., Dobos, L., & Rippin, M. (2020). SciServer: A science platform for astronomy and beyond. Astronomy and Computing, 33, Article 100412.

White House Office of Science and Technology Policy. (2022). Ensuring free, immediate, and equitable access to federally funded research [Memo]. Executive Office of the President of the United States.

Wilks, C., Zheng, S. C., Chen, F. Y., Charles, R., Solomon, B., Ling, J. P., Imada, E. L., Zhang, D., Joseph, L., Leek, J. T., Jaffe, A. E., Nellore, A., Collado-Torres, L., Hansen, K. D., & Langmead, B. (2021). recount3: Summaries and queries for large-scale RNA-seq expression and splicing. Genome Biology, 22(1), Article 323.

©2024 Attila Emecz, Arik Mitschang, Christina Zdawczyk, Maytal Dahan, Jeroen Baas, and Gerard Lemson. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
