
Data on How Science Is Made Can Make Science Better

Published on Apr 28, 2022

Abstract

Science is an engine of innovation and economic growth and a pathway to prosperity for countries around the world. The increasing availability of scientific publications today poses a data-driven opportunity to better understand and improve science. Scientific publications contain data on the content of published research and metadata on the context that gave rise to that research. Here we discuss and demonstrate the power of constructing, archiving, and analyzing links between scientific data and metadata to construct massive computational observatories of and for modern science. We show how these can be constructed using modern graph databases, and suggest some methods of analysis with potential to unleash sustained value for science and society. These scientific observatories would allow us to diagnose the health of the scientific workforce and institutions, and track the rate of scientific advance. They could enable us to better guide science policy and build portfolios of supported research that balance our societal commitments to diverse participation and prosperity. Moreover, they could enable scientists to surf the deluge of published research to open the scientific frontier in directions that do not follow the current, but open up new views and opportunities for others to follow. Linked scientific data can also enable the construction of artificial intelligence agents designed to complement the disciplinary focus of human scientific attention by proposing possibilities overlooked or underfunded by contemporary scientific institutions. Finally, we argue for the importance of ongoing political and legal support for the promotion of open, linked data to facilitate widespread benefit.

Keywords: science of science, linked data, science policy, artificial intelligence, networks, innovation


Media Summary

Economic growth and sustained prosperity are supported by global scientific and technical advances. With emerging tools from data science, publicly available data on science can provide society with a unique and powerful opportunity to understand science and improve how it is organized, how it reflects robust facts about the world, and how it searches for useful discoveries. The scientific data most widely available come from scientific articles. Article data in the form of text describes the ‘what’ of science—what ideas do scientists claim, demonstrate, and defend about the social, natural, and engineered world. Article metadata describes the ‘who’ (performed and wrote up the research), ‘when’ (was the article submitted and published), ‘where’ (was the work performed and presented), and ‘how’ (was the research supported). In this article, we detail a research program that uses data science tools and databases to link and continuously analyze scientific article text and metadata across all of science in order to monitor and intelligently guide scientific activity to better reflect society’s values and accomplish society’s needs. The scientific observatories that result would enable society to monitor the diversity of scientists, the productivity of institutions that contribute to science, and the rate and direction of scientific advances in all areas. These insights could assist public policy efforts to improve the scientific enterprise through targeted funding for education and research. We provide specific examples of how scientific data observatories could help us better diversify participation in science, reduce the bias of scientific facts, and guide more effective scientific investigations. We also show how artificial intelligence algorithms can be trained on this data to help accelerate science. We conclude by discussing the importance of promoting open access scientific data for societal benefit.


1. Introduction

The exponential increase in scientific publications over the past two centuries has produced a deluge of scientific articles in an expanding array of fields (de Solla Price, 1963). When publicly available, these publications represent an array of scientific sensors deployed across the scientific system, sustained by a motivation for individual priority (Merton, 1957; Partha & David, 1994). Not only are there more papers published than ever before, but the open access movement and a growing culture of open preprints have unleashed them to the world (Evans & Reimer, 2009; Nielsen, 2012). This poses a powerful opportunity for science and society. Science is society’s primary engine of economic growth (Jones & Summers, 2020), and linking data and metadata can enable construction of scientific observatories that allow us to better demonstrate the value and distributed impact of science in society. By revealing aspects of how science is made, they also hold the potential to improve the efficiency and impact of science, multiplying the investments we make in it (Azoulay et al., 2018; Fortunato et al., 2018).

A growing number of social and natural scientists have begun to study scientific metadata about how individuals and institutions engage in science to understand and predict trends in scientific commitment, productivity, and attention, which can improve science policy (Fortunato et al., 2018; Merton, 1973; Partha & David, 1994). Metadata on scientific authors, institutions, funders, publishers, conferences, and host countries reveals how science is organized (Gazni et al., 2012; Larivière et al., 2016; Wu et al., 2019), supported (Azoulay & Li, 2020; Jacob & Lefgren, 2011), catalyzed, and disseminated (Fortunato et al., 2018). In parallel, scientific knowledge embedded within article text has begun to be leveraged by domain scientists to perform meta-analyses that yield improved certainty about scientific claims (Jadad et al., 1998) and to accelerate the development of high-value hypotheses for prediction (Spangler et al., 2014; Tshitoyan et al., 2019).

In this article, we argue for the critical, ongoing linkage of these two data sources and their combined analysis to create an observatory that could enable strategic action to increase value for science and society. Funding agencies and researchers often make choices that produce research papers in response to very local epistemic and pragmatic concerns (Knorr-Cetina, 2013; Latour, 1987; Latour & Woolgar, 1979). Continuous linkage between data and metadata would represent a valuable investment for global scientific funding agencies and could be used to dramatically enhance existing statistical programs, such as the annually produced U.S. Science and Engineering Indicators. Here we specifically propose the use of graph databases to increase the speed and facility with which scientific metadata and text may be combined and made available for modern data analytic techniques. Then we illustrate how combining metadata context on how science is made with article content data on scientific claims could dramatically improve scientific assessments of reasoned certainty (Belikov et al., 2020; Danchev et al., 2019), and also accelerate our ability to prospect for and predict new scientific discoveries (Evans & Rzhetsky, 2010; Rzhetsky et al., 2015; Sourati & Evans, 2021).

Accounting for the distribution of and dependencies between human scientists allows us to assess the independence of experiments and the likelihood of prior search within a region of scientific possibility. Because social reinforcement and repeated communication artificially inflate confidence, and because human collective attention is limited, taking into account the distribution of human scientists across scientific topics, methods, and claims can allow us to correct bias and reduce variance in assessments of scientific certainty and foresight. These insights could also form the basis of valuable information services to help individual scientists and scholars not only survive the deluge of publication data but ‘surf’ it to the open breaks that would make their research most consequential for science as a whole.

2. Representing the Graph of Science

Scientific processes, unfolding over time, can be naturally observed in the quanta of research represented in published articles. A publication is a collection of claims or propositions stated to be observed or experimentally demonstrated and implied to be true. The context leading to a published manuscript includes the formation of a scientific team whose members are affiliated with various institutions, the choice of data and tools, the selection of publishing medium, and the peer review process by which manuscripts are evaluated and certified by peers. Shared affiliations, tools, and publishing media (e.g., conference or journal) may serve as proxies for social awareness. One team is more likely to be aware of claims published by another if the two share an affiliation or present their work at the same event.

Scientific claims can thereby be placed within a complex network, partially observable through metadata. Given the vast number of publications in tens of thousands of journals, we need a natural way to represent information about scientific knowledge production that (1) enables the representation of multicomponent networks; (2) preserves the hyper-edges that connect multiple entities simultaneously, natively encoding our understanding of scientific artifacts and their production; (3) allows expressive querying to obtain both local network measures and local snapshots of the global network; and (4) facilitates efficient and scalable computation.

These requirements are partially satisfied by modern graph databases, such as Neo4j and ArangoDB, which provide scalable storage engines and expressive query languages. They are natural, nonspecialized tools that shift the burden of graph data manipulation from the researcher to the database engine. In Figure 1 we present a schema of vertex and edge collections, developed for ArangoDB, that simply represents Clarivate’s Web of Science data set. Complex computations such as Eigenfactor, which ranks the impact of a journal by its recursive influence on and citation by other journals (and their influence on still other journals, ad infinitum) (Bergstrom et al., 2008), can be expressed as a concise ArangoDB query, considerably more transparent than its MySQL counterpart (see Figure 2).
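To make this concrete, the following is a minimal sketch, using the python-arango driver, of how a small subset of such a schema could be declared. The database name and credentials are placeholders, and the collection names mirror those referenced in Figure 2 rather than the authors’ full schema.

```python
from arango import ArangoClient  # python-arango, the ArangoDB Python driver

# Connect to a local ArangoDB instance (host, database name, and
# credentials here are placeholders, not taken from the paper).
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("science_graph", username="root", password="passwd")

# A named graph with vertex collections for publications and for
# publishing media (journals, conferences), mirroring Figure 1.
graph = db.create_graph("web_of_science")
graph.create_vertex_collection("publications")
graph.create_vertex_collection("media")

# Edges from a publication to the medium in which it appeared.
graph.create_edge_definition(
    edge_collection="publication_journal_edges",
    from_vertex_collections=["publications"],
    to_vertex_collections=["media"],
)

# Citation edges between publications (the self-edge in Figure 1).
graph.create_edge_definition(
    edge_collection="publications_publications_edges",
    from_vertex_collections=["publications"],
    to_vertex_collections=["publications"],
)
```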

While network features have been used in the past to evaluate the validity and reach of scientific claims, the next frontier is to model the complete process of scientific production. One suitable framework is that of graph neural networks (GNNs). Particularly suitable approaches could include message passing neural networks (MPNNs), which account for the nonindependent diffusion of knowledge and tools across the scientific space; inverse reinforcement learning (IRL), which learns the institutional rewards of science from the actions scientists take; and differentiable neural computers (DNCs), designed to take all of these emergent features of science into account when predicting fruitful new directions. A minimal sketch of the message-passing idea follows.
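The sketch below implements one GCN-style message-passing layer over a toy citation network. It illustrates the family of models named above, not the authors’ architecture; all data are synthetic.

```python
import numpy as np

def message_passing_layer(adj, h, w):
    """One GCN-style message-passing step: each paper's embedding is
    updated from the degree-normalized embeddings of its neighbors.

    adj: (n, n) adjacency matrix (e.g., citations between papers)
    h:   (n, d_in) current node features
    w:   (d_in, d_out) learnable weight matrix
    """
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt  # symmetric normalization
    return np.maximum(a_norm @ h @ w, 0.0)    # aggregate, transform, ReLU

# Toy example: 4 papers; paper 0 cites papers 1 and 2, paper 3 cites 2.
rng = np.random.default_rng(0)
adj = np.array([[0., 1., 1., 0.],
                [0., 0., 0., 0.],
                [0., 0., 0., 0.],
                [0., 0., 1., 0.]])
adj = np.maximum(adj, adj.T)  # treat citation ties as symmetric here
h = rng.normal(size=(4, 8))   # initial embeddings, e.g., from text
w = rng.normal(size=(8, 4))
print(message_passing_layer(adj, h, w).shape)  # -> (4, 4)
```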

Figure 1. Connecting content and context. Top: Simplified Web of Science graph database schema. Self-edges indicate that one record references another (e.g., among publications, articles and conference papers reference other articles and conference papers; among organizations, some organizations, such as schools and departments, live within others). Bottom: Pipeline of papers through the graph database into novel data structures such as the mixed hypergraph illustrated, which facilitate simulations of random walk processes or the creation of representations suitable for use within graph neural networks.

In the top of Figure 1 we present a simplified graph of relations between entities in the Web of Science data set. Such a representation assists in linking scientists through, for example, a common publishing medium (e.g., journals) or common conferences. Such ‘latent’ interactions, in turn, affect the outcomes of published research projects. In the bottom of that figure, we illustrate a data pipeline that processes papers (P1–P3), through a graph database as above, into a data structure that enables joint analysis of content and context, here a mixed hypergraph containing scientific entities (colored shapes) from within the paper, such as proteins, genetic events, or mentioned methods, and contextual entities (clear shapes) from the paper byline, such as authors, organizations, and publishing media.
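One lightweight way to realize such a mixed hypergraph is as a bipartite graph in which papers form one node set and content and context entities form the other. The sketch below, with invented paper IDs and entities, shows this encoding and the alternating paper–entity paths that random walk processes would traverse.

```python
import networkx as nx

# Bipartite encoding of a mixed hypergraph: each paper acts as a
# hyper-edge connecting its content entities (proteins, methods) and
# context entities (authors, journals). All names are hypothetical.
papers = {
    "P1": ["author:A1", "journal:J1", "protein:TP53", "method:CRISPR"],
    "P2": ["author:A1", "author:A2", "journal:J2", "protein:TP53"],
    "P3": ["author:A3", "journal:J1", "method:CRISPR"],
}

H = nx.Graph()
for paper, entities in papers.items():
    H.add_node(paper, kind="paper")
    for e in entities:
        H.add_node(e, kind="entity")
        H.add_edge(paper, e)

# Two entities are linked when they co-occur in a paper: a path of
# length 2 through a paper node. Random walks alternate between
# papers and entities, mixing content with context.
print(nx.shortest_path_length(H, "author:A2", "method:CRISPR"))  # -> 4
```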

Figure 2. AQL query aggregating citations between journals. The query traverses the graph from journals to articles published in 1970, then to cited articles published in the five previous years, and then to the journals of the cited articles. We cycle over journals j from the collection media; for each j we cycle over publications p from year 1970 connected to j by directed edges from the collection publication_journal_edges; for each p we cycle over publications p2 cited by p (connected by edges from publications_publications_edges); and finally we cycle over the journals j2 in which each publication p2 was published. The last three cycles are aggregated with a COLLECT statement, and the number of documents is COUNTed. INBOUND and OUTBOUND specify the direction in which directed edges are accessed.
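Because the query in Figure 2 is only described in prose here, the following is a hedged reconstruction of what such an AQL traversal could look like, executed through the python-arango driver. The collection names come from the caption; the edge directions and the 1965–1969 citation window are assumptions, not the authors’ exact query.

```python
from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("science_graph", username="root", password="passwd")

# Reconstruction of the Figure 2 traversal: journals -> 1970 articles
# -> articles they cite (assumed 1965-1969) -> journals of the cited.
AQL = """
FOR j IN media
  FOR p IN 1..1 INBOUND j publication_journal_edges
    FILTER p.year == 1970
    FOR p2 IN 1..1 OUTBOUND p publications_publications_edges
      FILTER p2.year >= 1965 AND p2.year <= 1969
      FOR j2 IN 1..1 OUTBOUND p2 publication_journal_edges
        COLLECT citing = j.name, cited = j2.name WITH COUNT INTO n
        RETURN { citing: citing, cited: cited, citations: n }
"""
for row in db.aql.execute(AQL):
    print(row["citing"], "->", row["cited"], row["citations"])
```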

Now we turn to two cases that illustrate how modeling network dependencies in the production of science could improve scientific certainty and discovery.

3. Robust Replication in Science

The primary output of the scientific process is a collection of overlapping claims, but its goal is the identification and dissemination of scientific facts. Historical data on shifts in scientific methodology and epistemological standards show that the correlation between established facts and published claims is far from unity (Henrion & Fischhoff, 1986). The nonreproducibility of scientific claims, and its widespread acknowledgment over the past two decades (Ioannidis, 2005; Rzhetsky et al., 2006), has been experienced as a replication crisis in many fields, from psychology to genetics. By evaluating dependencies between published claims, we can separate the degree to which any particular claim has been supported by independent experimental evidence rather than by reinforcement within the scientific community. Dependencies between published investigations can be induced by communication through shared coauthors, by awareness and citation of the same prior work, and by deployment of the same tools and techniques.
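To see in stylized form why such dependencies matter, consider a claim supported by n confirmations whose errors share a pairwise correlation ρ induced by shared authors, tools, or venues. A standard variance calculation (a textbook result offered here as illustration, not drawn from the studies cited above) shows that the evidential weight of those confirmations shrinks to an effective sample size:

```latex
\operatorname{Var}(\bar{x}) \;=\; \frac{\sigma^2}{n}\bigl(1 + (n-1)\rho\bigr),
\qquad
n_{\mathrm{eff}} \;=\; \frac{n}{1 + (n-1)\rho}
```

Ten confirmations with ρ = 0.5 thus carry the weight of fewer than two independent experiments (n_eff ≈ 1.8), which is why accounting for network dependence can so sharply recalibrate assessments of scientific certainty.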

In our own prior work exploring hundreds of thousands of claims about drug–gene and gene–gene interactions from biomedical research, we found that the more researchers were connected to previously published claims by affiliation, methodology, or other network dependencies (e.g., publishing in the same journals, attending the same conferences), the more likely they were to confirm previous findings and the less likely those findings were to replicate in future independent experiments (Belikov et al., 2020; Danchev et al., 2019). By incorporating these dependencies into machine learning predictions of robust replication, we demonstrated dramatic improvements in our ability to predict which claims will replicate and which will not. By incorporating such predictions into scientific activity, we can improve the foundation of truth claims on which future science, technology, and medicine build. We can also broaden the focus of scientific meta-analysis from attempting to correct for overdispersion and the biasing influence of positive outliers (Borenstein et al., 2011) to additionally correcting for underdispersion and the biasing influence of apparently independent, but actually dependent, publications and claims.

4. Powerful Prediction for Science

Even with the expansion of scientists, journals, and articles, the rate of discoveries in many areas, such as the development of new molecular entities for medical use or the discovery of new photovoltaic or thermoelectric compounds, has slowed in recent years. Some attribute this to a ‘fishing out’ trend whereby new investigations in mature scientific areas yield diminishing marginal returns. But these diminishing returns may equally result from field-level dependencies associated with specialization as from the limits of nature (Jones, 2009; Jones & Weinberg, 2011). Recent work suggests that as the number of connections between members of a field expands, the number of topics they collectively explore contracts (Li et al., 2020; Wu et al., 2019). This is more directly demonstrated by the prevalence of high-profile abduction in science. Science may advance through deduction, whereby discoveries logically follow from axioms and prior findings, induction, whereby novel observations yield generalizations, or by abduction, where deduced theoretical expectations are violated, and the surprise inspires new theory (Bellucci & Pietarinen, 2020; Peirce, 1960). In recent studies that predict the contents and impact of new papers in the biological and physical sciences, scientific surprises, whereby unexpected components are brought together to forge new research designs and scientific claims, disproportionately attract scientific attention by garnering outsized citations and awards (Shi & Evans, 2019; Uzzi et al., 2013). Moreover, investigations that make the most surprising discoveries tend to involve surprising combinations of experience, not necessarily within team members or even the team as a whole, but across the team and its audience as ideas from one area of science are brought to bear on questions, challenges, and problems in a distant other. These findings align with others that suggest the importance of diversity in discipline (Shi & Evans, 2019; Uzzi et al., 2013), time (Mukherjee et al., 2017; Wu et al., 2019), and even personal identity (Hofstra et al., 2020) for increasing productive novelty and accelerating discovery in science. 

In recent work, we have sought to demonstrate and build on these principles by designing algorithms that take into account the diversity and distribution of scientists in order to expand and improve scientific prediction. In our analysis of data on discoveries over time, we have shown that discoveries made by human scientists are largely predictable from coauthorship networks, suggesting that the path of scientific progress is heavily constrained by the communication of past experiences. At the level of scientific systems, this further suggests how often science exploits prior knowledge at the expense of exploring new possibilities.

Our own experiments show that including the distribution of experts in our attempts to predict future discoveries dramatically increases the precision of our predictions: by an average of 100% in predicting chemical compounds with valuable energy-related properties in materials science (e.g., thermoelectricity, ferroelectricity, photovoltaic capacity), and by up to 280% in predicting the relevance of drugs to human diseases, including COVID-19 (Sourati & Evans, 2021); see Figure 3 (top). For the first three energy-related properties, predictions are made from among more than 100,000 inorganic compounds, and for human diseases the candidates consist of about 4,000 approved drugs. Human diseases for which we forecast drug efficacy include epilepsy, asthma, and more than a hundred others. The left bar in each pair is the result of an artificial intelligence (AI) model based on content alone, precisely replicated from Tshitoyan et al. (2019) and extended to human disease. The right bar represents predictions using the same algorithm, but using content plus context—data plus metadata. These manifest an approximately 100% increase in precision at predicting future published discoveries over models based on content alone (Sourati & Evans, 2021). These results were obtained by comparing our expert-aware discovery prediction with predictions that relied exclusively on published contents via natural language processing, a common approach to knowledge mining among domain scientists.

By ignoring dependencies traced in metadata, scientists may unintentionally forgo exploration of the universe of knowledge and deprive science of important, groundbreaking signals present in the dark regions between disciplines. To illuminate the existence of such signals, we here analyze data from a single field: the assessment of thermoelectricity, the property wherein temperature differences generate electricity, in inorganic materials. We first form a hypergraph whose hypernodes are paper authors, inorganic materials, and thermoelectricity keywords, and whose hyper-edges, connecting these nodes, are published papers (see Figure 1). Two nodes connect if they co-occur in at least one published investigation. We use this hypergraph to measure the cognitive availability of a range of inorganic compounds with respect to more than a hundred valuable industrial or medical properties by computing their shortest-path distance (SP-d) to property-related keywords. A higher SP-d thus corresponds to a lower likelihood that a compound will be investigated with respect to that property. We combine this signal with the semantic similarity of materials to targeted properties and develop an AI model tuned to generate hypotheses about materials that appear alien to scientists and yet are scientifically promising. Such a model does not compete with human discoveries but complements them, using content and context to avoid the predictions human scientists are likely to make.
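Continuing the toy bipartite encoding from Section 2, a shortest-path distance of this kind can be computed directly. The materials, authors, and keyword below are illustrative placeholders, not drawn from the actual corpus.

```python
import networkx as nx

# Toy mixed hypergraph: papers connect authors, materials, and a
# property keyword. All names are hypothetical.
papers = {
    "P1": ["author:A1", "material:Bi2Te3", "keyword:thermoelectric"],
    "P2": ["author:A1", "author:A2", "material:PbTe"],
    "P3": ["author:A2", "material:SnSe"],
}

H = nx.Graph()
for paper, entities in papers.items():
    for e in entities:
        H.add_edge(paper, e)

def sp_distance(graph, material, keyword):
    """Shortest-path distance (SP-d) between a material and a property
    keyword; larger values proxy lower cognitive availability."""
    try:
        return nx.shortest_path_length(graph, material, keyword)
    except nx.NetworkXNoPath:
        return float("inf")  # never co-investigated, even indirectly

for m in ["material:Bi2Te3", "material:PbTe", "material:SnSe"]:
    print(m, sp_distance(H, m, "keyword:thermoelectric"))  # 2, 4, 6
```

Here Bi2Te3 has been studied directly for thermoelectricity (distance 2), while SnSe is reachable only through two intermediary authors (distance 6), making it less cognitively available and hence a better candidate for a complementary, alien hypothesis.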

Figure 3 (bottom) contrasts materials discoveries made by scientists with respect to 13 valuable energy- and disease-related properties with discoveries made by our AI model. Note the minimal overlap between human and complementary AI discoveries. We evaluated these predictions with first-principles equations or data-driven simulations intended to theoretically assess the properties in question: for example, the power factor (PF) of compounds for simulating thermoelectricity (Dehkordi et al., 2015; Ricci et al., 2017; Tshitoyan et al., 2019), or the protein–protein interaction network for simulating therapeutic relevance (Gysi et al., 2021). The second striking property of these Venn diagrams is the frequency with which complementary AI discoveries post higher scores on theory-inspired measures of the property in question than published discoveries, suggesting their scientific promise.

Figure 3. Improving discovery with content and context. Top: Precision scores associated with the prediction of future material discoveries using artificial intelligence (AI) models based on content alone, with the left three replicated from Tshitoyan et al. (2019), compared with those based on content plus context—data plus metadata—which manifest a 100% increase (Sourati & Evans, 2021). Bottom: Discoveries made by scientists versus discoveries made by an AI model built with content and context to complement human discoveries. Numbers shown inside the regions indicate their size, and the intensity of each region represents average first-principles scores precomputed for the associated materials. These scores indicate the potential of any given material to possess a targeted property. We used the following scores: power factor (PF) for thermoelectricity (Ricci et al., 2017; Tshitoyan et al., 2019), spontaneous polarization for ferroelectricity (Smidt et al., 2020), and cosine similarity of drugs and diseases in the protein–protein interaction network for human diseases (Gysi et al., 2021).

5. Conclusion

We have shown how contextual metadata, tracing how science is produced, can be merged with textual data on scientific claims themselves to construct scientific observatories that improve scientific certainty and extend scientific discovery. Information from such observatories would enable us to better guide science policy and build portfolios of supported research that balance our varied societal commitments to scientific advance, economic prosperity, and diverse participation.

Consider natural areas ripe for monitoring and strategic improvement, such as the health and diversity of our scientific workforce and institutions. Recent work demonstrates that although women inventors are less prevalent than men in biomedicine, they are more likely to produce inventions relevant to women’s health (Koning et al., 2021). This underscores the importance of tracking and supporting diversity on a larger scale, and at every stage of the innovative process, to broaden participation and extend the fruits of innovation. Such observatories could promote the diversity of identities in science (Hofstra et al., 2020) and other dimensions of investigation critical to enhancing scientific certainty and discovery, including diverse team sizes (Wu et al., 2019), inter- and cross-disciplinarity (Shi & Evans, 2019), and distributed collaborations across far-flung institutional networks (Belikov et al., 2020; Danchev et al., 2019).

Scientific institutions such as peer review are also critical for the ongoing vitality of science, and yet recent studies demonstrate the degree to which older scientists throttle the contributions of young scientists, outsiders, and new directions in established areas (Zivin et al., 2019). Peer review is also influenced by the proximity of scientists to the work they evaluate, both for its epistemic appeal (Teplitskiy et al., 2018) and its potential to compete with their own (Boudreau et al., 2016). Data observatories could diagnose and balance peer review, compensating for bias and producing more science that we value.

As illustrated in the previous section, observatories built from scientific data plus metadata would also improve search for scientists, enabling them to better navigate the scientific frontier in directions that do not follow the crowd but open fruitful new opportunities for others to follow. All of these possibilities hinge on the critical linkage between scientific content and context—data and metadata. By not only logging and analyzing these data in one-off projects but building computational observatories for real-time monitoring of scientific diversity and discovery, we could harness this data-driven opportunity to enhance sustainable innovation across science. Nevertheless, we acknowledge that the influences of the factors assessed by our proposed scientific observatories remain hypotheses, and experiments will need to be performed to elucidate their causal efficacy before they can confidently guide science policy.

We note that the principles we advocate here could be valuable not only for science and innovation but for news and information more broadly. Consider the importance of content linked with context in fact-checking and for observatories tracing the spread of (mis)information. Indeed, when claims are ordered chronologically and linked to their sources in a decisive manner, suspicions of information manipulation are quickly adjudicated.

Alongside data on scientific content and context, we believe that machine learning and artificial intelligence models will play an increasingly important role in accelerating scientific advance, but only if we structure scientific data in a way that highlights the values of science and preserves key signals regarding how science is produced. This will reveal cognitive biases and historical divisions that can be arbitraged to make outsized contributions. Moreover, this will require representing data on scientific content and context that retains its complex interconnections as a high-dimensional graph, profitably through graph databases or improved datastores to come. With these rich representations, we can grow models that race alongside scientists toward advance—drawing upon publications and contributing insights that enhance the collective reach and promise of science and the technology that flows from it (Sourati & Evans, 2021). Furthermore, using reinforcement learning approaches atop this data, we can explore and propose optimized science policies to guide scientific funding and organization, and learn novel neural architectures optimized to complement rather than substitute human scientific expertise and institutions for societal benefit.

We acknowledge that the application of such algorithms for scientific funding and search could have unintended consequences. If built on prior research patterns, they could amplify biases already present in the research community without natural correction. Nevertheless, bias is often easier to remedy in an algorithm than in a complex human institution (Mullainathan, 2019) and we recommend oversight by human scientists, policymakers, and other algorithms designed to check and balance those deployed.

In closing, we underscore the critical importance of continued political and legal support for open, linked scientific data to facilitate the widespread benefits we detail above. These include support for platforms that host and distribute preprints, funder mandates requiring the deposition of papers into open repositories, and the promotion of publisher open access initiatives, especially in biomedicine where profits have made them least common (Day et al., 2020; Gorelick & Li, 2021) but where the returns to society could be greatest. Such efforts would ensure the continuing viability of scientific observatories and the realization of their promise for science and society.


Disclosure Statement

Jamshid Sourati, Alexander Belikov, and James Evans have no financial or non-financial disclosures to share for this article.


References

Azoulay, P., Graff-Zivin, J., Uzzi, B., Wang, D., Williams, H., Evans, J. A., Jin, G. Z., Lu, S. F., Jones, B. F., Börner, K., Lakhani, K. R., Boudreau, K. J., & Guinan, E. C. (2018). Toward a more scientific science. Science, 361(6408), 1194–1197. https://doi.org/10.1126/science.aav2484

Azoulay, P., & Li, D. (2020). Scientific grant funding (Working Paper No. 26889). National Bureau of Economic Research. https://doi.org/10.3386/w26889

Belikov, A. V., Rzhetsky, A., & Evans, J. (2020). Detecting signal from science: The structure of research communities and prior knowledge improves prediction of genetic regulatory experiments. arXiv. https://doi.org/10.48550/arXiv.2008.09985

Bellucci, F., & Pietarinen, A.-V. (2020). Peirce on the justification of abduction. Studies in History and Philosophy of Science, 84, 12–19. https://doi.org/10.1016/j.shpsa.2020.04.003

Bergstrom, C. T., West, J. D., & Wiseman, M. A. (2008). The Eigenfactor™ metrics. The Journal of Neuroscience, 28(45), 11433–11434. https://doi.org/10.1523/JNEUROSCI.0003-08.2008

Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2011). Introduction to meta-analysis. John Wiley & Sons. https://doi.org/10.1002/9780470743386

Boudreau, K. J., Guinan, E. C., Lakhani, K. R., & Riedl, C. (2016). Looking across and looking beyond the knowledge frontier: Intellectual distance, novelty, and resource allocation in science. Management Science, 62(10), 2765–2783. https://doi.org/10.1287/mnsc.2015.2285

Danchev, V., Rzhetsky, A., & Evans, J. A. (2019). Centralized scientific communities are less likely to generate replicable results. eLife, 8, Article e43094. https://doi.org/10.7554/eLife.43094

Day, S., Rennie, S., Luo, D., & Tucker, J. D. (2020). Open to the public: Paywalls and the public rationale for open access medical research publishing. Research Involvement and Engagement, 6, Article 8. https://doi.org/10.1186/s40900-020-0182-y

de Solla Price, D. J. (1963). Little science, big science. Columbia University Press. https://doi.org/10.7312/pric91844

Dehkordi, A. M., Zebarjadi, M., He, J., & Tritt, T. M. (2015). Thermoelectric power factor: Enhancement mechanisms and strategies for higher performance thermoelectric materials. Materials Science & Engineering: R: Reports, 97, 1–22. https://doi.org/10.1016/J.MSER.2015.08.001

Evans, J. A., & Reimer, J. (2009). Open access and global participation in science. Science, 323(5917), 1025. https://doi.org/10.1126/science.1154562

Evans, J., & Rzhetsky, A. (2010). Machine science. Science, 329(5990), 399–400. https://doi.org/10.1126/science.1189416

Fortunato, S., Bergstrom, C. T., Börner, K., Evans, J. A., Helbing, D., Milojević, S., Petersen, A. M., Radicchi, F., Sinatra, R., Uzzi, B., Vespignani, A., Waltman, L., Wang, D., & Barabási, A.-L. (2018). Science of science. Science, 359(6379), Article eaao0185. https://doi.org/10.1126/science.aao0185

Gazni, A., Sugimoto, C. R., & Didegah, F. (2012). Mapping world scientific collaboration: Authors, institutions, and countries. Journal of the American Society for Information Science and Technology, 63(2), 323–335. https://doi.org/10.1002/asi.21688

Gorelick, D., & Li, Y. (2021). Reducing open access publication costs for biomedical researchers in the U.S.A. MIT Science Policy Review, 2, 91–99. https://doi.org/10.38105/spr.4nu1qfjf3t

Gysi, D. M., Valle, Í. D., Zitnik, M., Ameli, A., Gan, X., Varol, O., Sanchez, H., Baron, R. M., Ghiassian, D., Loscalzo, J., & Barabási, A.-L. (2021). Network medicine framework for identifying drug repurposing opportunities for COVID-19. Proceedings of the National Academy of Sciences, 118(19), Article e2025581118. https://doi.org/10.1073/pnas.2025581118

Henrion, M., & Fischhoff, B. (1986). Assessing uncertainty in physical constants. American Journal of Physics, 54(9), 791. https://doi.org/10.1119/1.14447

Hofstra, B., Kulkarni, V. V., Galvez, S. M.-N., He, B., Jurafsky, D., & McFarland, D. A. (2020). The diversity—Innovation paradox in science. Proceedings of the National Academy of Sciences, 117(17), 9284–9291. https://doi.org/10.1073/pnas.1915378117

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), Article e124. https://doi.org/10.1371/journal.pmed.0020124

Jacob, B. A., & Lefgren, L. (2011). The impact of research grant funding on scientific productivity. Journal of Public Economics, 95(9–10), 1168–1177. https://doi.org/10.1016/j.jpubeco.2011.05.005

Jadad, A. R., Cook, D. J., Jones, A., Klassen, T. P., Tugwell, P., Moher, M., & Moher, D. (1998). Methodology and reports of systematic reviews and meta-analyses: A comparison of Cochrane reviews with articles published in paper-based journals. The Journal of the American Medical Association, 280(3), 278–280. https://doi.org/10.1001/jama.280.3.278

Jones, B. F. (2009). The burden of knowledge and the “Death of the Renaissance Man”: Is innovation getting harder? The Review of Economic Studies, 76(1), 283–317. http://doi.org/10.3386/w11360

Jones, B. F., & Summers, L. H. (2020). A calculation of the social returns to innovation (Working Paper No. 27863). National Bureau of Economic Research. https://doi.org/10.3386/w27863

Jones, B. F., & Weinberg, B. A. (2011). Age dynamics in scientific creativity. Proceedings of the National Academy of Sciences of the United States of America, 108(47), 18910–18914. https://doi.org/10.1073/pnas.1102895108

Knorr-Cetina, K. D. (2013). The manufacture of knowledge: An essay on the constructivist and contextual nature of science. Elsevier.

Koning, R., Samila, S., & Ferguson, J.-P. (2021). Who do we invent for? Patents by women focus more on women’s health, but few women get to invent. Science, 372(6548), 1345–1348. https://doi.org/10.1126/science.aba6990

Larivière, V., Desrochers, N., Macaluso, B., Mongeon, P., Paul-Hus, A., & Sugimoto, C. R. (2016). Contributorship and division of labor in knowledge production. Social Studies of Science, 46(3), 417–435. https://doi.org/10.1177/0306312716650046

Latour, B. (1987). Science in action: How to follow scientists and engineers through society. Harvard University Press.

Latour, B., & Woolgar, S. (1979). Laboratory life: The social construction of scientific facts. Sage.

Li, L., Wu, L., & Evans, J. (2020). Social connection induces cultural contraction: Evidence from hyperbolic embeddings of networks and text. Poetics, 78, Article 101428. https://doi.org/10.1016/j.poetic.2019.101428

Merton, R. K. (1957). Priorities in scientific discovery: A chapter in the sociology of science. American Sociological Review, 22(6), 635–659. https://doi.org/10.2307/2089193

Merton, R. K. (1973). The sociology of science: Theoretical and empirical investigations. University of Chicago Press.

Mukherjee, S., Romero, D. M., Jones, B., & Uzzi, B. (2017). The nearly universal link between the age of past knowledge and tomorrow’s breakthroughs in science and technology: The hotspot. Science Advances, 3(4), Article e1601315. https://doi.org/10.1126/sciadv.1601315

Mullainathan, S. (2019, December 6). Biased algorithms are easier to fix than biased people. The New York Times. https://www.nytimes.com/2019/12/06/business/algorithm-bias-fix.html

Nielsen, M. (2012). Reinventing discovery: The new era of networked science. Princeton University Press.

Partha, D., & David, P. A. (1994). Toward a new economics of science. Research Policy, 23(5), 487–521. https://doi.org/10.1016/0048-7333(94)01002-1

Peirce, C. S. (1960). Collected papers of Charles Sanders Peirce. Harvard University Press.

Ricci, F., Chen, W., Aydemir, U., Snyder, G. J., Rignanese, G.-M., Jain, A., & Hautier, G. (2017). An ab initio electronic transport database for inorganic materials. Scientific Data, 4, Article 170085. https://doi.org/10.1038/sdata.2017.85

Rzhetsky, A., Foster, J. G., Foster, I. T., & Evans, J. A. (2015). Choosing experiments to accelerate collective discovery. Proceedings of the National Academy of Sciences, 112(47), 14569–14574. https://doi.org/10.1073/pnas.1509757112

Rzhetsky, A., Iossifov, I., Loh, J. M., & White, K. P. (2006). Microparadigms: Chains of collective reasoning in publications about molecular interactions. Proceedings of the National Academy of Sciences of the United States of America, 103(13), 4940–4945. https://doi.org/10.1073/pnas.0600591103

Shi, F., & Evans, J. (2019). Science and technology advance through surprise. arXiv. https://doi.org/10.48550/arXiv.1910.09370

Smidt, T. E., Mack, S. A., Reyes-Lillo, S. E., Jain, A. & Neaton, J. B. (2020). An automatically curated first-principles database of ferroelectrics. Scientific Data, 7(1), Article 72. https://doi.org/10.1038/s41597-020-0407-9

Sourati, J., & Evans, J. (2021). Accelerating science with human versus alien artificial intelligences. arXiv. https://doi.org/10.48550/arXiv.2104.05188

Spangler, S., Wilkins, A. D., Bachman, B. J., Nagarajan, M., Dayaram, T., Haas, P., Regenbogen, A., Pickering, C., Comer, A., Myers, J., Stanoi, I., Kato, L., Lelescu, A., Labrie, J. J., Parikh, N., Lisewski, A. M., Donehower, L., Chen, Y., & Lichtarge, O. (2014). Automated hypothesis generation based on mining scientific literature. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1877–1886). Association for Computing Machinery. https://doi.org/10.1145/2623330.2623667

Teplitskiy, M., Acuna, D., Elamrani-Raoult, A., Körding, K., & Evans, J. (2018). The sociology of scientific validity: How professional networks shape judgement in peer review. Research Policy, 47(9), 1825–1841. https://doi.org/10.1016/J.RESPOL.2018.06.014

Tshitoyan, V., Dagdelen, J., Weston, L., Dunn, A., Rong, Z., Kononova, O., Persson, K. A., Ceder, G., & Jain, A. (2019). Unsupervised word embeddings capture latent knowledge from materials science literature. Nature, 571(7763), 95–98. https://doi.org/10.1038/s41586-019-1335-8

Uzzi, B., Mukherjee, S., Stringer, M., & Jones, B. (2013). Atypical combinations and scientific impact. Science, 342(6157), 468–472. https://doi.org/10.1126/science.1240474

Wu, L., Wang, D., & Evans, J. A. (2019). Large teams develop and small teams disrupt science and technology. Nature, 566(7744), 378–382. https://doi.org/10.1038/s41586-019-0941-9

Zivin, J. G., Azoulay, P., & Fons-Rosen, C. (2019). Does science advance one funeral at a time? The American Economic Review, 109(8), 2889–2920. https://doi.org/10.1257/aer.20161574


©2022 Jamshid Sourati, Alexander Belikov, and James Evans. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
