
Data Fragmentation and Data Linking: A Threat and an Opportunity

Published on Apr 28, 2022

Data is having ‘a moment,’ and an exciting one at that! Advances in technology and data science analytic techniques that enable processing staggering amounts of data have reenergized efforts to promote the use of evidence in decision-making. This is particularly true regarding decisions about investing public funds (Foundations for Evidence-Based Policymaking Act of 2018, 2019). The articles in this special theme on the Value of Science underscore a necessary condition for realizing the opportunities inherent in this ‘moment’: they make a compelling case for linking data to create strategic assets that may lead, in the long run, to an accessible data ecosystem that facilitates the generation of needed evidence. In so doing, the authors tacitly identify the current fragmentation of data in assets held by universities, government agencies, foundations, and other organizations not only as a dated approach to managing data, but also as a great threat to using data to benefit society.

Thankfully, the authors chart several paths to tackle this threat. These range from (a) researchers leveraging existing publications to construct comprehensive analytic files by combining the content of the papers (the ‘data’) with contextual information (the ‘meta-data,’ such as authorship) (Sourati et al., 2022) to (b) governments building a centralized and integrated data infrastructure as done in New Zealand (Jones et al., 2022) to (c) voluntary networks of individuals and institutions (governments included) creating ‘data mosaics’ of interconnected data files (Chang et al., 2022). Although varying widely, these alternative paths to solving the data fragmentation problem do not imply a choice or decision to be made about the best or optimal way forward. On the contrary, these solutions may coexist and contribute to generating a robust and flexible data ecosystem that can help researchers respond to current and future information needs. This is an opportunity not to be missed. Several real and potential barriers stand in the way and will require further attention. I highlight three: representativeness, incentives, and legal constraints.

The promise of leveraging the massive amounts of data that are generated daily for analytic purposes depends on constructing data files that are representative of the underlying population and, therefore, generalize to that population. For example, solutions that depend on individuals’, organizations’, or networks’ voluntary participation may result in rich data that do not generalize to larger or target populations. This is not a hypothetical threat, but a real and often realized one.

This threat is strongly related to incentives to participate in a data ecosystem, either to create it or to help maintain it. This topic is briefly mentioned in several articles (Chang et al., 2022; Lane et al., 2022; Jones et al., 2022). A great deal more thought needs to be devoted to identifying existing incentives that could be leveraged and creating new ones where gaps exist. Legislative, regulatory, use, and other requirements could help incentivize participants in the data ecosystem, including collaborations with the private sector, but require strong enforcement mechanisms and political will.

At the same time, legal constraints (often created for very good reasons, such as protecting individual privacy) can and do stand in the way of data sharing, which is required for data linking. This is not an insurmountable barrier, as shown by the articles in this issue, but one that requires a shift in the way we approach solutions. The question ‘how can we make this happen?’ needs to be the driving force behind all conversations, even legal ones, rather than ‘why we cannot make this happen.’ This type of shift in mindset is not easy to effect, but will be necessary for any transformative, large-scale solution to become a reality.

Attention to these and other barriers to data integration will be essential to pursue the wide range of evidence-building activities envisioned by the White House Office of Management and Budget (Vought, 2019) in support of multi-trillion-dollar federal government budgets.

Disclosure Statement

Clemencia Cosentino has no financial or non-financial disclosures to share for this article.

References


Chang, W.-Y., Garner, M., Basner, J., Weinberg, B., & Owen-Smith, J. (2022). A linked data mosaic for policy-relevant research on science and innovation: Value, transparency, rigor, and community. Harvard Data Science Review, 4(2).

Foundations for Evidence-Based Policymaking Act of 2018, Pub. L. No. 115-435, 132 Stat. 5529 (2019).

Jones, C., McDowell, A., Galvin, V., & Adams, D. (2022). Building on Aotearoa New Zealand’s integrated data infrastructure. Harvard Data Science Review, 4(2).

Lane, J., Gimeno, E., Levitskaya, E., Zhang, Z., & Zigoni, A. (2022). Data inventories for the modern age? Using data science to open government data. Harvard Data Science Review, 4(2).

Sourati, J., Belikov, A., & Evans, J. (2022). Data on how science is made can make science better. Harvard Data Science Review, 4(2).

Vought, R. T. (2019, July 10). Memorandum for heads of executive departments and agencies: Phase 1 implementation of the Foundations for Evidence-Based Policymaking Act of 2018: Learning agendas, personnel, and planning guidance. White House Office of Management and Budget.

©2022 Clemencia Cosentino. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
