Spinning Up a Data Science Initiative at Harvard

Published on Oct 27, 2022

On the occasion of its fifth anniversary, the leadership of the Harvard Data Science Initiative look back to embrace and celebrate their foundational principles, touch on particular successes, and reflect on how the work has been shaped by the exigencies of a complex academic environment and the societal tumult of the past 5 years.

Keywords: impact, data, university, interdisciplinary, Harvard

In October 2014, Harvard University’s Provost, Alan Garber, tasked a committee of faculty from around the university to explore the challenges and opportunities in data science at Harvard, and the appetite for a unified vision for the field. In April 2015, this committee completed a strategic planning process, outlining a vision for data science at Harvard that concluded that the university should take a leadership role. The committee’s work laid the foundation for what would become, in March 2017, the Harvard Data Science Initiative (HDSI). 

We look back now on the occasion of the initiative’s fifth anniversary to embrace and celebrate our foundational principles, touch on particular successes, and reflect on how the work of the initiative has been shaped by the exigencies of a complex academic environment and the societal tumult of the past 5 years. 

From the beginning, the HDSI has emphasized interdisciplinarity through our support of research and education. Where initial programming focused particularly on seeding activity that would span Harvard’s disciplinary breadth and build cohesion across this decentralized institution, more recent efforts have seen an intensified focus on work that can accelerate progress to new scientific results and toward impact. By acceleration, we mean enabling access to new methodologies, data, collaborators, and computing, so that researchers can unlock new research findings faster and better. By impact, we mean casting the methodological inquiries and their applications with the ultimate goal of producing actionable findings, that is, findings that can be directly translated into positive impact to society, governmental agencies, health prevention strategies, climate policies, medicine, energy, education, and the like.

Data science is a constantly changing and inherently outward-facing field, and one that is well situated in a university of Harvard’s breadth. There is a necessary and vital interaction between data science—its methodologies and application domains—and its ultimate impact, whether to science, individuals, or communities. Methodologies developed in the context of one domain can unlock new paths of inquiry when applied to a different problem. For example, a methodology designed to assess the causal effect of a 2010 earthquake on levels of post-traumatic stress disorder among the population of Chile can also assess the causal effect of exposure to air pollution on all causes of mortality. 

The wider context of an application domain—the type and quality of available data, the feasibility and socioeconomic or political environment of its implementation—can also guide methodological research. For example, to better track and control malaria outbreaks, researchers have sought to identify the genetic ‘relatedness’ of malaria parasites (recent shared ancestry), thereby developing a clearer understanding of the transmission of malaria. Careful statistical work is required in order to make use of globally diverse data sets collected by different researchers under different conditions. Recent work provides a rigorous analysis of the reliability of, and requirements for, accurate inference in regard to relatedness of parasite populations.

The narrower context of a methodology—the mathematical derivations, the probabilistic modeling, and the computational scaling—will in turn define the questions that can be answered in a domain. As an example of improved computational scaling, work completed in Stratos Idreos’s lab with the support of the HDSI has extended the Pareto frontier for the accuracy and training cost in working with deep machine learning (the “MotherNets” approach provides up to 35% faster training with reduced error rates by capturing the shared structure across an ensemble of networks before training).

While data science is quantitative, the nature of its relationship to broader societal trends is not unlike that of literature or the law, evolving to reflect and respond to the issues of our day and the needs of our world. As a cross-university initiative, the HDSI is charged with creating opportunities for this symbiosis between methodology and domain to flourish in a university where exciting, data-driven work can be found across each of our 12 schools. In 2017, the HDSI made it a priority to connect researchers across the university to share ideas, collaborate, and nurture the next generation of data scientists.

To do this, we surfaced five thematic areas that formed a framework for making these connections: Personalized Health, Evidence-Based Policy, Networks and Markets, Data-Driven Scientific Discovery, and Methodology. We established core programs to activate cross-disciplinary activity: a Competitive Research Fund to support faculty-led projects, a flagship Postdoctoral Fellowship Program to train independent, early-career investigators, a corporate membership program to foster two-way conversations with industry, the Harvard Data Science Review, and various seminars, workshops, and convening events, including our annual conference to showcase cutting-edge research and developments in education and industry.

  • A first cohort of Postdoctoral Fellows arrived on campus in Fall 2017, and included scholars who had completed doctoral work in disciplines as varied as marine biology, human development, and education. We pair each Fellow with at least two faculty mentors and give them a mandate to be independent yet collaborative scholars, and we strive to create a community of researchers who together build a connected and interdependent fellowship. Today we have eight Fellows on campus, 11 alumni who have gone into academia, and four who have pursued careers in industry.

  • In 2018, the HDSI welcomed our first Corporate Members. Companies including Elsevier, Amazon, and McKinsey established relationships with faculty working on topics of mutual interest. Industry researchers have participated in workshops on issues including gender in academia, the future of the workforce, and impact and measurement in the context of meeting Environmental, Social, and Governance (ESG) goals. Corporate Members have worked with each other and with faculty in a precompetitive space on cross-cutting issues that affect diverse sectors of industry. Today, we have seven members. 

  • And, in 2019, the HDSI partnered with MIT Press to launch this very Harvard Data Science Review, an open access peer-reviewed journal that aims to serve as a centralized and authoritative outlet for the growing field of data science. In its first four volumes, the journal has tackled myriad, wide-ranging issues, from differential privacy in the U.S. Census to authorship attribution in Lennon-McCartney songs. Together with the initiative’s public seminars and an annual, multi-day conference, the journal serves to amplify the reach of data science.

These initial activities, while interdisciplinary in approach, were focused largely on questions generated within disciplines (for example, improving the treatment of a disease, understanding economic trends through the analysis of historical data, mapping the causes of climate change). At the same time, the emergent social crisis of COVID-19 and the ongoing challenge of systemic racism in America have highlighted the need for a broader conceptualization of data science interdisciplinarity that is less siloed, more scalable, and able to respond to questions and challenges that arise from multiple disciplines. In their response to these crises, Harvard researchers demonstrated how work that was already underway in key methodological areas (for example, causal inference, control of false discovery, uncertainty quantification, visualization, scalable and robust inference) could be applied to emergent issues demanding an immediate response.

As COVID-19 began to spread widely in early 2020, the HDSI directed funding to support six faculty-led research projects that explored public trust in science in the context of the pandemic. In parallel, groups around campus pivoted entire research programs to conduct data-driven projects that have allowed us to better understand the transmission of the disease and slow its spread. HDSI-funded researchers worked on ensuring privacy in COVID-19 epidemiological mobility data sets, characterized the role of socioeconomic status in COVID-19 in Chile, developed new methods to combat COVID-19 misinformation, examined trust in COVID-19 vaccines, estimated the effects of wildfire exposure in exacerbating the severity of COVID-19, and developed methods for using explainable artificial intelligence to promote public trust in science, tackling the pandemic from myriad angles. We have since spun up a multiyear effort on Trust in Science under the leadership of Sheila Jasanoff.

In the same year, and in the aftermath of the murder by Minneapolis police of George Floyd, an unarmed Black man, the HDSI announced a series of actions it would take to address the structures in the field that reinforce bias and systemic racism. The initiative launched the Bias-Squared Program that “supports research, features speakers, and engages the data science community towards using data science to uncover bias, and understanding and combating the use of badly-conceived data science that can reinforce bias and inequity.” Under the program’s umbrella, we hosted seven keynote speakers, awarded $250,000 to faculty-led research projects, and expanded our 2021 Summer Graduate Fellowship Program to support a Fellow engaged in work that combats systemic racism. 

Both efforts, those directed toward combating COVID-19 and those directed toward addressing systemic racism, demanded activity across disciplines, not only to create knowledge and develop solutions but to implement those solutions effectively. The data science community needed to embrace a full spectrum of activity to quickly amplify the impact of research, moving almost in real time from data collection and harmonization to data analysis, solution development, and communication with policymakers. In some cases, the same researchers who were tackling key questions related to COVID-19 were able to apply their new knowledge to aspects of systemic racism. In others, the research questions themselves overlapped, as investigators discovered the disparities in the health effects of COVID-19 on Black, Indigenous, and People of Color (BIPOC) communities.

In their scope and immediacy, these crises required researchers to embrace a more complex interdisciplinarity, where methodologists, experts across domains, and policy experts worked together concurrently to tackle far-reaching challenges. Clinicians and public health practitioners might be, on the one hand, on the front lines treating COVID-19 and collecting data on clinical outcomes and, on the other, developing and deploying interventions using the insights driven by that data. The immediacy of the endeavor required a reworking and acceleration of research-to-application workflows that suggests a paradigm for future work (see, for example, here).

This is what impact looks like, with conversations across domains, between methodologists and domain experts, with research progress accelerated through the forces of data science as engineering, computation, and methodology, and through the data itself, including considerations of its quantity, complexity, and quality. As data continue to proliferate, they present an ever more complex lens through which to understand the world. 

Data science as engineering is exemplified by efforts to improve the forecasting and nowcasting of disease outbreaks and epidemics through data engineering. In recent work led by Caroline Buckee and Mauricio Santillana and supported by the HDSI, researchers have integrated data from multiple sources, including mobile phone data for nearly 15 million anonymous subscribers in Kenya, to estimate how seasonal travel patterns relate to rubella incidence. Statistical forecasts can be compared and improved through the integration of epidemiological data and Google data for tracking of influenza epidemics; and new mathematical models reveal how choice of intervention is influenced by the natural history of the infectious disease, its inherent transmissibility, and the intervention feasibility in the particular health care setting.

Data science as computation can drive impact by enabling new methods to be used at a new scale. This is exemplified by the work of Natesh Pillai, supported in part by a grant from the HDSI, who applies Monte Carlo simulation and Gaussian process modeling to historical and current temperature records, allowing a reliable quantification of uncertainty of temperature measurements, compensating for poor data quality, and ultimately allowing a more accurate assessment of climate change. 

Data science as methodology is exemplified through the work of Harvard physicist Matthew Schwartz, who in making use of new methods of machine learning has noted that “Finding new particles in modern experiments is like finding a particular piece of hay in a haystack. Luckily, hay-in-a-haystack problems are exactly what modern machine learning excels at solving.” Schwartz applies machine learning to persistent, fundamental questions in particle physics, such as ‘How many particles are there?’, and to tasks such as identifying the one-in-a-billion creation of a Higgs boson from a proton collision at the Large Hadron Collider at CERN. Deep learning can respond to the unique challenges of particle physics, enabling scientists to work directly with raw, minimally processed data, rather than high-level physically motivated variables. We have also launched a multiyear effort, supported by the Sloan Foundation, on causality, reflecting Harvard’s traditional strengths and the imperative presented by the challenges of working with complex, typically observational data.

As the Harvard Data Science Initiative enters its sixth year, we remain committed to working across Harvard to unite leading computer scientists, statisticians, and domain experts to derive meaningful and actionable insights that shape the new science of data. Today, we also double down on impact as a unifying lens through which to define and understand our work. This requires that we view data science not only as engineering, computation, and methodology but also in the full complexity of its societal context, with all the challenges and demands, whether technical or otherwise, that this presents. 

Where do we go from here? What do we want to accomplish in the next 10 years? The world is facing and will continue to face many crises: the pandemic, extreme weather events, mass shootings, misinformation, discrimination, wars, violation of human rights, energy crises, inequities in digital literacy, economic shocks, and many more that we cannot anticipate. How do we respond quickly and effectively to these crises? How do we turn crisis into opportunity? Especially, how do we ensure that data and data science are a force for good, enabling a healthier, more productive, and more fair world? On the one hand, we have an enormous opportunity: data are everywhere, computation at scale is available, and scientists feel a sense of urgency. On the other hand, we face many challenges: data are siloed and not harmonized, state-of-the-art methodologies, even if already developed, often are not easily translatable or equipped to address the right questions, and access to data science and understanding of its methods is uneven. Our aspiration at the HDSI is to create a new movement based on the following foundational premise: engineering, computation, and methodology are the pillars of data science, accelerating scientific understanding and translating beneficial impact to the world.

Disclosure Statement

Francesca Dominici and Elizabeth Langdon-Gray have no financial or non-financial disclosures to share for this article. David Parkes is currently on sabbatical at DeepMind for the 2022–2023 academic year.

©2022 Francesca Dominici, Elizabeth Langdon-Gray, and David C. Parkes. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
