
Is Causal Inference Compatible With Frictionless Reproducibility?

Published on May 24, 2024

I found David Donoho’s (2024) excellent article on “Data Science at the Singularity” to be both insightful and thought-provoking. Donoho’s central message is that the success of empirical machine learning can be attributed to the triad of data science principles Donoho calls “frictionless reproducibility”: shared data, public code re-execution, and community-wide challenges.

What I found most thought-provoking was Donoho’s insistence that these three principles can be applied broadly across fields. “Most importantly,” Donoho writes, “where the behaviors are not yet dependably present, there is no essential obstacle to turning them into everyday habits, except the interest and diligence of participating researchers.” The article points to optimization and spectroscopy as fields outside of machine learning that have begun to embrace these principles.

This view intrigued me. My background is in machine learning, a field that has reaped the benefits of frictionless reproducibility. More recently, I have been working on adapting methods from machine learning to study questions in applied economics, which has not embraced the principles of frictionless reproducibility to the extent that machine learning has.

After reading Donoho’s article, I began to ask myself: Why has one field embraced frictionless reproducibility and not the other? Is it a matter of time and interest, as Donoho’s article suggests? Or is there a fundamental incompatibility?

Here, I will use Donoho’s framework to assess these questions. To help guide my thinking, I will focus on one area of applied economics: causal inference. My goal is not to arrive at definitive answers, but rather to assess where causal inference is more and less compatible with frictionless reproducibility.

Frictionless Reproducibility and Empirical Causal Inference

First, a brief overview of causal inference. Broadly speaking, causal inference studies how to draw conclusions about causal relationships from data. For example, consider a data set containing information about participants who had the option to enroll in a job training program. A causal inference problem is to study whether enrolling in the job training program increases wage (or whether the data can be used to answer such a question in the first place).

Causal inferences are made by imposing assumptions about data and using statistical techniques to estimate effects. The gold standard for estimating causal relationships is to collect data from randomized controlled trials (RCTs), where participants are randomly assigned to interventions. Since RCTs are not always feasible due to cost or ethical concerns, much of causal inference focuses on making inferences from observational studies, requiring stricter assumptions and more advanced methods to make valid inferences.
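As a minimal sketch of why observational data call for extra assumptions, the simulation below compares a naive difference in means under random assignment and under self-selection. Everything in it is hypothetical and chosen only for illustration: the `motivation` confounder, the wage equation, and the effect size of 2.0 are assumptions, not quantities from any real job training study.

```python
import random
from statistics import mean

TRUE_EFFECT = 2.0  # hypothetical wage gain from training, chosen for illustration


def simulate(n, randomized, seed=0):
    """Simulate (enrolled, wage) pairs for n hypothetical participants."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        motivation = rng.gauss(0.0, 1.0)  # unobserved confounder (assumed)
        if randomized:
            # RCT: a coin flip decides enrollment, independent of motivation.
            enrolled = rng.random() < 0.5
        else:
            # Observational: more motivated people enroll more often.
            enrolled = motivation + rng.gauss(0.0, 1.0) > 0.0
        wage = 10.0 + 3.0 * motivation + TRUE_EFFECT * enrolled + rng.gauss(0.0, 1.0)
        rows.append((enrolled, wage))
    return rows


def difference_in_means(rows):
    """Naive estimate: mean wage of enrolled minus mean wage of non-enrolled."""
    treated = [w for e, w in rows if e]
    control = [w for e, w in rows if not e]
    return mean(treated) - mean(control)


rct_estimate = difference_in_means(simulate(20_000, randomized=True))
obs_estimate = difference_in_means(simulate(20_000, randomized=False))
print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"RCT estimate:           {rct_estimate:.2f}")
print(f"observational estimate: {obs_estimate:.2f}")
```

Under these assumptions, the randomized estimate should land close to the true effect, while the observational estimate is inflated because motivation raises both enrollment and wages.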

Causal inference takes many flavors. For example, theorists may study the assumptions required to estimate treatment effects at asymptotically optimal rates. Can theoretical causal inference benefit from frictionless reproducibility? While Donoho focuses on computational fields, theoretical causal inference has its own analogues of the frictionless reproducibility triad: shared estimands instead of shared data, published proofs instead of published code, and theoretical evaluation metrics (e.g., rates of convergence) instead of computational ones.

However, I do not think that theoretical causal inference can have frictionless reproducibility. For one, it has frictions; while you can obtain a new machine learning model by changing one line of code, you cannot easily derive new asymptotic bounds by changing one line of a proof. This friction slows down progress. Moreover, many open problems are about the implications of different assumptions. It is hard for me to imagine how frictionless reproducibility can take hold when assumptions vary across participants in a field.

Instead, I will focus on empirical causal inference, by which I mean the study of causal methods in applied settings. This broad field encompasses any community that has coalesced around a shared set of assumptions (e.g., that valid causal inferences can be made from the observational data under consideration) and is focused on finding the best empirical strategies for estimating causal effects.

How compatible is empirical causal inference with frictionless reproducibility? Let us walk through the three principles of frictionless reproducibility, and compare what they look like in machine learning to what they could look like in empirical causal inference:

  1. Data sharing. Empirical machine learning was built around a culture of data sharing, with the MNIST and ImageNet data sets being prominent examples. A similar culture is already thriving in empirical causal inference. The results of many RCTs and observational studies are public, with aggregated and searchable archives online. Of course, some data sets cannot be shared due to privacy constraints, for example, nationwide administrative data sets. Although universal openness is not always feasible, data sharing is well established.

  2. Re-execution. The ability to rerun models on shared data sets is a cornerstone of machine learning research, allowing the community to verify and extend new methods. Here too, empirical causal inference has embraced a culture of code sharing. In fact, many journals require both data and code to be published alongside papers.

  3. Challenges. This leaves the third principle, challenges. There are many examples of challenges in machine learning (e.g., the ImageNet Large Scale Visual Recognition Challenge), and the community has reaped the benefits of these competitions in driving innovation. There are aspects of the challenge paradigm that I think are very compatible with empirical causal inference, for example, public tasks and community engagement. But there is one component that I think is significantly harder to develop in empirical causal inference than in machine learning: quantifiable performance metrics.

    To see why, first consider machine learning. The goal of many machine learning problems is to make predictions; for example, to classify which object is present in a picture in ImageNet. Many of these problems come with labeled examples of what should be predicted (each ImageNet picture is labeled with the object that is present). A myriad of metrics exist for assessing a model’s performance by comparing its prediction to the given label.

    In contrast to prediction, the goal of empirical causal inference is to estimate the effect of an intervention. Crucially, while prediction problems often come with labels that predictions can be compared to, the true effects of interventions are often unknown. Quantifying performance is much more challenging in this setting.
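The contrast in the third principle can be written out in symbols. The potential-outcomes notation below is an assumption introduced for illustration and does not appear in the original text: $Y_i(1)$ and $Y_i(0)$ are unit $i$'s outcomes with and without the intervention, and $T_i$ is treatment status.

```latex
% Prediction: every example i comes with a label y_i, so the metric is computable.
\mathrm{accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{\hat{y}_i = y_i\}

% Causal estimation: the target is an average of unit-level contrasts ...
\tau = \mathbb{E}\left[ Y_i(1) - Y_i(0) \right]

% ... but only one potential outcome per unit is ever observed,
Y_i = T_i \, Y_i(1) + (1 - T_i) \, Y_i(0)

% so the error of an estimate cannot be computed from the data alone.
\mathrm{error} = \left| \hat{\tau} - \tau \right|
```

In this notation, a leaderboard needs some stand-in for $\tau$: either an external benchmark or a setting where $\tau$ is known by construction.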

While empirical causal inference is compatible with the first two principles of the frictionless reproducibility triad, it is more difficult to construct the quantifiable performance metrics that would underlie causal inference challenges. Still, this does not mean that challenges are unattainable. Below, I will highlight two approaches that have emerged for designing challenges in empirical causal inference:

  • Comparing RCTs to observational data. The central difficulty in quantifying performance in empirical causal inference is that in many settings, ground-truth causal effects are unknown. However, even if practitioners do not know a true causal effect, RCTs can provide more reliable estimates than observational data. So if an RCT and an observational study exist in the same setting, methods for estimating causal effects from the observational study can be assessed by how closely they recover the RCT-based estimate.

    This insight has spurred ‘virtual challenges’ in applied econometrics. One prominent example is based on the work performed by Robert LaLonde in 1986, who evaluated the effectiveness of various methods by comparing estimates from observational data to those obtained from a randomized experiment (LaLonde, 1986). He found that many estimators used to analyze observational data did not accurately estimate the effect of a labor training program compared to the experimental benchmark. Since then, others have reanalyzed the data using more sophisticated methods (Dehejia & Wahba, 1999; Imbens, 2015), recovering estimates closer to the randomized study.

  • Simulating data. Causal effects are also available for synthetic data sets, which are simulated from known data-generating processes. Here, the ‘ground-truth’ effects are known because the data-generating process is known. Individual estimates can be quantitatively assessed by how they differ from these known effects.

    In challenges based on synthetic data, participants do not have access to the data-generating process, but organizers can compare and grade individual methods. One example is the American Causal Inference Conference (ACIC) challenge, which has been running since 2016 (Hahn et al., 2019). Organizers of the ACIC challenge simulate data that resembles real-world data from fields like health care and education, and participants are graded on their ability to estimate a series of estimands. Donoho’s article brings up another challenge based on synthetic data in the medical setting (Ruberg et al., 2024).
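The grading logic behind a synthetic-data challenge can be sketched in a few lines. This is a toy illustration under stated assumptions, not the ACIC procedure: the data-generating process, the value of `TRUE_ATE`, and the two "participant" estimators (`naive` and `stratified`) are all hypothetical.

```python
import random
from statistics import mean

TRUE_ATE = 1.5  # known only to the hypothetical organizers


def generate(n, seed):
    """Organizer side: draw (x, t, y) from a process with a known effect."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random() < 0.5                   # observed binary covariate
        t = rng.random() < (0.8 if x else 0.2)   # treatment depends on x
        y = 4.0 * x + TRUE_ATE * t + rng.gauss(0.0, 1.0)
        data.append((x, t, y))
    return data


def naive(data):
    """Participant A: difference in means, ignoring the covariate."""
    return mean(y for _, t, y in data if t) - mean(y for _, t, y in data if not t)


def stratified(data):
    """Participant B: difference in means within each covariate stratum,
    weighted by stratum size."""
    total = 0.0
    for level in (False, True):
        stratum = [(x, t, y) for x, t, y in data if x == level]
        total += naive(stratum) * len(stratum) / len(data)
    return total


def grade(estimator, n_datasets=20, n=5_000):
    """Organizer side: mean absolute error against the known effect."""
    return mean(abs(estimator(generate(n, seed)) - TRUE_ATE)
                for seed in range(n_datasets))


leaderboard = sorted((grade(f), f.__name__) for f in (naive, stratified))
for error, name in leaderboard:
    print(f"{name:<12} mean abs error = {error:.3f}")
```

The same `grade` function would cover the RCT-benchmark approach if `TRUE_ATE` were replaced by an experimental estimate, with the caveat that such a benchmark is itself estimated with noise.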

Despite difficulties in quantifying performance metrics, certain types of challenges can be and have been conducted in empirical causal inference. However, challenges have not had the same impact in empirical causal inference as they have had in machine learning; we have not seen an ‘ImageNet moment’ for causal inference. Why has empirical causal inference not embraced challenges to the same extent as machine learning? I do not have a definitive answer, but I will offer some thoughts.

One possibility is that despite the promise of causal inference challenges, they are too difficult to implement. For example, conducting RCTs is expensive, and so collecting a large set of RCT-observational data pairs may be prohibitive. In comparison to conducting large-scale RCTs, collecting labels for machine learning challenges like ImageNet via crowdsourcing is cheap.

Another possibility relates to the representativeness of challenges in empirical causal inference. The common task framework is most valuable in settings where success on one challenge can provide insight into broader success for other problems. While many challenges in machine learning carry this property, it may be less common for empirical causal inference.

For example, consider a machine learning practitioner who wants to design a predictive model to classify hand-written digits. What can they glean by looking at the models that are successful at ImageNet, an object classification challenge? Empirically, models that do well on one kind of vision problem carry lessons for other kinds of vision problems. That is, even if the best models on ImageNet are poor at classifying handwriting, the lessons they provide about how to succeed at ImageNet—for example, architectural innovations such as residual connections and batch normalization—can be ported to other vision problems.

Now consider a causal inference practitioner who wants to use observational data to estimate the effect of a job training program on wage. Imagine a causal inference analogue of ImageNet, perhaps a challenge about estimating the effect of promotions on wage from observational data where RCTs are used to assess performance. What lessons can the practitioner draw from the successful methods in the challenge? On the one hand, both the challenge and the problem the practitioner is interested in are about drawing causal conclusions from labor data. However, there may be important differences: the two data sets could include different covariates or come from different countries. The underlying causal mechanisms may vary significantly, preventing the practitioner from drawing broader conclusions from the challenge. Compared to prediction challenges in machine learning, estimation challenges in causal inference promise less generalizability.

Ultimately, I do not think empirical causal inference is inherently incompatible with frictionless reproducibility. The principles of open data and code sharing are already there. However, developing community-wide challenges with quantified performance metrics and widespread buy-in may involve frictions.


Acknowledgments

Thank you to Susan Athey, Omeed Maghzian, Sendhil Mullainathan, Juan Carlos Perdomo, Ashesh Rambachan, and Suproteem Sarkar for helpful comments and conversations.

Disclosure Statement

Keyon Vafa has no financial or non-financial disclosures to share for this article.


References

Dehejia, R. H., & Wahba, S. (1999). Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American Statistical Association, 94(448), 1053–1062.

Donoho, D. (2024). Data science at the singularity. Harvard Data Science Review, 6(1).

Hahn, P. R., Dorie, V., & Murray, J. S. (2019). Atlantic Causal Inference Conference (ACIC) data analysis challenge 2017. ArXiv.

Imbens, G. W. (2015). Matching methods in practice: Three examples. Journal of Human Resources, 50(2), 373–419.

LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, 76(4), 604–620.

Ruberg, S., Zhang, Y., Showalter, H., & Shen, L. (2024). A platform for comparing subgroup identification methodologies. Biometrical Journal, 66(1), Article 2200164.

©2024 Keyon Vafa. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
