From controlled to undisciplined data: estimating causal effects in the era of data science using a potential outcome framework

This paper discusses the fundamental principles of causal inference, the area of statistics that estimates the effect of specific occurrences, treatments, interventions, and exposures on a given outcome from experimental and observational data. We explain the key assumptions required to identify causal effects, and highlight the challenges associated with the use of observational data. We emphasize that experimental thinking is crucial in causal inference. The quality of the data (not necessarily the quantity), the study design, the degree to which the assumptions are met, and the rigor of the statistical analysis allow us to credibly infer causal effects. Although we advocate leveraging big data and machine learning (ML) algorithms for estimating causal effects, they are not a substitute for thoughtful study design. Concepts are illustrated via examples.

Causal inference addresses the effect of a well-defined treatment or intervention W on a specific outcome Y (e.g., body weight, depression symptoms, life expectancy, COVID-19 deaths). This paper discusses the fundamental ideas of causal inference under a potential outcome framework (Neyman, 1990; Rubin, 1974, 1978) in relation to new data science developments. As statisticians, we focus on study design and estimation of causal effects of a specified, well-defined intervention W on an outcome Y from observational data.
The paper is divided into eight sections:
• Sections 1 and 2 -a heuristic contrast between randomized controlled experiments and observational studies.
• Section 3 -the design phase of a study, including the illustration of the key assumptions required to define and identify a causal effect.
• Section 4 -a non-technical overview of common approaches for estimating a causal effect, focusing on Bayesian methods.
• Section 5 -advantages and disadvantages of the most recent approaches for machine learning (ML) in causal inference.
• Section 6 -recent methods for estimating heterogeneous causal effects.
• Section 7 -a discussion of the critical role of sensitivity analysis to enhance the credibility of causal conclusions from observational data.
• Section 8 -a concluding discussion outlining future research directions.
1.1. The Potential Outcomes framework is just one of the many popular approaches to causal inference: Before we present our panoramic view of the potential outcome framework for causal inference, we want to acknowledge that causation and causality are scientific areas that span many disciplines including, but not limited to, statistics. Several approaches to causality have emerged and have become popular in the areas of computer, biomedical, and social sciences.
Although the focus of this paper is on the potential outcome framework, we want to stress the importance of acknowledging the other approaches and viewpoints that exist in the general area of causality. A Google search on causality returned 45 books (https://www.goodreads.com/shelf/show/causality), of which fewer than 10 focus on statistical estimation of causal effects. Many of those books provide philosophical views of causal reasoning in the context of different disciplines and a comprehensive overview of causality as a discipline. Texts that discuss causal methods from the potential outcome perspective (even if not always exclusively) include Winship (2007, 2014); Angrist and Pischke (2008); Rosenbaum (2002, 2010); Imbens and Rubin (2015); Hernán and Robins (2020). The approach of Pearl (2000) and Pearl and Mackenzie (2018) has become very popular in computer science (see, for example, Peters et al., 2017). In particular, the books by Pearl introduce causal graphs: graphical representations of causal models in which assumptions about potential causal relationships are encoded as edges between nodes. These graphs are used to determine whether the data can identify causal effects and to visually represent the assumptions of a causal model. The benefits of this approach are particularly evident when multiple variables could have causal relationships in a complex fashion (see also Hernán and Robins, 2020).
Other books cover the philosophical meaningfulness of causation (Cartwright, 2007; Beebee et al., 2009; Halpern, 2016); deducing the causes of a given effect (Dawid et al., 2017); understanding the details of a causal mechanism (VanderWeele, 2015); or discovering or unveiling the causal structure (Spirtes et al., 2001; Glymour et al., 2014). Unfortunately, it is impossible to summarize all of these contributions in a single paper.
For the type of studies we have encountered in our work in sociology, political science, economics, environmental, and health sciences the typical setting is to estimate the causal effect of a prespecified treatment or intervention W on an outcome Y . In these instances we have found the potential outcome approach useful to draw causal inferences. The potential outcome framework is also helpful to bridge experimental and observational thinking.
In this paper we provide a statistical view of the potential outcome framework for causal inference.
We emphasize that there is a lot we can learn from the design of randomized controlled trials (RCTs) for estimating causal effects in the context of observational data. Furthermore, we will stress the importance of quantifying uncertainty around causal effects and how to conduct sensitivity analyses of causal conclusions with regard to violations of key assumptions.

The world of data science is about observational data
Confounding bias is a key challenge when estimating causal effects from observational data.
Let's assume that we are conducting an observational study to estimate the causal effect of a new drug compared to an older drug to lower blood pressure. Because the study is observational, it is highly likely that individuals who took the new drug are systematically different from the individuals who took the older drug with respect to their socioeconomic and health status. For example, it is possible that individuals with a higher income might have easier access to the new treatment and at the same time might be healthier than individuals with low income. Therefore, if we compare individuals taking the new drug to individuals taking the older drug without adjusting for income, we might conclude that the new drug is effective, when instead the difference we observe in blood pressure is due to individuals taking the new drug being richer and healthier to begin with.
For now, we can assume that a variable is a potential confounder if it is a pre-treatment characteristic of the subjects (e.g. income) that is associated with the treatment (e.g. getting the new drug) and also associated with the outcome (e.g. blood pressure) 1 .
In our second example we define the treatment and the outcome as follows:
• Treatment -getting a dog W = 1, not getting a dog W = 0
• Outcome -severe depression symptoms Y = 1, mild depression symptoms Y = 0, measured after the treatment assignment W
Confounders could mask or confound the relation between W and Y, which complicates causal attribution or leads to potentially incorrect inferences. For the depression/dog example (Figure 1), a potential confounder is the severity of depression symptoms (denoted by X) before treatment assignment. It is reasonable to believe that individuals with severe symptoms of depression pre-treatment (X = 1) are more likely to adopt a dog (W = 1) than people with mild symptoms of depression (X = 0). Furthermore, individuals with severe symptoms of depression before the treatment assignment (X = 1) are more likely to have severe symptoms of depression after the treatment assignment (Y) than individuals with mild symptoms of depression (X = 0).
RCTs are the gold standard study design used to estimate causal effects. To assess the causal effect on survival of getting a new drug compared to a placebo, we could randomize the patients enrolled in our study: half would receive the new drug (W = 1), and the other half would receive a placebo (W = 0). Randomization is particularly important to establish the efficacy and safety of drugs (new and existing) (Collins et al., 2020). This is because randomizing patients eliminates systematic differences between treated and untreated observations. In other words, randomization ensures that these two sets of observations are as similar as possible with respect to all potential confounders, regardless of whether we measure these potential confounders, and are identical on average. If the distribution of the measured and unmeasured confounders is the same in the two groups, then we can use the treated observations to infer what would have happened to the untreated observations. Unfortunately, randomization is often not possible, either because there are ethical conflicts (such as exposure to environmental contaminants) or because it is challenging to implement. In the latter case, the most constraining factors are the time and monetary expense of data collection.
Additional limitations of randomization include inclusion criteria that are too strict to study large and representative populations (Athey and Imbens, 2017). Moreover, RCTs usually focus on simplified interventions (e.g., randomization to a drug versus placebo) that do not mirror the complexity of real-world decision making. While the credibility (internal validity) and ability to advance scientific discovery of RCTs is well accepted (e.g., the 2019 Nobel Memorial Prize in Economic Sciences; Duflo et al., 2007; Banerjee et al., 2015), there are large classes of interventions and causal questions for which results that have a causal interpretation can only be gathered from observational data.
Figure 1. Top panel: rate of experiencing severe depression symptoms one year after getting a dog W = 1 (83%) and one year after not getting a dog W = 0 (78%). Bottom panel: rate of experiencing severe depression symptoms one year after getting a dog W = 1 and one year after not getting the dog W = 0, separately for the two sub-populations that experienced severe or mild symptoms of depression before treatment assignment, denoted by X = 1 and X = 0, respectively. In the top panel, where we do not stratify by levels of X, getting a dog seems to increase the degree of severity of depression symptoms. In the bottom panel, where we stratify by levels of X, getting a dog appears to decrease the severity of depression symptoms.

Fortunately, in this new era of data science, we have access to significant observational data. We can, for example, easily identify large and representative populations of cancer patients, and determine from medical records who received standard therapy or new therapy (or multiple concomitant therapies). Additionally, we can ascertain age, gender, behavioural variables, income, and health status before treatment assignment, and assess cancer recurrence and survival (Arvold et al., 2014). However, because we observe who receives treatment instead of randomizing who receives treatment, the treated and untreated sub-populations are likely to be systematically different with respect to each of the potential confounders. Without adjusting for systematic differences between the treated and untreated populations, our inference on causal effects will be biased. Given the significant amounts of available data, it is tempting to use correlations observed in the data as evidence of causation; but strong correlation can lead to misleading conclusions when quantifying causal effects. Figure 1 provides an example of confounding. Let's assume we compare two samples: one that adopts a dog (W = 1) and one that does not (W = 0).
Then within each of these two populations, we calculate the rate of experiencing severe symptoms of depression Y = 1. We find that adopting a dog appears to make the symptoms of depression worse: if you have a dog you are 5% more likely (83% versus 78%) to experience severe symptoms of depression. Should we advise people not to own dogs? The problem with this analysis is that we ignore the fact that the subjects might be different in ways that would bias the conclusions. As mentioned before, a key potential confounder is the degree of severity of their depression symptoms before they were assigned the treatment (X). For example, let's stratify the two populations (treated and untreated) based on whether they experienced severe or mild depression symptoms (X = 1 versus X = 0) before treatment assignment. We find that within these two population strata, adopting a dog reduces the rate of experiencing severe symptoms of depression. This is an example of what is known as Simpson's paradox. Here the paradox occurs because people with severe depression symptoms before treatment assignment are more likely to adopt a dog.
If we define by e_i = P(W_i = 1 | X_i) the propensity of adopting a dog conditional on the level of depression symptoms pre-treatment, then in this example P(W_i = 1 | X_i = 1) = 772/(772 + 249) is higher than P(W_i = 1 | X_i = 0) = 228/(228 + 751). In other words, the assignment to treatment (who gets a dog and who does not) is not completely random, as it would be in an RCT. It is influenced by the pre-existing level of depression of the study subjects. Situations like these are very common in observational studies. We argue that the potential outcome (PO) framework detailed below allows us to design an observational study and clarifies the assumptions that are required to estimate causal effects in these studies. Such assumptions translate expert knowledge into identifying conditions that are hard, if not impossible, to verify from data alone.
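The reversal described above can be checked directly. In the sketch below, the stratum sizes (772, 249, 228, 751) are taken from the text, while the within-stratum outcome counts are hypothetical, chosen only to reproduce the marginal rates of roughly 83% and 78% reported in Figure 1:

```python
# Illustration of Simpson's paradox in the dog/depression example.
# Group sizes come from the text; within-stratum severe-symptom counts
# are hypothetical, chosen to reproduce the pattern in Figure 1.

# counts[(w, x)] = (number with severe symptoms Y = 1, group size)
counts = {
    (1, 1): (733, 772),   # got a dog, severe pre-treatment symptoms
    (0, 1): (247, 249),   # no dog,    severe pre-treatment symptoms
    (1, 0): (96, 228),    # got a dog, mild pre-treatment symptoms
    (0, 0): (526, 751),   # no dog,    mild pre-treatment symptoms
}

def rate(pairs):
    severe = sum(s for s, n in pairs)
    total = sum(n for s, n in pairs)
    return severe / total

# Marginal comparison (top panel of Figure 1): dogs look harmful.
marg_treated = rate([counts[(1, 1)], counts[(1, 0)]])   # 829/1000 = 0.829
marg_control = rate([counts[(0, 1)], counts[(0, 0)]])   # 773/1000 = 0.773
assert marg_treated > marg_control

# Stratified comparison (bottom panel): dogs look beneficial in BOTH strata.
for x in (0, 1):
    assert rate([counts[(1, x)]]) < rate([counts[(0, x)]])

# The reversal arises because the propensity of getting a dog,
# e = P(W = 1 | X), differs sharply across strata:
e_severe = 772 / (772 + 249)   # ~0.76
e_mild = 228 / (228 + 751)     # ~0.23
```

Because the severely depressed stratum is both more likely to adopt a dog and more likely to report severe symptoms afterwards, the unadjusted comparison reverses the sign of the stratified comparisons.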
The PO framework is rooted in the statistical work on randomized experiments by Fisher (1918, 1925) and Neyman (1990), extended by Rubin (1974, 1976, 1977, 1978, 1990) and subsequently by others to apply to observational studies. This perspective was called Rubin's Causal Model by Holland (1986) because it viewed causal inference as a missing data problem and proposed explicit mathematical modeling of the assignment mechanism that generates the observed data.

The Design Phase of a Study
We begin this section by describing the key distinctions between an RCT and an observational study. Table 1 summarizes and contrasts the main differences between RCTs and observational studies, and includes guidance on how to conduct causal inference in the context of observational studies in the last column.
The appeal of the RCT is that the design phase of the study (e.g., units, treatment, and timing of the assignment mechanism) is clearly defined a priori, before data collection, including how to measure the outcome. In this sense, the RCT design is always prospective: treatment assignment randomization always occurs before the outcome is measured. A key feature of RCTs is that the probability of getting the treatment or the placebo, defined as the propensity score, is known (it is under the experimenter's control) and does not depend on unobserved characteristics of the study subjects. Randomization of treatment assignment is also fundamentally useful because it balances observed and unobserved covariates between the treated and control subjects. Once the design phase is concluded, the experimenter can proceed with the analysis phase, that is, estimating the causal effects based on a statistical analysis protocol that was pre-selected without looking at the information on the outcome. The separation between design and analysis is critical, as it guarantees objective causal inference. In other words, it prevents the experimenter from picking and choosing the estimation method that would lead to their preferred conclusion.
In observational studies the treatment conditions and the timing of treatment assignment are observed after the data have been collected. The data are often collected for other purposes, not explicitly for the study. As a result, the researcher does not control the treatment assignment mechanism. Moreover, the lack of randomization means there is no guarantee that covariates are balanced between treatment groups, which could result in systematic differences between the treatment group and the control group. Traditionally, practitioners do not draw a clear distinction between the design and analysis phases when they analyze observational data: they estimate causal effects using regression models with arbitrarily chosen covariates, lacking the clear protocol of RCTs. To make objective causal inference from observational studies, we must address these challenges. Luckily, it is possible to achieve objective causal inferences from observational studies with a careful design that approximates a hypothetical, randomized experiment. A carefully designed observational study can duplicate many appealing features of RCTs, and provide an objective inference on causal effects (Rubin, 2007; Rubin et al., 2008; Hernán and Robins, 2016). In the context of causal inference, the design of an observational study involves at least three steps (presented below). These steps should be followed by an analysis phase where the estimation approach is defined according to a pre-specified protocol, as in the RCT. The three steps in the design phase are: (1) define experimental conditions and potential outcomes (subsection 3.1); (2) define the causal effect of interest, including assumptions for identifiability (subsection 3.2); (3) construct a comparison group (subsection 3.3).
3.1. Define the experimental conditions and the potential outcomes. The first step to address a causal question is to identify the conditions (actions, treatments, or exposure levels) required to assess causality. To define the causal effect of W = 1 versus W = 0 on Y, one must postulate the existence of two potential outcomes: Y(W = 1) and Y(W = 0). As the name implies, both variables are potentially observable, but only the one associated with the observed (or assigned) action will be observed. The critical feature of the notion of a cause is that the value of W for each unit can be manipulated. To illustrate this idea, we now introduce the subscripts (t − 1) and t to indicate time at or before the treatment assignment and time after treatment assignment (e.g., one year later). For example, let's say Amy (subject i at a specific time t − 1) adopted a dog (W_{i,t−1} = 1) and we observe her depression symptoms one year later, at time t: Y_{i,t}(W_{i,t−1} = 1) = Y_{i,t}(1). Now we must assume that W can be manipulated; that is, we must be able to hypothesize a situation where Amy would not get a dog (W_{i,t−1} = 0), with corresponding potential outcome Y_{i,t}(0).

3.2. Define the causal effect of interest. We define a unit-level causal effect as the comparison between the potential outcome under treatment and the potential outcome under control. As we can see from Figure 2, the causal effect of interest is not a pre/post comparison of the depression symptoms for Amy (defined as Y_{i,t}(1) − X_{i,t−1}). It is, instead, the difference between her two potential outcomes evaluated at time t, defined as Y_{i,t}(1) − Y_{i,t}(0), where Y_{i,t}(1) is observed whereas Y_{i,t}(0) is not. These differences are called individual treatment effects (ITE) (Rubin, 2005; see also Dahabreh et al., 2016) and facilitate decision making in individualized settings where an estimate of the causal effect averaged across all the subjects in the sample may not be very practical (e.g., Li et al., 2019). We will return to this issue in Section 5.
The last column of Table 2 introduces the concept of summarizing individual causal effects across the population of interest. For example, we can summarize the unit-level causal effects by taking the average difference, or by taking the difference in median values of the Y_i(1)s and Y_i(0)s, respectively. We typically focus on causal estimands that contrast potential outcomes on a common set of units (our target sample of size N), for example the average treatment effect (ATE):

ATE = (1/N) Σ_{i=1}^{N} [Y_i(1) − Y_i(0)].

The fundamental challenge is that we will never observe a potential outcome under a condition other than the one that actually occurred, so we will never observe an individual causal effect (see Table 3). Holland (1986) refers to this as the fundamental problem of causal inference. We typically refer to the missing potential outcome as the counterfactual and to the observed outcome as the factual outcome. Causal inference relies on the ability to predict the counterfactual outcome.
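The fundamental problem can be made concrete with a small simulation. Below we generate a toy "Science table" in which both potential outcomes are known by construction, something never available in practice; the data-generating process and the constant effect of +2 are purely illustrative assumptions:

```python
import random

random.seed(0)

# A toy "Science table": for illustration we generate BOTH potential
# outcomes per unit -- something never available in a real study.
N = 10_000
science = []
for i in range(N):
    y0 = random.gauss(0.0, 1.0)   # outcome without treatment
    y1 = y0 + 2.0                 # hypothetical constant effect of +2
    science.append((y1, y0))

# The (sample) average treatment effect uses both columns:
ate = sum(y1 - y0 for y1, y0 in science) / N   # exactly 2.0 here

# In any real study each unit reveals only ONE potential outcome:
observed = []
for y1, y0 in science:
    w = random.random() < 0.5     # randomized assignment
    observed.append((w, y1 if w else y0))

# The counterfactual column is missing; with randomization, the
# difference in observed group means is an unbiased estimate of the ATE.
ys_t = [y for w, y in observed if w]
ys_c = [y for w, y in observed if not w]
est = sum(ys_t) / len(ys_t) - sum(ys_c) / len(ys_c)   # close to 2.0
```

The point of the sketch is the missing column: `ate` can be computed only because the simulation cheats and records both potential outcomes, whereas `est` must be reconstructed from half-observed data.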
It is important to note that methods used to predict or impute counterfactuals are different from off-the-shelf prediction or imputation methods often used for missing values. This is because we will never be able to find data where both potential outcomes (Y_i(1), Y_i(0)) are simultaneously observed on a common set of units.

Table 3. What we are able to observe about the Science. [Table with columns: Unit, Covariates, Treatment, Potential Outcomes.]

Table 3 highlights other fundamental implications of this representation of causal parameters: a) uncertainty remains even if the N units are our finite population of interest, because of the missingness in the potential outcomes; b) the inclusion of more units provides additional information (more factual outcomes) but also increases missingness (more counterfactual outcomes).
Data alone are not sufficient to predict the counterfactual outcome. We need to introduce several assumptions that essentially embed subject-matter expert knowledge (Angrist and Pischke, 2008). This is why machine learning alone cannot resolve causal inference problems, an issue discussed further in Section 5. To identify a causal effect from the observed data, we have to make several assumptions.
Assumption 1: Stable Unit Treatment Value Assumption (SUTVA). SUTVA, introduced and formalized in a series of papers by Rubin (see Rubin, 1980, 1986, 1990), requires that there is no interference and no hidden version of the treatment. No interference means that the potential outcomes of a unit i depend only on the treatment unit i receives, and are not affected by the treatment received by other units. For example, epidemiological studies of the causal effects of non-pharmaceutical interventions (e.g., a stay-at-home advisory) on the probability of getting COVID-19 violate the assumption of no interference. This is because the individual-level outcome (whether or not a subject is infected) depends on whether he/she complies with the stay-at-home advisory, but also on whether or not others in the same household comply with it.
Interference may also arise when the observations are correlated in time or space. In our ice cream case study this assumption is likely to hold, as it is reasonable to assume that the only person who benefits from weight loss is the person who stopped eating ice cream. SUTVA violations can be particularly challenging in air pollution regulation studies, as pollution moves through space and thus presents a setting for interference.
Intervening at one location (e.g., a pollution source) likely affects pollution and health across many locations, meaning that potential outcomes at a given location are probably functions of local interventions as well as interventions at other locations (Papadogeorgou et al., 2019; Forastiere et al., 2020). The condition of no hidden version of treatments requires that potential outcomes not be affected by how unit i received the treatment. This assumption is related to the notion of consistency (Hernán, 2016; Hernán and Robins, 2020). For example, how Amy adopted a dog (from a friend giving away puppies or by driving to a breeder) must not affect Amy's outcome.
Our ability to estimate the missing potential outcomes depends on the treatment assignment mechanism, that is, on the probabilistic rule that determines which units receive W = 1 versus W = 0. The assignment mechanism is defined as the probability of receiving the treatment conditional on X, Y(1), and Y(0), i.e., P(W | X, Y(1), Y(0)). This expression will be simplified under the next assumption.
Assumption 2: No unmeasured confounding. The assignment mechanism is unconfounded if: P(W | X, Y(1), Y(0)) = P(W | X). Unconfoundedness is also known as the no unmeasured confounding assumption, or the conditional independence assumption. It means that if we stratify the population into subgroups with the same covariate values (e.g., same age, gender, race, income), then within each of these strata the treatment assignment (e.g., who gets the drug and who does not) is random.
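A minimal simulation illustrates what unconfoundedness buys us: when assignment depends on a measured covariate X, the naive treated-control contrast is biased, while the contrast computed within strata of X (and then averaged over the strata) recovers the true effect. The data-generating process below is purely illustrative:

```python
import random

random.seed(1)

# Sketch: treatment is as-if randomized WITHIN strata of X, so a
# stratum-by-stratum contrast removes the confounding bias.
N = 100_000
data = []  # (x, w, y)
for _ in range(N):
    x = random.random() < 0.5                    # binary confounder
    p = 0.8 if x else 0.2                        # X drives treatment...
    w = random.random() < p
    base = 3.0 if x else 0.0                     # ...and the outcome
    y = base + (1.0 if w else 0.0) + random.gauss(0, 1)  # true effect = 1
    data.append((x, w, y))

def mean(ys):
    return sum(ys) / len(ys)

# Naive contrast is confounded (it also picks up the effect of X):
naive = mean([y for x, w, y in data if w]) - mean([y for x, w, y in data if not w])

# Stratified contrast, weighted by stratum share, recovers roughly 1:
adj = 0.0
for x0 in (False, True):
    t = [y for x, w, y in data if x == x0 and w]
    c = [y for x, w, y in data if x == x0 and not w]
    share = sum(1 for x, _, _ in data if x == x0) / N
    adj += share * (mean(t) - mean(c))
```

Here `naive` is far above the true effect of 1 (around 2.8, since treated units disproportionately come from the high-outcome stratum), while `adj` is close to 1. The adjustment works only because X is measured and assignment is random within its strata, which is exactly what Assumption 2 asserts.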
The assumption allows us to provide a formal definition of a confounder. Although there is no consensus regarding a unique and formal definition, we adopt the one proposed by VanderWeele and Shpitser (2013): a pre-exposure covariate X is said to be a confounder for the effect of W on Y if there exists a set of covariates X* such that the effect of W on Y is unconfounded conditional on (X*, X), but not for a subset of (X*, X). Equivalently, a confounder is a member of a minimally sufficient adjustment set.
Unconfoundedness is critical to estimate causal effects in observational studies. As Y_i(1) is never observed on subjects with W_i = 0 and Y_i(0) is never observed on subjects with W_i = 1, we cannot test this assumption, and so its plausibility relies on subject-matter knowledge. As a result, sensitivity analysis should be conducted routinely to assess how the conclusions would change under specific deviations from this assumption (discussed in Section 6). Moreover, this assumption may fail to hold if some relevant covariates are not observed, or if decisions are based on information about the potential outcomes. For example, a perfect doctor (Imbens and Rubin, 2015) gives a drug only to the patients who benefit from the treatment (e.g., those with Y_i(1) > Y_i(0)): the assignment is confounded, i.e., it depends on the potential outcomes irrespective of the covariates we are able to condition on. We discuss these situations in our final remarks.

Assumption 3: Overlap or positivity. We define the propensity score for subject i as the probability of getting the treatment given the covariates (Rosenbaum and Rubin, 1983): e_i = e(X_i) = P(W_i = 1 | X_i). The assumption of overlap requires that all units have a propensity score strictly between 0 and 1, that is, they all have a positive chance of receiving either of the two levels of the treatment.
In the depression/dog example, this may be violated if some people in the population of interest are allergic to dogs and therefore their probability of getting a dog is zero. In the clinical example, this assumption is commonly violated if a patient has a genetic mutation that prevents him/her from receiving the treatment being tested. Because the propensity score can be estimated from data, we can check whether overlap holds. If for some units the estimated e_i is very close to either 1 or 0, then these units are only observed under a single experimental condition and therefore contain very little information about the causal effect. In this situation, strong assumptions are necessary.
For example, a strong assumption is to say that the functional form relating the covariates to the outcome also holds outside the observed range of the covariates. A more formal approach to overcoming violations of the positivity assumption is presented in Nethery et al. (2019). If Assumptions 2 and 3 are both met, then the assignment mechanism is strongly ignorable (Rosenbaum and Rubin, 1983). Classic randomized experiments are special cases of strongly ignorable assignment mechanisms.
3.3. How to construct an adequate comparison group. Once you have identified the relevant potential confounders, and assuming they are sufficient for unconfoundedness to hold, the issue of confounding can be resolved by constructing an adequate comparison group. This is a crucial step in the design of an observational study. Our goal is to synthetically recreate a setting that is very similar to a randomized experiment, so that the joint distribution of all the potential confounders is as similar as possible between the treatment and control groups (Ho et al., 2007; Stuart and Rubin, 2008; Stuart, 2010). For instance, let's return to our example about Amy: we can identify subjects who are similar to Amy with respect to the potential confounders (in this case income, health status, and severity of depression symptoms) before treatment assignment. The only difference between the matched subjects and Amy is that they did not get a dog. This should be done for Amy and for every subject in our target population who got a dog. With a large number of confounders, matching on all confounders exactly may not be feasible. A common approach to address this challenge is to use propensity scores (e_i) and match subjects with respect to e_i. The estimated propensity score is a univariate summary of all covariates and is crucial for estimating causal effects under unconfoundedness (Imbens and Rubin, 2015; Imai and Ratkovic, 2014). Subjects sharing the same value of the propensity score have the same distribution of the observed potential confounders, whether they are treated or not. Estimated propensity scores can be applied in the design phase to assess overlap and to construct a comparison group through matching, stratifying, or weighting observations.
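As a rough sketch of propensity score matching, the toy example below matches each treated unit to the control with the closest propensity score and averages the matched differences. For simplicity it matches on the true propensity; in practice e(X) would first be estimated (e.g., by logistic regression) in the design phase. All names and the data-generating process are illustrative:

```python
import bisect
import random

random.seed(2)

# 1:1 nearest-neighbor matching on the propensity score (with
# replacement). The confounder, propensity, and outcome model below
# are illustrative assumptions, with a true treatment effect of 1.
N = 20_000
units = []  # (propensity, w, y)
for _ in range(N):
    x = random.random()                      # confounder in [0, 1]
    e = 0.2 + 0.6 * x                        # propensity rises with x
    w = random.random() < e
    y = 2.0 * x + (1.0 if w else 0.0) + random.gauss(0, 1)
    units.append((e, int(w), y))

treated = [(e, y) for e, w, y in units if w == 1]
controls = sorted((e, y) for e, w, y in units if w == 0)
keys = [e for e, _ in controls]

# Match each treated unit to the control with the closest propensity
# score, then average the matched outcome differences.
diffs = []
for e, y_t in treated:
    j = bisect.bisect_left(keys, e)
    cands = [k for k in (j - 1, j) if 0 <= k < len(keys)]
    k = min(cands, key=lambda k: abs(keys[k] - e))
    diffs.append(y_t - controls[k][1])

att = sum(diffs) / len(diffs)   # close to the true effect of 1
```

A naive comparison of raw group means would be biased upward here, because treated units tend to have larger x and hence larger outcomes; matching on the propensity score aligns the confounder distribution of the comparison group with that of the treated group.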
Covariate balance can also be viewed as an optimization problem. Procedures based on this idea either directly optimize weights or find optimal subsets of controls such that the means or other characteristics of the covariates are the same in the treatment and control groups (Zubizarreta, 2012; Diamond and Sekhon, 2013; Zubizarreta et al., 2014a; Zubizarreta, 2015; Li and Thomas, 2018).
To this point we have only discussed what we are interested in estimating, the study design, and the problem of confounding. Now we will discuss how we estimate the causal effects. Being able to identify causal effects is a feature of the causal reasoning used in the potential outcome framework.
Without including causal reasoning in the design phase, you cannot recover the causal effect even with the most sophisticated machine learning or nonparametric methods (e.g., Mattei and Mealli, 2015).

Estimation
Causal estimands such as the ATE = (1/N) Σ_{i=1}^{N} [Y_i(1) − Y_i(0)] are functions of the observed outcomes Y^obs and the missing potential outcomes (the counterfactuals) Y^mis. Therefore, an estimation strategy needs to implicitly or explicitly impute Y^mis. What follows is not a comprehensive review of all estimation methods for causal effects, but rather the key ideas of Bayesian estimation of the average treatment effect (Rubin, 1978; Imbens and Rubin, 2015; Ding and Li, 2018).

4.1.
Bayesian methods for the imputation of the missing counterfactuals after the design phase is concluded. Within the model-based Bayesian framework for causal inference (Rubin, 1975(Rubin, , 1978, the Y mis are considered unknown parameters. The goal is to sample from their posterior predictive distribution conditionally to the observed data defined as: where P (X, Y (1), Y (0)) denotes the model for the potential outcomes, while P (W |X, Y (1), Y (0)) denotes the model for the treatment assignment. By sampling from this posterior distribution we can multiply impute Y mis , and then estimate AT E, or any other causal contrast, and its posterior credible interval (Rubin, 1978;Mealli et al., 2011). Note that this missing data imputation exercise is critically different from a usual prediction task: the two models introduced above contain expert knowledge (for example the assumption of strong ignorability or the inclusion of relevant covariates) that cannot be retrieved from data alone. Under unconfoundedness or no unmeasured confounding (Assumption 2: P (W |X, Y (1), Y (0)) = P (W |X)), and assuming the parameters of the model for the potential outcomes are a priori independent of the parameters of the model for the assignment mechanism, then the posterior distribution of the missing potential outcomes only depends on the parameters of the model for the outcomes. Specifically, assuming exchangeability (de Finetti, 1963), there exists a parameter vector θ having a known prior distribution p(θ) such that: The posterior predictive distribution of the missing data, P (Y mis | Y obs , W, X), can be written as Let θ W |X , θ Y |X and θ X , denote the unknown parameters corresponding to the distribution of the treatment assignment mechanism (e.g. the propensity score), the distribution of potential outcomes, and the distribution of covariates, respectively. 
Then, given ignorability, the propensity score P(W_i | X_i, θ_{W|X}) and the covariates' distribution P(X_i | θ_X) cancel out of Equation (3), which simplifies to:

\[
P(Y^{mis} \mid Y^{obs}, W, X) \propto \int \prod_{i=1}^{N} P(Y_i(1), Y_i(0) \mid X_i, \theta_{Y|X}) \, p(\theta_{Y|X}) \, d\theta_{Y|X}.   (4)
\]

Equation (4) shows that, under an ignorable treatment assignment, only the potential outcome model P(Y_i(w) | X_i, θ_{Y|X}), for w = 0, 1, and the prior distribution p(θ_{Y|X}) need to be specified to derive the posterior distribution of the causal effects.² Therefore, the most straightforward Bayesian approach to estimating causal effects under ignorability is to specify models for Y(1) and Y(0), conditional on covariates and some parameters, draw the missing potential outcomes from their posterior predictive distribution, and thereby derive the posterior distribution of any causal estimand.
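As an illustration, the imputation scheme in Equation (4) can be sketched with a toy Gaussian model: one normal linear regression per treatment arm, with known noise variance sigma2 and a vague normal prior on the coefficients. These model choices are assumptions made purely for this sketch, not a recommended specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_ate_draws(X, W, Y, n_draws=1000, sigma2=1.0, tau2=100.0):
    """Posterior draws of the ATE under ignorability: Bayesian linear
    models for Y(w)|X fit separately in each arm, with known noise
    variance sigma2 and N(0, tau2*I) priors (a toy sketch)."""
    Xd = np.column_stack([np.ones(len(X)), X])  # design matrix with intercept
    posts = {}
    for w in (0, 1):  # conjugate posterior of the coefficients in each arm
        Xw, Yw = Xd[W == w], Y[W == w]
        cov = np.linalg.inv(Xw.T @ Xw / sigma2 + np.eye(Xd.shape[1]) / tau2)
        posts[w] = (cov @ Xw.T @ Yw / sigma2, cov)
    draws = np.empty(n_draws)
    for d in range(n_draws):
        # start from the observed potential outcomes; the rest are missing
        y1 = np.where(W == 1, Y, np.nan).astype(float)
        y0 = np.where(W == 0, Y, np.nan).astype(float)
        for w, y_pot in ((1, y1), (0, y0)):
            mean, cov = posts[w]
            beta = rng.multivariate_normal(mean, cov)  # draw theta_{Y|X}
            miss = np.isnan(y_pot)  # impute Y_mis from its predictive distribution
            y_pot[miss] = Xd[miss] @ beta + rng.normal(0.0, np.sqrt(sigma2), miss.sum())
        draws[d] = np.mean(y1 - y0)  # one posterior draw of the ATE
    return draws
```

A posterior mean and credible interval for the ATE then follow from the empirical distribution of the draws (e.g., `np.percentile(draws, [2.5, 97.5])`).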
As such, it may seem that propensity scores, which are central to balancing the covariates in the design stage, do not affect Bayesian inference for causal effects under ignorability. However, as noted by Rubin (1985), good covariate balance is necessary for the effective calibration of Bayesian inference of causal effects, i.e., good frequentist properties (Ding and Li, 2019).

² Please refer to Section 7 in Rubin (1990) for a specific example, and to Imbens and Rubin (1997) for the dependence of the posterior distribution of causal effects on the association parameters in the joint distribution of Y(1) and Y(0). The discussion is beyond the scope of this paper.

4.2.
Bayesian methods for joint estimation of the outcome and propensity score models and the feedback problem. Some Bayesian methods explicitly include the estimation of the propensity score in the estimation procedure for the causal effects. Some of these approaches specify a regression model for the outcome with the propensity score as a single regressor, arguing that this modelling task is simpler than specifying a model for the outcome conditional on the whole set of (high-dimensional) covariates (Rubin, 1985). This approach can be improved by adjusting for the residual covariance between X and Y at each value of e(X) (Gutman and Rubin, 2015), or by specifying an outcome model conditional on both the covariates X and the propensity score (Zheng and Little, 2005).
These are two-stage methods that separate the design (estimation of the propensity score) from the analysis. Some authors have proposed single-step Bayesian approaches that merge the two stages, for example via stratification or weighting on the propensity score (Saarela et al., 2015). These approaches are not fully Bayesian in that they incorporate only the uncertainty in the propensity score estimation, and not that in the imputation of the missing potential outcomes. Frequentist matching methods have been improved by combining them with outcome regression adjustments (e.g., Abadie and Imbens, 2011).
Robins and colleagues (e.g., Scharfstein et al., 1999; Robins, 2000; Lunceford and Davidian, 2004; Bang and Robins, 2005; Funk et al., 2011; Knaus, 2020) have proposed a class of doubly robust (DR) estimators that combine an inverse probability weighting (IPW) estimator with an outcome regression. Interestingly, Gustafson (2012) cast the DR estimator in a Bayesian perspective as a weighted average of a parametric model and a saturated model for the outcome conditional on covariates, with weights that depend on how well the parametric model fits the data. As a result, it can also be viewed as a Bayesian model averaging estimator (Cefalu et al., 2016).
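A minimal, non-Bayesian sketch of the DR idea, using off-the-shelf logistic and linear regressions as the two nuisance models (both model choices are illustrative assumptions, not a prescription from the literature cited above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

def aipw_ate(X, W, Y):
    """Augmented IPW (doubly robust) ATE estimate: combines an outcome
    regression with inverse propensity weighting, and is consistent if
    either nuisance model is correctly specified (a plain sketch)."""
    X = np.asarray(X).reshape(len(W), -1)
    e = LogisticRegression().fit(X, W).predict_proba(X)[:, 1]   # propensity score
    mu1 = LinearRegression().fit(X[W == 1], Y[W == 1]).predict(X)  # E[Y(1)|X]
    mu0 = LinearRegression().fit(X[W == 0], Y[W == 0]).predict(X)  # E[Y(0)|X]
    return np.mean(mu1 - mu0
                   + W * (Y - mu1) / e
                   - (1 - W) * (Y - mu0) / (1 - e))
```

The correction terms re-weight each arm's residuals by the inverse of the (estimated) probability of receiving that arm, which is what protects the estimator against misspecification of one of the two models.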

4.3.
Bayesian and non-Bayesian methods to account for variable selection either in the
propensity or in the outcome model. Zigler and Dominici (2014) proposed a Bayesian model averaging method to account for the uncertainty in the selection of the covariates in the propensity score model, extending Wang et al. (2012); see also Wang et al. (2015). When the number of potential confounders is larger than the number of observations, approaches for dimension reduction and penalization are required. The standard approaches (e.g., the Lasso) generally aim to predict the outcome, and are less suited to the estimation of causal effects. Under standard penalization approaches, if a variable X is strongly associated with the treatment W but weakly associated with the outcome Y, its coefficient will be shrunk towards zero, leading to confounding bias. Belloni et al. (2014b,a) proposed a modified version of the Lasso, called the double Lasso, to reduce confounding bias. There are several Bayesian alternatives that can outperform such approaches, for example using continuous spike-and-slab priors on the covariates' coefficients in the outcome and propensity score models (Wang et al., 2015; Cefalu et al., 2016; Antonelli et al., 2017; Antonelli and Dominici, 2018).
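A rough sketch of the post-double-selection idea behind the double Lasso: keep any covariate selected by a lasso of Y on X or by a lasso of W on X, then regress Y on W and the selected covariates. The specific estimators and thresholds below are illustrative choices, not the exact procedure of Belloni et al.:

```python
import numpy as np
from sklearn.linear_model import LassoCV, LogisticRegressionCV, LinearRegression

def double_lasso_ate(X, W, Y):
    """Post-double-selection sketch: union of the covariates selected by
    an outcome lasso and a treatment lasso, then OLS of Y on W and the
    selected covariates; the coefficient on W estimates the effect."""
    sel_y = np.abs(LassoCV(cv=5).fit(X, Y).coef_) > 1e-8          # predicts Y
    sel_w = np.abs(
        LogisticRegressionCV(cv=5, penalty="l1", solver="liblinear")
        .fit(X, W).coef_.ravel()) > 1e-8                           # predicts W
    keep = sel_y | sel_w  # keeping either set guards against dropping confounders
    Z = np.column_stack([W, X[:, keep]]) if keep.any() else np.reshape(W, (-1, 1))
    return LinearRegression().fit(Z, Y).coef_[0]  # coefficient on W
```

The union step is the point: a covariate that drives treatment assignment survives selection even when its association with the outcome alone is too weak for a single outcome lasso to retain it.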

5. The perils and the strengths of machine learning methods in causal inference
We have presented approaches for the estimation of Y_mis with high-dimensional covariates and nonparametric methods, and the related issue of variable selection. At first glance, these tasks could be addressed by implementing off-the-shelf machine learning methods, but there are challenges. In this section we provide critical insights into the application of machine learning methods in causal inference.
Machine learning methods primarily address prediction or classification problems, and have been included in statistical textbooks (e.g., Hastie et al., 2009; James et al., 2013; Efron and Hastie, 2016). At a high level, there are two broad categories of machine learning methods: supervised and unsupervised learning. In supervised learning, the predictors (i.e., covariates or features) X and the outcome Y are both observed. The goal is to estimate the conditional mean of an outcome Y given a set of covariates or features X, in order to ultimately predict Y. These methods include decision trees (e.g., Breiman et al., 1984), random forests (Breiman, 2001), gradient boosting (Friedman, 2001), support vector machines (Cortes and Vapnik, 1995; Suykens and Vandewalle, 1999), deep neural networks (e.g., LeCun et al., 2015; Farrell et al., 2018), ensemble methods (e.g., Dietterich, 2000), and variable selection tools such as the LASSO (Tibshirani, 1996; Hastie et al., 2015). Regression trees, and random forests as their extension, have become very popular methods for estimating regression functions in settings where out-of-sample predictive power is important. When the outcome Y is an unordered discrete response, these supervised learning algorithms address classification problems, for example detecting spam emails. In this instance, a machine learning algorithm is trained with a set of emails labelled as spam or not-spam, so that a new email can be classified as either spam or not-spam. In unsupervised learning (see Hastie et al., 2009) only the features X are observed, and the goal is to group observations into clusters (Jain et al., 1999). Clustering algorithms essentially group units based on the mathematical similarities and dissimilarities of their features X. These tools can be used, for example, to find groups of basketball or soccer players with similar attributes, and then interpret and use these clusters to form teams or to target coaching.
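The two paradigms can be illustrated in a few lines; the synthetic data and model choices here are arbitrary assumptions for the sake of the example:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))

# Supervised: labels y are observed, and the model learns to predict them
# (here a toy "spam" rule: positive feature sum means spam).
y = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = RandomForestClassifier(random_state=0).fit(X, y)
pred = clf.predict([[2.0, 2.0]])  # classify a new, unseen point

# Unsupervised: no labels are used; units are grouped purely by similarity.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
```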
Deep learning methods are another general and flexible approach to estimating regression functions. They perform very well in settings with an extremely large number of features, such as image recognition or image diagnostics (He et al., 2016; Simonyan and Zisserman, 2014). These methods typically require a large amount of tuning to work well in practice, relative to other methods such as random forests, and as a result we will not discuss them further in this article.
Since supervised machine learning methods aim to estimate the conditional mean of an outcome Y given X, it would on the surface appear natural to exploit them to estimate the missing potential outcomes, and as a result the causal effects, especially in high-dimensional settings. However, it is not that simple. The following section presents the circumstances under which off-the-shelf supervised machine learning methods might not be appropriate for causal inference. We also discuss how these methods can be adapted to estimate causal effects; how machine learning and the statistical causal inference literature can cross-fertilize; and the open questions and problems that machine learning cannot handle on its own.

5.1.
Why off-the-shelf machine learning techniques might not be appropriate for causal inference. A key distinction between causal inference and machine learning is that the former focuses on the estimation of the missing potential outcomes, average treatment effects, and other causal estimands, while machine learning focuses on prediction and classification. Therefore, machine learning dismisses covariates with limited predictive importance. However, in causal inference, if these same covariates are correlated with the treatment assignment, they can be important confounders.
As previously discussed, omitting from the analysis covariates that are highly correlated with the treatment can introduce substantial bias in the estimation of the causal effects, even if their predictive power is weak. Another major difference is that machine learning methods are typically assessed on their out-of-sample predictive power. This approach has two major drawbacks in the context of causal inference. First, as pointed out by Athey and Imbens (2016), a fundamental difference between a prediction and the estimation of a causal effect is that in causal inference we can never observe the ground truth (in this context, the counterfactual). That is, in our example, because Amy has adopted a dog, we will never be able to measure the severity of Amy's depression symptoms under the alternative hypothetical scenario in which Amy did not adopt a dog. Therefore, standard approaches for quantifying the performance of machine learning algorithms cannot be used to assess the quality of the prediction of the missing potential outcomes, and therefore of the causal effects. Second, in causal inference we must provide valid confidence intervals for the causal estimands of interest. This is required to make decisions about which treatment or treatment regime is best for a given unit or subset of units, and about whether a treatment is worth implementing. Another limitation of machine learning techniques in causal inference is that they are developed, for the most part, in settings where the observations are independent, and therefore have limited ability to handle data that are correlated in time and/or space, such as time series, panel data, and spatially correlated processes. Additionally, they are not able to handle specific structural restrictions suggested by subject-matter knowledge, such as monotonicity, exclusion restrictions, and the endogeneity of some variables (Angrist and Pischke, 2008; Athey and Imbens, 2019).

5.2.
How machine learning methods can adapt to the goal of estimation of causal effects. Machine learning methods have been adapted to address causal inference. One popular approach is to redefine the optimization criteria, which typically depend on a function of the prediction errors, to prioritize issues arising in causal inference, such as controlling for confounders and discovering treatment effect heterogeneity (Chernozhukov et al., 2017, 2018). For example, the causal tree method proposed by Athey and Imbens (2016) is based on a rework of the criterion function of Classification and Regression Trees (Breiman et al., 1984), originally aimed at minimising the predictive error, to maximise the variation in the treatment effects and, in turn, discover the subgroups with the highest heterogeneity in the causal effects (further details are presented in Section 6). Typical regularizing algorithms used in machine learning, such as the Lasso, Elastic Net and Ridge (Hastie et al., 2015), must prioritize confounding adjustment to avoid missing relevant covariates, as seen in Belloni et al. (2014b) and Belloni et al. (2014a) and discussed in Section 4. Specifically, it is critical to first identify a good control population, for example using propensity score matching as discussed earlier. Once this critically important task has been achieved, we can be more confident that the assumptions for the identification of causal effects are met. Only at this stage do machine learning techniques provide an excellent tool to predict the missing counterfactuals (e.g., Chernozhukov et al., 2018). However, it is important to keep in mind that performance will ultimately rely on the flexible parametrization that machine learning methods impose on the data, the plausibility of the unconfoundedness assumption, and the extent of the overlap in the distribution of the covariates (Hernán et al., 2019).
Even in the presence of a high-dimensional set of covariates, the study design is important (D'Amour et al., 2017).

5.3.
Cross-fertilization between machine learning and causal inference problems. Several machine learning techniques have been adapted to improve traditional causal inference methods. These include approaches to regularization (the process of adding information to solve ill-posed problems or to prevent overfitting) that scale well to large datasets (Chang et al., 2018), and the related concept of sparsity (the idea that some variables may be dropped from the analysis without affecting the performance of the estimation of causal effects). The use of model averaging and ensemble methods, common in machine learning, is a practice now exploited in causal inference (see Section 4) (Van der Laan and Rose, 2011; Cefalu et al., 2016).
Framing data analysis as an optimization problem has inspired the development of causal inference methods based on direct covariate balancing. For example, Zubizarreta (2015) proposed a method that, instead of estimating the propensity score, directly optimizes a weight for each observation so that the covariate distributions are the same in the treatment and control groups. Matrix completion methods were originally developed in machine learning for the imputation of the missing entries of a partially observed matrix (Candès and Recht, 2009; Recht, 2011). These methodologies can be used to improve causal inference methods for panel data and synthetic control methods (Abadie et al., 2010) in settings with large N and T. In particular, matrix completion can be successfully adapted for the imputation of missing counterfactuals when a large proportion of the potential outcomes is missing.
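A toy version of the direct balancing idea: find control-unit weights that minimize dispersion subject to exact mean balance with the treated group. Zubizarreta's stable balancing weights additionally allow approximate-balance tolerances, which this sketch omits, and the solver choice here is an illustrative assumption:

```python
import numpy as np
from scipy.optimize import minimize

def balancing_weights(X_c, target_means):
    """Weights for the control units that minimize the sum of squared
    weights (dispersion) subject to exact mean balance with the treated
    group, in the spirit of Zubizarreta (2015) -- a toy sketch."""
    n = len(X_c)
    cons = [
        {"type": "eq", "fun": lambda w: w.sum() - 1.0},             # weights sum to 1
        {"type": "eq", "fun": lambda w: X_c.T @ w - target_means},  # exact mean balance
    ]
    res = minimize(lambda w: (w ** 2).sum(),          # low dispersion -> low variance
                   np.full(n, 1.0 / n),               # start from uniform weights
                   bounds=[(0.0, 1.0)] * n, constraints=cons, method="SLSQP")
    return res.x
```

The weighted control group then mimics the treated group's covariate means by construction, without any propensity score model.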

5.4.
Open questions and problems that machine learning alone cannot handle. Even when adapted to treatment effect estimation, machine learning algorithms must be implemented with extreme caution when key issues surrounding the study design remain unresolved. For example:
• The sample under study is not representative of the population about which we need or want to draw conclusions.
• The measured potential confounders may not be sufficient: nothing in the data tells us that unconfoundedness holds, so causal effect estimation should be followed by a well-designed sensitivity analysis.
• Post-treatment variables (i.e., variables that can be affected by the treatment and are strong predictors of the outcome) must be excluded.
• A lack of overlap (Nethery et al., 2019; D'Amour et al., 2017) in the distribution of the estimated propensity scores for the treated and untreated demands that the algorithm extrapolate beyond what is observed.
• If interference is detected, causal estimands (direct and indirect, i.e., spillover, effects) need to be re-defined and different estimation strategies need to be implemented (Arpino and Mattei, 2016; Forastiere et al., 2016; Papadogeorgou et al., 2019; Tortú et al., 2020).
6. Treatment effect heterogeneity: is the treatment beneficial to everyone?
Suppose we found statistically significant evidence that a new drug prolonged life expectancy, on average, for the population under study. Should we encourage everyone, regardless of their age, income, or health conditions, to take this new drug? It is often highly desirable to characterize which subgroups of the population would benefit the most, or the least, from a treatment. These types of questions require us to analyze treatment effect heterogeneity based on pre-treatment variables.
There is an extensive literature on assessing the heterogeneity of causal effects based on estimating the conditional average treatment effect (CATE), defined as

\[
CATE(x) = E[\,Y(1) - Y(0) \mid X = x\,],
\]

where Y(1) | X = x and Y(0) | X = x are the potential outcomes in the subgroup of the population defined by X = x. Conditional on X = x, the CATE can be estimated under the same set of causal assumptions needed for estimating the ATE (Athey and Imbens, 2016). Recently, machine learning methods such as random forests, Bayesian Additive Regression Trees (BART) (Chipman et al., 2010), and forest-based algorithms (Foster et al., 2011; Hill, 2011) have been used to estimate the CATE, especially in the presence of a high-dimensional X. Although these methods can estimate the CATE accurately, they offer little guidance about which population subgroups drive the treatment effect heterogeneity: their parametrization of the covariate space is complicated and difficult to interpret, even by human experts. We define interpretability as the degree to which a human can understand the cause of a decision or consistently predict the results of the model (Miller, 2019; Kim et al., 2016). Decision rules fit well with this non-mathematical definition of interpretability. A decision rule consists of a set of conditions on the covariates that defines a subset of the feature space and corresponds to a specific subgroup. In our recent work (Lee et al., 2020), we propose a novel causal rule ensemble (CRE) method that ensures the interpretability of the causal rules while maintaining a high level of accuracy of the estimated treatment effects for each rule.
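For intuition, a simple "T-learner" sketch of CATE estimation fits one flexible regression per treatment arm and takes their difference. Random forests stand in here for any supervised learner; this is an illustrative baseline, not the CRE method of Lee et al. (2020):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def t_learner_cate(X, W, Y, X_new):
    """T-learner sketch: estimate mu_w(x) = E[Y | X = x, W = w] in each
    arm separately, then CATE(x) = mu_1(x) - mu_0(x)."""
    mu1 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[W == 1], Y[W == 1])
    mu0 = RandomForestRegressor(n_estimators=200, random_state=0).fit(X[W == 0], Y[W == 0])
    return mu1.predict(X_new) - mu0.predict(X_new)
```

The fitted surface estimates the effect for each covariate profile, but, as discussed above, it does not by itself yield interpretable subgroups.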
We argue that, in the context of treatment effect heterogeneity, we want to achieve at least two main goals: (1) discover de novo the rules (that is, interpretable population subgroups) that lead to heterogeneity of causal effects, and (2) make valid inferences about the CATE with respect to the newly discovered rules. Athey and Imbens (2016) introduced a clever approach to make valid inferences in this context: a sample-splitting approach that divides the total sample into two smaller samples, one used to discover a set of interpretable decision rules that could lead to treatment effect heterogeneity (the discovery sample), and one used to estimate the rule-specific treatment effects and the associated statistical uncertainty (the inference sample). This is a very active area of research, and one where the integration of machine learning and causal inference could provide important advances.
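A stripped-down sketch of the sample-splitting idea, assuming a randomized treatment with known propensity p. The discovery step below grows a tree on an IPW-transformed outcome rather than using Athey and Imbens' exact splitting criterion, so it is a simplification for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

def honest_subgroup_effects(X, W, Y, p=0.5, seed=0):
    """Honest splitting sketch: discover candidate subgroups on one half
    of the sample, estimate subgroup-specific effects on the other half."""
    idx_d, idx_e = train_test_split(np.arange(len(Y)), test_size=0.5, random_state=seed)
    # IPW-transformed outcome: E[y_star | X] equals the CATE when p is known.
    y_star = W * Y / p - (1 - W) * Y / (1 - p)
    tree = DecisionTreeRegressor(max_leaf_nodes=4, min_samples_leaf=50, random_state=seed)
    tree.fit(X[idx_d], y_star[idx_d])   # discovery sample: find subgroups
    leaves = tree.apply(X[idx_e])       # assign held-out units to subgroups
    effects = {}
    for leaf in np.unique(leaves):      # inference sample: per-subgroup estimates
        i = idx_e[leaves == leaf]
        effects[int(leaf)] = Y[i][W[i] == 1].mean() - Y[i][W[i] == 0].mean()
    return effects
```

Because the subgroup definitions never touch the inference sample, the subgroup-specific estimates and their standard errors retain their usual validity.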

7. Sensitivity analysis
Sensitivity analyses can be conducted to bound the magnitude of the causal effects as a function of the degree to which the assumptions are violated, in order to evaluate the robustness of causal conclusions to violations of unverifiable assumptions (Imbens and Rubin, 2015; Ding and VanderWeele, 2016; Rosenbaum, 1987b, 2002; Imbens, 2003). Software has been developed to perform sensitivity analyses that assess the robustness of conclusions to violations of unverifiable assumptions (Gang, 2004; Nannicini, 2007; Keele, 2014; Ridgeway et al., 2004).
One type of sensitivity analysis is the so-called placebo or negative control analysis (Imbens and Rubin, 2015).
Here the goal is to identify a variable that cannot be affected by the treatment. This variable is then used as an outcome (say Y′), and the causal effect of W on Y′ is estimated, even though we know that the true causal effect, say ∆, should be zero. If, after adjusting for all the measured confounders, we estimate a ∆ that is statistically significantly different from zero, then we can conclude that there is unmeasured confounding bias. For additional details see Schuemie et al. (2020).
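A minimal sketch of the placebo check via OLS adjustment (the linear specification, and the use of the raw coefficient rather than a formal test, are illustrative simplifications):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def placebo_test(X, W, Y_neg):
    """Estimate the 'effect' of W on a negative-control outcome Y_neg
    that the treatment cannot affect, adjusting for the measured
    confounders X via OLS. A coefficient on W clearly away from zero
    flags unmeasured confounding."""
    X = np.asarray(X).reshape(len(W), -1)
    Z = np.column_stack([W, X])       # treatment first, then confounders
    return LinearRegression().fit(Z, Y_neg).coef_[0]  # coefficient on W
```

In practice the point estimate would be paired with its standard error or confidence interval before declaring the check passed or failed.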
Sometimes we can check the quality of an observational control group by exploiting useful comparisons. For example, when we have access to two different pools of controls, we can use both to check the robustness of the results to the use of either one (Rosenbaum, 1987a). We can also assess the plausibility of unconfoundedness, and the performance of methods for causal effects in observational studies, by checking the extent to which we can reproduce experimental findings using observational data; see, for example, Dehejia and Wahba (2002). In addition, it is common to include as many covariates as possible to control for confounding bias; however, M-bias (Ding and Miratrix, 2015) is a problem that may arise when the correct estimation of treatment effects requires that certain variables not be adjusted for (i.e., that they simply be excluded from the model). The topic of over-adjustment is also broadly discussed in the causal graphical models literature (see Pearl, 1995; Shpitser et al., 2010; VanderWeele and Shpitser, 2011; Perkovic et al., 2017).

8. Discussion
This article has focused on the potential outcome framework as one of the many approaches to define and estimate the causal effects of a given intervention on an outcome. We have presented: • Thoughts regarding the central role of the study design when estimating causal effects (Table 1).
• Why machine learning is not a substitute for a thoughtful study design, and why it cannot overcome poor data quality, missing confounders, interference, or extrapolation.
• That machine learning algorithms can be very useful to estimate missing potential outcomes after issues related to study design have been resolved.
• Machine learning algorithms show great promise in discovering de novo subpopulations with heterogeneous causal effects.
• The importance of sensitivity analysis to assess how the conclusions are affected by deviations from identifying assumptions.
This review is focused on data science methods for prospective causal questions, that is, on assessing the causal effect of a particular, sometimes hypothetical, manipulation. This is a different goal than that of causal discovery which investigates the causal structure in the data, without starting with a specified causal model. To read more about causal discovery we refer to Glymour et al. (2014); Mooij et al. (2016); Spirtes and Zhang (2016).
We briefly outlined the Bayesian approach as one of the many statistical methods to estimate missing potential outcomes and the ATE. Alternative approaches, such as those proposed by Fisher and Neyman, are based on the randomization distribution of statistics induced by a classical randomized assignment mechanism (Neyman, 1990; Fisher, 1937) and on their sampling distribution. The key feature of these approaches is that the potential outcomes are treated as fixed but unknown, while the vector of treatment assignments, W, and the sampling indicators are the random variables. The concepts created by these methods (p-values, significance levels, unbiased estimation, confidence coverage) remain fundamental today (Rubin, 2010). However, we believe that Bayesian thinking provides a straightforward approach to summarize the current state of knowledge in complex situations, to make informed decisions about which interventions look most promising for future application, and to properly quantify the uncertainty around such decisions.
There are settings or instances where adjusting for measured covariates is not enough; that is, we cannot rule out dependence of the assignment mechanism on the potential outcomes. In these irregular settings, another important area of research relies on identification strategies that differ from strong ignorability, some of which are called quasi-experimental approaches. A natural experiment can be thought of as an observational study where treatment assignment, though not randomized, seems to resemble random assignment in that it is haphazard and not confounded by the typical attributes that determine treatment assignment in a particular empirical field (Zubizarreta et al., 2014b). An example is instrumental variable (IV) methods (Angrist et al., 1996), where a variable, the instrument, plays the role of a randomly assigned incentive for treatment receipt and can be used to estimate a treatment effect under a non-ignorable treatment assignment. For example, suppose we want to study the causal effect of an additional child on female labour supply; fertility decisions are typically endogenous and plausibly determined by observed and unobserved characteristics. Angrist and Evans (1998) used the sex of the first two children as an instrument for the decision to have a third child, and estimated the effect of having an additional child on a woman's employment status. Such strategies are very popular in socioeconomic applications. Other examples include regression discontinuity designs (Imbens and Lemieux, 2008; Li et al., 2015), synthetic controls (Abadie et al., 2010), and their combinations (Arkhangelsky et al., 2019). These designs typically focus on narrow effects (e.g., on compliers, or on units at a threshold) with high internal validity, which then need to be extrapolated to the population we are interested in.
These methods can also be improved using machine learning ideas (for making IV stronger, or using more lagged values in a nonparametric way) but they require subject-matter knowledge and a level of creativity that, at least now, machine learning does not have.
Another area of active research is how to extend or generalize results from an RCT to a larger population (Stuart et al., 2018). Hartman et al. (2015) spell out the assumptions required for this generalization (Pearl and Bareinboim, 2011) and propose an approach to estimate effects for the larger population.
There are other key areas of causal inference not covered in this article. These include mediation analysis (VanderWeele, 2015; Huber, 2020) and principal stratification (Frangakis and Rubin, 2002; Mealli and Mattei, 2012), both of which provide an understanding of causal mechanisms and causal pathways. We have also contributed to the literature in these areas (e.g., Mealli and Pacini, 2013; Mattei et al., 2013; Forastiere et al., 2016; Mealli et al., 2016; Baccini et al., 2017; Mattei et al., 2020). These methods attempt to address questions such as: why does having a dog reduce the severity of the symptoms of depression? Is it because dogs make you happier, or because having a dog means you are outside and exercising more, which helps depression, or because dogs help boost our immune systems? These questions are about exploring treatment effect heterogeneity with respect to post-treatment variables (e.g., spending more time outdoors as a consequence of walking the dog is a post-treatment variable). Questions regarding mediation require different tools from regression, matching, and machine learning prediction. In this new era of data science, where we have access to a deluge of data on pre-treatment variables, complex treatments, post-treatment variables, and outcomes, this area of causality will become more prominent.
Can we envision a future where all these steps characterizing the design and the analysis of observational data, together with the unavoidable subject-matter knowledge, are translated into metadata and automated? Perhaps. But this is a challenge, and there is still a lot of fascinating work ahead of us.