Statistical modeling: the three cultures

Two decades ago, Leo Breiman identified two cultures for statistical modeling. The data modeling culture (DMC) refers to practices aiming to conduct statistical inference on one or several quantities of interest. The algorithmic modeling culture (AMC) refers to practices defining a machine-learning (ML) procedure that generates accurate predictions about an event of interest. Breiman argued that statisticians should give more attention to AMC than to DMC because of the strengths of ML in adapting to data. While DMC has since lost some of its dominant role in statistics because of the data-science revolution, we observe that it is still the leading practice in the natural and social sciences. DMC is the modus operandi because of the influence of the established scientific method, the hypothetico-deductive scientific method. Despite the incompatibilities of AMC with this scientific method, the two cultures mix intensely among some research groups. We argue that this mixing has formed a fertile spawning pool for a mutated culture that we call the hybrid modeling culture (HMC), in which prediction and inference have fused into new procedures that reinforce one another. This article identifies key characteristics of HMC, thereby facilitating the scientific endeavor and fueling the evolution of statistical cultures toward better practices. By better, we mean increasingly reliable, valid, and efficient statistical practices in analyzing causal relationships. In combining inference and prediction, the result of HMC is that the distinction between prediction and inference, taken to its limit, melts away. We qualify our melting-away argument by describing three HMC practices, each capturing an aspect of the scientific cycle: ML for causal inference, ML for data acquisition, and ML for theory prediction.


Introduction
Explaining social action; predicting social action. Traditionally, social scientists distinguish between predictive and causal research (Boudon, 2005; Elwert, 2013; Lundberg et al., 2021; Marini & Singer, 1988; Merton, 1968; Morgan & Winship, 2014; Risi et al., 2019; Shmueli, 2010; Watts, 2014). A prediction is a statement about the extent to which one or several events (input), when they occur, supply information about another occurring event (output). This association between inputs and output may or may not have a causal explanation attached to it, that is, a mechanistic statement about why and how the input affects the output. A causal statement is a counterfactual statement about a difference between an outcome occurring with an event (a policy, treatment, or exposure) activated versus deactivated (fully defined in Section 4.1). 1 When this difference has a value other than zero, scholars take it as evidence that the exposure event causes the outcome.
Although many scholars agree that explaining social events requires causal statements, there is much less agreement on whether causal research also needs to be predictive (Boudon, 2005; Hofman et al., 2017; Keuschnigg et al., 2017; Marini & Singer, 1988; Merton, 1968; Shmueli, 2021; Verhagen, 2022). For example, Duncan Watts argues that "if [social scientists] want their explanations to be scientifically valid, they must evaluate them specifically on those grounds-in particular, by forcing them to make predictions" (2014, p. 313).
While we agree that there is a set of predominantly causal-research questions and likewise a set of predictive ones, our article shows that there is a third kind of research problem where causal and predictive statements form an intricate synergy. This synergy relies on further refinement of ML algorithms, and while a subset of this third kind of problem builds on Watts's argument-what we call ML for theory prediction (discussed in Section 4.3)-there are at least two additional subsets relevant for social science research: ML for causal inference (Section 4.1) and ML for data acquisition (Section 4.2). To see how this third kind of problem is possible and relevant for social scientists, we must maneuver through a similar debate in the statistical sciences. This debate is cultural, in the sense that it forms a perspective that shapes the design and purpose of statistical models (including ML algorithms).
Two decades ago, statistician Leo Breiman (2001a) identified two cultures for statistical modeling. The data modeling culture (DMC) refers roughly to practices aiming to conduct model validation, and thus, statistical inference on one or several quantities of interest: distributions, model parameters, and the like. In the context of the social sciences, such inferences often refer to defining a procedure that estimates a true quantity θ with an estimate θ̂, minimizing the error |θ − θ̂| (Freedman, 1991). This true quantity is assumed to exist independently of the statistical model. For example, in any particular year and poverty definition (e.g., dollar a day), U.S. poverty levels are assumed to have a true level θ, yet scholars can only approximate this level by θ̂, because they have to account for a variety of disturbances related to sampling and measurement. 2 Similar disturbances exist in causal inference, which can be viewed as a particular type of statistical inference (Imbens & Rubin, 2015; Morgan & Winship, 2014; Pearl, 2009). For example, social scientists estimating a policy (causal) effect, θ̂, of a new education program on school performance assume that this policy has a true effect, θ, yet this estimation is hampered by, among other things, the characteristics of students. A procedure is unbiased when the difference θ − E[θ̂] is zero (or negligible) over all possible realizations of the data.
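To make the bias definition concrete, the following minimal sketch simulates many sample realizations of a poverty survey and checks that the empirical average is an approximately unbiased estimator of θ. The true poverty rate, sample sizes, and number of realizations are illustrative assumptions, not real poverty data.

```python
import numpy as np

# DMC-style inference, sketched: a "true" poverty rate theta is assumed
# to exist independently of the model; each survey sample yields an
# estimate theta_hat, and the estimator is (approximately) unbiased when
# the average error over many realizations is near zero.
rng = np.random.default_rng(0)

theta = 0.12          # assumed true poverty rate (12%), purely illustrative
n = 2_000             # households per survey sample

estimates = []
for _ in range(1_000):                       # many possible realizations
    sample = rng.binomial(1, theta, size=n)  # 1 = household below the line
    estimates.append(sample.mean())          # theta_hat for this sample

bias = np.mean(estimates) - theta
print(f"mean(theta_hat) = {np.mean(estimates):.4f}, bias = {bias:+.4f}")
```

Across realizations, the estimates scatter around θ, and their mean error (the bias) is close to zero, which is what the unbiasedness definition above requires.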
Based on such statistical concepts, Breiman argued that DMC is the dominant mode of operation in statistics.
While 20 years later this culture has perhaps lost some of its dominant role in statistics because of the data science revolution, we observe that it is still the modus operandi-the leading practice of a group-in the social sciences and beyond (Goldthorpe, 2015). DMC is perhaps still the modus operandi because of the influence of the established scientific method for quantitative research in the social sciences, the hypothetico-deductive scientific method (Hempel, 1965; Popper, 2002). This scientific method consists of cycles of deductively formulating a hypothesis from substantive theory, testing this hypothesis in a model and against data, and then revising the theory based on empirical results (Costantini & Galavotti, 1986). Here, deductive has two meanings: first, from substantive theory, a scholar articulates a hypothesis about how two events X and Y are related; second, from this theory or by convention, a scholar stipulates a statistical model f (often parametric and linear) for how this hypothesis about X and Y will be tested against data (Abbott, 1988).
While testing a model against data is an act of induction, the whole procedure follows a deductive process. The scholar seeks to model the generative process of the data and understand the process relating X and Y.
Deductive reasoning uses universal propositions (hypotheses derived from general theories) to explain specific events. By explanation, we mean a theory that demonstrates how two or more events are mechanistically related (Goldthorpe, 2015). When enough evidence has been collected that challenges existing truths, an entire paradigm can fall in favor of a new one (Kuhn, 2012). The scientific method favors DMC over other modeling cultures because DMC supports causal reasoning.
The algorithmic modeling culture (AMC) refers to practices defining a procedure, f, that generates accurate predictions, Ŷ, about an event (outcome), Y (Breiman, 2001a). By accurate, we mean predictions that are as similar as possible to the true events that f has not yet encountered (Hastie et al., 2009). A procedure is an algorithm, or a function, that takes some input X, operates on this input, and then produces an output Ŷ. Often, this procedure is defined inductively (Costantini & Galavotti, 1986), that is, by letting the procedure learn from the patterns in the data, with little or no human guidance (Hastie et al., 2009).
After learning the function f from this training data, the procedure is used to make predictions on new data that was not present in the training data. The test error on this new data is a measure of how well the procedure generalizes. While working in AMC, scholars often care about the statistical interpretability of the procedure only insofar as it furthers their pursuit of accurate predictions (Lipton, 2017). Identifying causal relationships between X and Y is a peripheral question. This culture is the modus operandi of many strands in engineering, computer science, industry, and policy (Sanders, 2019). As AMC procedures do not align with the hypothetico-deductive scientific method, this culture lacks subscribers among causally oriented social science researchers (Freedman, 1991; Molina & Garip, 2019). 3 The key difference between DMC and AMC is that the former is process oriented (i.e., modeling the generative process of the data), while the latter is performance oriented (i.e., building an emulator that matches the predictive performance of a social system as closely as possible). In practice, this difference in orientation implies that DMC encourages simpler (linear) interpretable models with few parameters, while AMC typically uses large, complex (highly nonlinear) models for predictive accuracy, disregarding interpretability (Rudin, 2019).
Despite the (seeming) incompatibilities of AMC with the hypothetico-deductive scientific method, AMC and DMC mix intensely among some research groups. We argue that this mixing has formed a fertile spawning pool for a mutated culture: a hybrid modeling culture (HMC) where prediction and inference have fused into new procedures that reinforce one another. As Section 4 discusses, scholars use these procedures in the pursuit of explaining how two events are causally connected by blending Ŷ-prediction problems and θ̂-inference problems to the point that it is difficult to tell them apart (Athey & Imbens, 2019; Kino et al., 2021; Molina & Garip, 2019; Mullainathan & Spiess, 2017; Yarkoni & Westfall, 2017). One such procedure, which we discuss in Section 4.1, is the use of machine learning in the service of causal inference (Künzel et al., 2019). A fused procedure is still compatible with the hypothetico-deductive scientific method but stretches beyond it because it allows for a much larger portion of inductive reasoning (Nelson, 2020). Such reasoning infers generalized claims from particular observations. While this hybrid culture does not occupy the default mode of social science practices, we argue that it offers an intriguing novel path for the applied social sciences.
This article aims to identify key characteristics of what we have named HMC, thereby facilitating the scientific endeavor and fueling the evolution of statistical cultures in the social sciences toward better practices. By better, we mean increasingly valid, reliable, and reproducible practices in analyzing causal relationships (Lundberg et al., 2021; Morgan & Winship, 2014; Watts, 2014). By valid and reliable, we mean practices that lead to studies that 'systematically measure what they intend to measure' and 'consistently generalize to the target population of interest,' respectively. By increasingly reproducible, we mean practices that make studies less dependent on the architecture of deductive models (e.g., linear models) than is currently the case in DMC.
Even if there are several reasons for the reproducibility crisis in the sciences (Camerer et al., 2018), the use of insufficiently flexible models inadvertently squeezes complex social data into an unsuitable format. That unsuitability adversely affects the robustness of applied research.
We develop our account by argument by example, meaning that we selectively review trends in applied social science research as evidence of the existence of HMC. From these examples, we pinpoint the defining characteristics of HMC.
Before discussing HMC, we suggest two trends that nourish its emergence. First, to a considerable extent, HMC has emerged from applied computational research that closely interfaces with statistics and computer science-that is, data science (Efron & Hastie, 2016). This close interfacing in the social sciences is known as computational social science, which denotes any scientific study that develops or applies computational methods to typically large-scale and complex social and behavioral data (Keuschnigg et al., 2017; Lazer et al., 2020).
The stream of new data sources-administrative data, social media, digitalized corpora, and satellite images-explains the relevance of such a computational approach (Jordan & Mitchell, 2015). Similar computational approaches exist under the brands of digital humanities (Gold, 2012), computational psychology (Sun, 2008), computational economics (Tesfatsion & Judd, 2006), computational epidemiology (Marathe & Vullikanti, 2013; Salathé et al., 2012), and computational biology (Noble, 2002), to mention a few. All these computational approaches emerged by the beginning of the 21st century, and they provide clues to why DMC or AMC alone is insufficient to cover the new demands of the scientific endeavor: that is, to provide systematic explanations of events of reality, and thereby to deepen our knowledge of them (Bhaskar, 2008; Watts, 2014).
A second feeding ground for the evolution of HMC is the causal-inference revolution. While a randomized controlled trial (RCT) remains the safest way to rinse out the contaminating effect of confounding and the least assumption-demanding method to identify causality (Fisher, 1935; Morgan & Winship, 2014), scholars face ethical and practical limitations when applying an RCT to social settings (Deaton & Cartwright, 2016). For example, to estimate the causal impact of events such as economic crises (Elder, 1998), famines (Sen, 1981), or climate change (Arcaya et al., 2020; Hsiang, 2016) on children's well-being, scholars would need to administer such events to a treatment and a control group of children. These ethical and practical limitations of RCTs are two sources fueling the causal revolution that has resulted in a myriad of new approaches tailored to inferring causality from observational data (Angrist & Pischke, 2014; Hedström & Manzo, 2015; Hernan & Robins, 2020; Imai, 2018; Imbens & Rubin, 2015; King, 1998; Morgan & Winship, 2014; Pearl & Mackenzie, 2018; Peters et al., 2017; van der Laan & Rose, 2011; VanderWeele, 2015). As this revolution has evolved partly from computer science and partly from statistics and economics, scholars have creatively combined tools from DMC and AMC. Because HMC synthesizes the strengths of DMC and AMC, it has resulted in practices better adapted to 21st-century social science requirements than what either culture alone can offer.
The remainder of our argument is less concerned with why HMC has emerged and more with characterizing it.

The Data Modeling Culture (DMC)
Scholars in quantitative social sciences think and operate mainly through the hypothetico-deductive scientific method (Danermark et al., 2002;Kuhn, 2012;Watts, 2014). This method is a philosophy of science that defines how scientific inquiry should be conducted (Hempel, 1965;Popper, 2002). Using substantive theories, scholars articulate their causal and descriptive knowledge in falsifiable hypotheses and operationalize them in data.
Thus, this method is 'hypothetico,' implying that hypotheses are postulated by specifying a statistical model representing how the data (events) are generated. These models represent scholars' best guess of the data-generating process of the phenomena studied (elections, poverty, inequality, criminality, etc.). As statistical models are postulated, the scientific method is deductive, because it aims to test whether the assumed data-generating process of the model matches the sampled data (King, 1998). Based on that model and the sampled data, scholars evaluate how much support the null hypothesis-for example, that two events A and B are unrelated-receives in the data. When the null receives support, the sample statistic coincides with the postulated model. Intuitively, that is what a high p-value means: under the null, we would be likely to encounter a sample statistic at least as extreme as the one observed. We are willing to reject the null hypothesis if the sample value we encountered is unlikely to occur under the null, and we then favor the alternative hypothesis-that A and B are related.
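This null-hypothesis logic can be illustrated with a permutation test on two simulated binary events A and B. The effect size, sample size, and number of permutations are assumptions chosen for the example, not part of the original argument.

```python
import numpy as np

# Hypothesis testing in the DMC spirit, sketched: under the null that
# A and B are unrelated, how likely is a sample association at least
# as extreme as the one observed? All data here are simulated.
rng = np.random.default_rng(2)

n = 500
a = rng.binomial(1, 0.5, size=n)
b = (a * 0.2 + rng.uniform(size=n) > 0.55).astype(int)  # B weakly depends on A

observed = abs(np.corrcoef(a, b)[0, 1])

# Permutation null: shuffling A breaks any A-B dependence, so the
# shuffled correlations approximate the null distribution.
null = np.array([
    abs(np.corrcoef(rng.permutation(a), b)[0, 1]) for _ in range(2_000)
])
p_value = (null >= observed).mean()
print(f"observed |r| = {observed:.3f}, p = {p_value:.4f}")
```

A small p-value means the observed association would be rare if A and B were unrelated, so the null is rejected in favor of the alternative that A and B are related.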
Despite suffering from paradoxes (e.g., the raven paradox, irrelevant conjunctions) (Huber, 2022; Schultz, 2018) and other problems (e.g., the problem of underdetermination) (Crupi & Tentori, 2010), the hypothetico-deductive method offers quantitative applied research a way to reason about statistical inference. Although we remain agnostic about whether scholars should rely on the hypothetico-deductive method for conducting inference, scholars require some comparable framework for producing uncertainty estimates for inference.
Scholars revise their theories (knowledge) based on that inference, sample more data, and refine their statistical models. And so, the cycle of science continues. The requirement of testing substantive theories through an interpretable statistical model is one of the appeals of DMC and a reason for its endorsement (Breiman, 2001a).
Throughout cycles of knowledge production, deduction and induction in statistical analysis form complementary scientific phases, yet they represent two fundamentally different ways of producing social-scientific knowledge based on statistical data. While Carnap (1962) is one of the leading proponents of inductivism, Popper (2002) defends deductivism. The opposition is rooted in the difference between confirmation (induction) and falsification (deduction). In the language of statistics, confirmation is about methods of estimation, and falsification is about tests of significance. These procedures-estimation and testing-operate at different stages of knowledge production, yet jointly capture vital aspects of the scientific endeavor.
As previously mentioned, because of DMC's deductive flavor, its modeling culture relies mainly on the context of falsification; conversely, AMC's inductive preferences lead to a statistical culture favoring the context of estimation. While AMC is concerned with prediction, it relies on estimation for prediction, not for the interpretation of model parameters.
To compare DMC and AMC-and characterize HMC-we define the following terminology. Scholars formulate, test, and develop social theories, T, about a causal (social) system. A causal system is a set of events and relationships between events in a domain of interest. 4 Because elements of the causal system rarely reveal themselves directly to the human senses, scholars theorize about the existence of events and their causal relationships (Bhaskar, 2008; Hedström, 2005). By theory, T, we mean a set of concepts that enable formulating descriptions, predictions, hypotheses, or explanations about events populating a causal system (Swedberg, 2017). Each T maps into a directed acyclic graph (DAG), G, that formalizes and visualizes a potential manifestation of the causal system of interest (Pearl, 2009). 5 While two or more theories often compete in proposing the best explanation-meaning how well they account for the mechanisms generating observed data-they do not have to be mutually exclusive. Scholars can formulate theories at various abstraction levels, but to test them empirically, theories need to match what is measurable. Thus, there is a social-scientific preference in quantitative research to engage with middle-range theories rather than grand theories (Merton, 1968).
Poverty and famine research serves as an illustration. Figure 1 shows the progression of knowledge under DMC with three stylized DAGs, G_1, G_2, and G_3, competing to explain famines. A debate raged between Malthusians and Senians on whether food scarcity is a necessary event to cause famines (Devereux, 2007; Sen, 1981)-a debate that still influences human ecology, sustainability, and adjacent research (Daoud, 2018). Three centuries ago, Thomas Malthus argued that while population size increases geometrically, food supply increases arithmetically (Malthus, 1826). Because population size will outstrip food supply, famines will eventually emerge to balance their relationship. If we assume linear relationships among these mechanisms, the effect of scarcity on the probability of a famine arising can be captured by a parameter α; the effect of food supply on the probability of scarcity can be encoded by a parameter β; and the effect of population size on the probability of scarcity can be quantified by a parameter γ. Based on Malthus's theory and these parameters, we can specify a DAG, G_1, that explains why a famine arises, as shown in Figure 1. Amartya Sen challenged this explanation by showing that famines-at least in the modern era-can arise even when there is sufficient or abundant food (Sen, 1981). Especially when social inequality is high, vulnerable groups run a higher risk of unemployment than other groups. Unemployment causes a loss of income and of individuals' capability to purchase food. This loss of capability-what Sen named entitlement failure-results in starvation, an effect captured by a parameter δ. In Sen's theory, the causal system of G_2 is a better explanation of how famines arise. Although Sen acknowledged that population size and food supply shortage can cause famines, such shortage is "one of many possible causes" during the last century (Sen, 1981, p. 1).
Thus, Sen's theory argues that population size and food supply have roughly no effect on the probability of famines; the corresponding parameters are approximately zero. Subsequent theoretical development uses both theories to offer an even more robust approach to explaining events of famines (Daoud, 2017). A critical conceptual move is to disentangle societal- and individual-level starvation, yielding additional parameters. As the third DAG, G_3, shows, while Malthus's theory explains when famines are likely to arise at the societal level, it cannot explain which individuals might starve to death. Sen's theory identifies these individuals by their failing food entitlements (Reddy & Daoud, 2020). Additionally, the entitlements causal path in G_3 shows that famines can arise even if there is no societal scarcity.
As exemplified by this progress in poverty and famine research, DMC practices encourage social scientists to pursue ever deeper knowledge production. Scholars aim to quantify, describe, and evaluate key events and their relationships. A description is a statistical inference about aspects of the distribution of one or more random variables, each represented as a node in a DAG. A random variable encodes a distribution over the event of interest. For example, let FAMINE be a binary random variable indicating whether an event of famine occurs in a well-defined time (e.g., in the years 1900 to 2000) and space (e.g., Europe, Asia, or Africa).
Then, the expectation E[FAMINE] captures the average occurrence of famines in that time and space. In an applied setting, this average is estimated by the empirical average, that is, the number of famines that occurred divided by the maximum number of famines that could have occurred. As FAMINE is a binary random variable, the quantity E[FAMINE] equals the probability p(FAMINE = 1). Small caps denote variables in a DAG. A description of two or more random variables refers to associations between several events. Associations (correlations) may or may not be causal-specific assumptions need to hold for identifying causal associations (Hernan & Robins, 2020; Imbens & Rubin, 2015; Pearl, 2009).
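The empirical-average computation described above can be written out as follows. The famine probability and the country-year indicators are simulated for illustration only, not drawn from real famine records.

```python
import numpy as np

# For a binary random variable FAMINE, the empirical average of observed
# outcomes estimates the expectation E[FAMINE] = p(FAMINE = 1).
rng = np.random.default_rng(3)

p_true = 0.07                                   # assumed true famine probability
famine = rng.binomial(1, p_true, size=10_000)   # simulated country-year indicators

# famines that occurred / maximum number that could have occurred
p_hat = famine.sum() / famine.size
print(f"estimated p(FAMINE = 1) = {p_hat:.4f}")
```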
When randomization of treatment (or exposure) is infeasible, scholars have to rely on causal assumptions about observational data (Rosenbaum, 2020). Besides methodological issues, scientific debates often refer to the content of a causal system supplied by such data. For example, scholars debate what variables populate such a system and how they affect each other. This content also defines the conditions under which empirical associations may be interpreted as causal or confounded. As Section 4.1 discusses, a confounded association is an association between two variables X and Y that is partly determined by a common cause, Z. If a supposedly causal association is confounded, then that association between X and Y is biased (Hernan & Robins, 2020). A causal association is the portion of the total association that arises entirely from one variable (e.g., X) affecting the values of another variable (e.g., Y). To statistically capture causal associations, scientists analyze the joint distribution of the variables of interest, that is, how this joint distribution factorizes, and its underlying structural causal models. A factorization defines the causal order among events. For example, the joint distribution of the famine variables has many potential factorizations, depending on substantive theory. Malthus's theory, as defined in its DAG, stipulates the following factorization of how famines arise,

p(FAMINE, SCARCITY, POPULATION, FOODSUPPLY) = p(FAMINE|SCARCITY) p(SCARCITY|POPULATION, FOODSUPPLY) p(POPULATION) p(FOODSUPPLY)

whereas Sen's (1981) declares that the following factorization is the best approximation of how famines arise,

p(FAMINE, SCARCITY, POPULATION, FOODSUPPLY, ENTITLEMENTS) = p(FAMINE|SCARCITY) p(SCARCITY|ENTITLEMENTS) p(ENTITLEMENTS) p(FOODSUPPLY) p(POPULATION)

These factorizations follow the order of nodes and arrows in Figure 1. The scientific and policy differences are large, depending on whether Malthus's or Sen's factorization is the best representation of reality (Daoud, 2017).
If Malthus's G_1 best represents the causes and effects of famines, then policymakers should produce more food or contain population growth to counter famines; conversely, if Sen's G_2 is a better representation, then policymakers should follow this theory stipulating that they ought to reduce social inequality to reduce the probability of famines.
While a factorization defines the conditional dependencies in a causal system, a structural causal model (SCM) moves one step further in specificity by defining the direction of causality. That direction is defined by encoding the functional relationships among all the variables in that system. A functional relationship is a model f that specifies a one-directional mapping between an outcome (effect) and input (cause) variables. That model can be of any functional class, parametric or nonparametric, and reflects how a theory quantifies a causal system. For example, based on Malthus's theory, we can define the following SCM, where the models f_i are nonparametric with noise terms ε_i, and the index i runs over the steps in the SCM,

POPULATION := f_1(ε_1)
FOODSUPPLY := f_2(ε_2)
SCARCITY := f_3(POPULATION, FOODSUPPLY, ε_3)
FAMINE := f_4(SCARCITY, ε_4)

In DMC, to estimate the causes and effects of famines, scholars collect famine data and test their stipulated models. The model that fits the sample best receives scientific support. Scientific debates tend to amplify when different samples yield support for different models. Scholars evaluate support for a statistical model by interpreting how different factorizations match the sample. While scholars can use many different models, DMC-influenced scholars often use linear models to retain interpretability (Lipton, 2017). A linear model of the famine equation could then have the following stylized statistical form, FAMINE = α · SCARCITY + ε, where ε is a noise term. 6 In this example, a hypothesis operationalizes an aspect of the theory, for example, that α is different from zero, and DMC-operating scholars use that linear model to test this hypothesis. Generally, a hypothesis operationalizes an aspect of a DAG, stipulating the existence of a causal relationship (edges) between random variables (nodes). Scholars use interpretable statistical models so that they can imprint their hypothesis into these models. While linear models tend to simplify social reality too much (Abbott, 1988), they are popular because they make this imprinting straightforward (Rudin, 2019).
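A minimal simulation of Malthus's stylized SCM, assuming simple linear functional forms and illustrative parameter values (they are not estimates from famine data, and the famine outcome is treated as a continuous risk index for simplicity), shows how a DMC-style linear fit recovers a stipulated effect:

```python
import numpy as np

# Sketch of the Malthusian SCM: exogenous POPULATION and FOODSUPPLY,
# SCARCITY determined by both, and FAMINE determined by SCARCITY.
# All functional forms and coefficients are illustrative assumptions.
rng = np.random.default_rng(4)
n = 5_000

population = rng.normal(0, 1, size=n)   # standardized population pressure
foodsupply = rng.normal(0, 1, size=n)   # standardized food supply

# SCARCITY := f3(POPULATION, FOODSUPPLY, eps3)
scarcity = 0.8 * population - 0.8 * foodsupply + rng.normal(0, 1, size=n)

# FAMINE := f4(SCARCITY, eps4), with alpha the effect of scarcity
alpha = 0.5
famine = alpha * scarcity + rng.normal(0, 1, size=n)

# DMC-style test of the hypothesis that alpha differs from zero,
# via a least-squares linear fit of FAMINE on SCARCITY:
alpha_hat = np.polyfit(scarcity, famine, deg=1)[0]
print(f"alpha = {alpha}, alpha_hat = {alpha_hat:.3f}")
```

Because the fitted model matches the true (linear) generative process here, the estimate lands close to the stipulated α; the DMC worry discussed above is precisely what happens when the postulated model and the generative process diverge.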

The goal of testing social theories through interpretable statistical models explains why a DMC-operating scholar shies away from a kitchen-sink method-that is, one that throws all the predictors one can find into an algorithm, and then lets the algorithm regress the outcome of interest on them. As previously defined, ML is the subdiscipline of computer science that studies how algorithms can learn from data (Efron & Hastie, 2016; Hastie et al., 2009). As many ML algorithms are nonparametric, it is unclear how to unpack and interpret them in a direct manner, as is commonly done for linear (parametric) models (Lipton, 2017). Without a clear theoretical rationale for using ML, DMC scholars see little value in such algorithms in the scientific process (Breiman, 2001a).

The Algorithmic Modeling Culture (AMC)
Machine learning algorithms lie at the heart of many AMC practices. A key assumption of AMC is that a system produces an association between a given set of inputs, X, and a particular output, Y. The relationship between X and Y may or may not be causal. The overarching goal is to develop a model that operates on these inputs, producing the best possible predictions, Ŷ, of outputs that have not been observed yet. Figure 2 shows a stylized graphical representation of the system of associations among inputs and outputs. This graph does not represent a causal relationship because some edges are undirected (those between all the Xs), and thus, we call it a Bayesian network (Pearl, 2009). All DAGs are Bayesian networks, but not all Bayesian networks are DAGs. 8 In a Bayesian network, an undirected edge denotes an association between two nodes with a noncausal interpretation. In Figure 2, all inputs are connected because the association is assumed to flow in all directions. If we were to use a linear model to represent the fully connected graph, even then we would end up with many parameters to estimate, thereby hampering interpretability: first, there are association terms between Y and each input; second, there are correlation parameters among all pairs of inputs. Because scholars evaluate predictive performance by comparing predictions Ŷ with outcomes Y on a held-out set, they pursue interpreting these associations only as a subordinate priority, if at all (Doshi-Velez & Kim, 2017). Interpretable artificial intelligence (AI) seeks to clarify the decisions suggested by predictive models by moving one step beyond explainable AI (Rudin, 2019). While explainable AI focuses on trying to explain the inner mechanics of black-box models (e.g., through activation maps in image-processing models), interpretable AI focuses on creating models that produce interpretable results in the first place.
These models are critical in high-stakes decision-making situations, such as deciding when and for whom health care should be prioritized, who should be detained or bailed in criminal justice, or which neighborhoods should be selected for public policy interventions. 10 However, under the spell of AMC, the goal of explainable or interpretable AI is seldom set to identifying cause and effect. As the main goal of AMC is not to develop causal knowledge about a system, AMC demotes causal reasoning (Pearl & Mackenzie, 2018). Although much remains to be proven before the same algorithm-strong artificial intelligence-can roam across all these domains, AMC innovations are noteworthy because scholars have developed each algorithm without explicitly knowing the causal connections among events (Domingos, 2015). Machine learning models learned the relevant associations from data, with little supervision.
The advancements of AMC resonate with Karl Pearson's idea that statistical correlation-predictability-between X and Y is what scholars should search for to advance science. For Pearson, correlation is causation.
He argued,

Take any two measurable classes of things in the universe of perceptions, physical, organic, social or economic, and it is such a dot or scatter diagram, which we reach with extended observations. In some cases the dots are scattered all over the paper, there is no association of A and B; in other cases there is a broad belt, there is only moderate relationship; then the dots narrow down to a "comet's tail," and we have close association. Yet the whole series of diagrams is continuous; nowhere can you draw a distinction and say here correlation ceases and causation begins. Causation is solely the conceptual limit to correlation, when the band gets so attenuated, that it looks like a curve. (Pearson, 1911, p. 170)

Although scholars have produced noteworthy innovations in AI using the principles of AMC, we argue that the absence of causal reasoning is a major limitation for the advancement of scientific knowledge in the applied social sciences and related scientific domains (Darwiche, 2017; Pearl & Mackenzie, 2018). Even if DMC suits the scientific endeavor better, it suffers from at least two limitations. First, because of the suspicion toward AMC-style predictions, DMC scholars tend to rely on analog methods to collect information about causal systems (Salganik, 2017). Surveys, experiments, and interviews are examples of analog methods, as they give full (human) control over the data collection process. However, in the digital age, these analog approaches limit the speed and type of data that can be collected and used for the scientific endeavor. Second, in agreement with Breiman (2001a), we argue that a more severe limitation of DMC is that it relies mainly on model validation. A scholar formulates a statistical model and then tests that model against data. Using various goodness-of-fit metrics, this scholar then draws a set of conclusions.
Yet, these conclusions often say more about the assumed model's structure and less about the causal system of interest. If the statistical model is a poor representation of this causal system (for example, if the relation between X and Y is not linear), these conclusions may be misleading or nonreproducible (Breiman, 2001a).

A Hybrid Statistical-Modeling Culture: A Unifying Framework Fueling the Evolution of Scientific Practices
By upholding disciplinary traditions, university departments also inadvertently create cultural silos where DMC and AMC practices dwell (Lazer et al., 2020; Peters et al., 2017; van der Laan & Rose, 2011). In combining inference and prediction, the result of HMC is that the distinction between prediction and inference, taken to its limit, melts away. We discuss our melting-away argument by describing three HMC practices, where each practice captures an aspect of the scientific cycle. Table 2 shows an overview of what these three practices constitute: ML for causal inference, ML for data acquisition, and ML for theory prediction. Although these three goals exist partly in DMC (i.e., parametric inference) and AMC (i.e., ML-style prediction), HMC fulfills them by blending inferential and predictive thinking. The three HMC practices that we discuss in the next section combine ML prediction and inference to such a high degree that neither DMC nor AMC can comfortably host them. For example, while HMC inherits the goal of causal inference from DMC, there is no DMC equivalent for letting an algorithm discover how a DAG should be specified from data alone, with little or no guidance from substantive theory. Training such algorithms is an HMC-specific problem, called causal discovery (Glymour et al., 2019; Peters et al., 2017). As discussed in the next section, under a set of assumptions, causal-discovery algorithms can recreate a social system by suggesting potential DAGs from observational data. These algorithms are likely useful when social theory is weak and scholars therefore aim to generate hypotheses inductively.

Table 2. Central practices of the hybrid modeling culture (HMC).
Note. The Exemplifying question row merely highlights one research question among the myriad questions that can be formulated.

ML for Causal Inference
Before describing how ML aids in inferring causality in HMC (column one in Table 2), we will refine our definition of what we mean by causal inference. Table 3 illustrates an observed data matrix of four individuals with fictitious variable values.

Table 3. A toy data set illustrating the fundamental problem of causal inference.

Note. All the numbers provided in the cells are fictitious. They are generated to exemplify what a dataset could look like, and to show that there will always be unobserved potential outcomes, signified by the question marks.
We define the causal effect of a (binary) treatment variable, T, on an outcome, Y, in terms of potential outcomes. Instead of merely recording each individual's outcome as observed in the data, Y_i, we assume that each individual i has two potential outcomes (Imbens & Rubin, 2015). One potential outcome represents the outcome when the individual takes the treatment (that is, T_i = 1), denoted Y_i(1), and one where they do not take it, Y_i(0). The causal effect, τ_i, for each individual i is then the difference between these two potential outcomes: τ_i = Y_i(1) − Y_i(0). 14 If we could observe both potential outcomes, we could then directly compute τ_i and thus identify individual-level causal effects. However, the observed outcome, as supplied by the data, is a function of both the treatment and the two potential outcomes: Y_i = T_i Y_i(1) + (1 − T_i) Y_i(0). This function shows that the observed data, exemplified in Table 3, reveal only one of these two potential outcomes, yet both are required to identify causal effects. This impossibility of observing both potential outcomes is known as the fundamental problem of causal inference. Much of causal-method development pertains to reasoning about identifiability and defining procedures for calculating causal effects from observational data (Hernan & Robins, 2020; Imbens & Rubin, 2015; Pearl, 2009; Peters et al., 2017). Identifiability means articulating a set of assumptions that allow a model to calculate a causal effect from observed data.
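The fundamental problem can be made concrete with a short simulation. The numbers below are illustrative stand-ins of our own (not the values from Table 3); because we simulate both potential outcomes, we can compute each τ_i, something real data never allow:

```python
# Illustrative potential outcomes for four hypothetical individuals.
# In a simulation we know both Y_i(1) and Y_i(0); real data reveal only one.
y1 = [7, 5, 9, 4]  # potential outcomes under treatment, Y_i(1)
y0 = [6, 5, 4, 4]  # potential outcomes under control,   Y_i(0)
t = [1, 0, 1, 0]   # treatment assignment, T_i

# individual-level causal effects, tau_i = Y_i(1) - Y_i(0)
tau = [a - b for a, b in zip(y1, y0)]

# the observed outcome follows Y_i = T_i*Y_i(1) + (1 - T_i)*Y_i(0),
# so each row reveals exactly one of the two potential outcomes
y_obs = [ti * a + (1 - ti) * b for ti, a, b in zip(t, y1, y0)]

print(tau)    # [1, 0, 5, 0]
print(y_obs)  # [7, 5, 9, 4]
```

Blanking the unobserved entries (the question marks in Table 3) makes `tau` incalculable without further assumptions, which is precisely why identifiability assumptions are needed.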

ML imputes potential outcomes
In the first combination, scholars use ML to impute (predict) potential outcomes, Y_i(1) and Y_i(0).
As observed data only reveal half of the potential outcomes, scholars regard the other half of the data as a missing data problem (Imbens & Rubin, 2015). One way of handling this fundamental problem is to identify conditions for imputing these data to populate all the Y_i(1) and Y_i(0) cells, based on the similarity of covariates, X. These imputation procedures rely on common identifiability assumptions. One such central assumption is conditional independence (also known as conditional ignorability and conditional exchangeability): (Y(1), Y(0)) ⊥ T | X. This mathematical statement means that the treatment is as-if randomly assigned, conditional on one or more covariates.
Because ML excels at prediction tasks compared with commonly used parametric models, HMC-influenced scholars have developed many different procedures to predict (impute) potential outcomes (Künzel et al., 2018). For example, the T-learner ('T' stands for 'two') defines one ML algorithm, μ̂1, trained on the treated group and another, μ̂0, trained on the control group. Depending on the scientific problem, the scholar defines the type of algorithm: a lasso, a neural network, a random forest, or a collection of algorithms (an ensemble). After training, μ̂1 imputes potential outcomes under treatment for the control group and μ̂0 imputes potential outcomes under control for the treated group. Based on the toy data in Table 3, μ̂1 trains on the variables of Jane and John and imputes Y(1) for Joe and Jan; likewise, μ̂0 trains on Joe and Jan and imputes Y(0) for Jane and John. Then, each individual-level effect is obtained by taking the difference Y_i(1) − μ̂0(x_i) for the treated group and μ̂1(x_i) − Y_i(0) for the control group. To calculate the average causal effect, this procedure culminates by taking the weighted average over these individual-level effects in the treated and control groups, respectively, τ̂ = p τ̂_1 + (1 − p) τ̂_0, where p is the proportion of treated individuals.
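A minimal T-learner sketch, under assumptions of our own making (simulated data with a constant true effect of +2, and 1-nearest-neighbour regressors standing in for the lasso, forest, or ensemble learners the text mentions):

```python
import random

def nn1(train_x, train_y, x0):
    """1-nearest-neighbour regression: return y of the closest training point."""
    i = min(range(len(train_x)), key=lambda j: abs(train_x[j] - x0))
    return train_y[i]

random.seed(1)
n = 200
x = [random.uniform(0, 1) for _ in range(n)]   # covariate
t = [random.randint(0, 1) for _ in range(n)]   # treatment indicator
# assumed data-generating process: constant treatment effect of +2
y = [1.0 + xi + 2.0 * ti + random.gauss(0, 0.1) for xi, ti in zip(x, t)]

treated = [i for i in range(n) if t[i] == 1]
control = [i for i in range(n) if t[i] == 0]

# the two 'T' learners: mu1 trained on the treated, mu0 on the controls
mu1 = lambda x0: nn1([x[i] for i in treated], [y[i] for i in treated], x0)
mu0 = lambda x0: nn1([x[i] for i in control], [y[i] for i in control], x0)

# impute both potential outcomes for every individual, then average
tau_hat = [mu1(xi) - mu0(xi) for xi in x]
ate_hat = sum(tau_hat) / n   # should land near the true effect of 2
```

This sketch uses the common variant that imputes μ̂1(x) − μ̂0(x) for every unit; substituting the observed outcome for the factual arm, as in the text, gives the same estimand.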
The T-learner algorithm is one of several causal-estimation methods, but common to most of these ML algorithms is the procedure of imputing potential outcomes (Künzel et al., 2018) or imputing the treatment effect directly (Nie & Wager, 2017). Consequently, in these sorts of HMC practices, the prediction (ŷ) problems of AMC subsume the inference (β) problems of DMC.
Imputing potential outcomes enables flexible modeling. As ML algorithms are flexible, they can automatically find the best functional form instead of relying on a scholar to select a model. As discussed in Section 3, a scholar under the influence of DMC would most likely articulate a set of assumptions for when the causal effect is identified (e.g., in a DAG) and specify a linear model to estimate this effect. For example, in our famine example depicted in Figure 1, a Malthusian scholar would argue that scarcity (of food) causes famine. This scholar may show the appropriateness of their assumption in the DAG, Scarcity → Famine, and proceed to specify the following stylized statistical model, Famine_j = α + β Scarcity_j + ε_j, where j indexes famine events. 15 This model imprints the causal effect in the parameter β. If the true relationship between famines and scarcity is linear, this statistical model will capture the desired causal effect by extrapolating between famine cases where scarcity was observed and where it was not. However, in most scientific domains, a scholar's preference for a linear model reflects the desire to interpret a statistical model readily and transparently rather than any knowledge that the complexities of reality are truly linear (Abbott, 1988; Lipton, 2017). Although linear models can capture nonlinearities via a variety of transformations, to model a nonlinear reality, ML for causal inference offers a more robust alternative by approximating the best functional form (van der Laan & Rose, 2018).
If a famine scholar followed the statistical practices of HMC instead of DMC, this scholar would formulate the same causal goal (estimand) in the shape of a β-problem, and then use an ML algorithm to estimate the causal effect by imputing potential outcomes (Künzel et al., 2018; Lundberg et al., 2021). Using the same DAG, Scarcity → Famine, the scholar would, for example, use a T-learner to impute the probability of famine in cases where scarcity is present and where scarcity is absent. Although both β and τ quantify the same estimated causal effect, their key difference resides in that β often refers mainly to a parametric modeling setting, whereas τ refers to any (nonparametric) setting, a statistical-modeling nomenclature. To calculate τ, the scholar benefits from the algorithmic power of AMC, originally tailored for ŷ-problems but now recalibrated for β-problems. By predicting (imputing) potential outcomes, the scholar benefits from flexible models, but as a side effect of this statistical practice of combining AMC-style prediction with causal inference, the original distinction between ŷ and β has dissipated.
Predicting potential outcomes constitutes one necessary step in calculating counterfactuals, and thus, individual-level causal effects (Pearl, 2009). The definitions of potential outcomes and counterfactuals are closely related, but they refer to different scenarios. Potential outcomes refer to a scenario where the treatment assignment has not been made yet; thus, before the treatment has been assigned, an individual has two potential outcomes, Y_i(1) and Y_i(0). Counterfactuals refer to a scenario where the treatment has been assigned, but the scholar imagines what the outcome would have been had the treatment assignment been different. A counterfactual exists after the treatment has been assigned. For example, if an individual was assigned the treatment, and therefore his or her factual outcome equals the potential outcome under treatment, Y_i = Y_i(1), then this individual's counterfactual outcome is Y_i(0). Thus, counterfactuals enable retrospective reasoning of the form, 'What if I had acted differently, would the result have turned out the same?' (Pearl, 2019).
To calculate the value of a counterfactual, scholars must make assumptions about the noise (error) variables, U, in a causal system (DAG) (Pearl, 2009). These variables represent any exogenous events (occurrences that are only indirectly relevant to the causes and effects of a DAG) that induce variations across individuals; thus, when this noise is known, it uniquely determines everyone's values in the data. These variations represent all factors that are particular to each individual, yet they are not necessary to the DAG, and thus they are not always explicitly specified. For example, although an individual's genetics calibrate physiology and thus nutritional intake, in famine situations, genetics do not directly add to explaining famine outcomes. Although in large samples these variations cancel each other out when calculating the average treatment effect, τ, they are key to calculating the individual-level treatment effect, τ_i.
As the context and these variations jointly determine the exact conditions under which the individual took a treatment versus did not take it, we need to know the structural causal model of the causal system and the distributions of its variables and noise terms, U, to calculate τ_i. A scholar can only gather these quantities when the DAG and its structural causal relationships are known (Pearl, 2009). When they are known, counterfactuals enable probability expressions such as P(Y_{x'} = y' | X = x, Y = y), standing for 'the probability of observing the potential outcome y' had the exposure taken the value x', given that we actually observed the outcome y with exposure x.' For example, in our famine case, this probability can refer to a specific Bengali farmer: would the farmer have survived had the Bengali government distributed food coupons (entitlements to food) to farmers, given that this farmer actually starved to death and did not receive coupons (Daoud, 2017)? Although counterfactuals necessarily rely on stronger assumptions than calculating average effects, they present an exciting path for applied domains. Whereas the average treatment effect focuses on the aggregated effect of an exposure on a population, effect-heterogeneity analysis focuses on more granular effects: the group-specific effects disaggregated by subpopulations (Shiba, Daoud, Hikichi, et al., 2022; Shiba, Daoud, Kino, et al., 2022). For example, although a famine or an economic crisis is likely to affect an entire country adversely, some combination of socioeconomic factors may protect certain groups better than others (Daoud & Johansson, 2020). In DMC, scholars tend to capture such effect heterogeneity with the help of interaction models. Because these models are parametric, they often need a scholar to specify the product terms explicitly. If scholars hypothesized that ethnicity moderated the effect of entitlements in explaining famines, then they would deductively specify the following model: Famine_j = α + β1 Entitlement_j + β2 Ethnicity_j + β3 (Entitlement_j × Ethnicity_j) + ε_j. If the parameter β3 is statistically significant, then that is evidence for treatment heterogeneity.
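The counterfactual recipe described above (recover the individual's noise U from the observed data, then replay the system under a different treatment) can be sketched for an assumed toy structural causal model, Y := 2T + U; both the functional form and the numbers are hypothetical:

```python
# Assumed structural causal model: Y := 2*T + U, with exogenous noise U.
def counterfactual(y_obs, t_obs, t_new):
    u = y_obs - 2 * t_obs   # abduction: recover this individual's noise U
    return 2 * t_new + u    # action + prediction: set T to t_new, reuse U

# An individual observed untreated (t=0) with outcome y=1; what if t had been 1?
y_cf = counterfactual(y_obs=1, t_obs=0, t_new=1)
print(y_cf)  # 3
```

The individual-specific U is what distinguishes this retrospective, individual-level question from an average effect, which never requires recovering U.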
However, as shown in our T-learner example, an ML model for treatment heterogeneity does not require such explicit specification. In an ML model for causal inference, the effect heterogeneity for our famine example is defined as the conditional average treatment effect (CATE), τ(x) = E[Y(1) − Y(0) | X = x]. While a parametric model tests a specific parametric interaction, this ML model searches over the joint conditional distribution for group-specific causal effects, where these groups are defined by the covariates X. Again, because of the flexibility of ML models, not only do HMC practices capture the average treatment effect more robustly, but these models also find effect heterogeneity automatically.
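A sketch of how CATE surfaces effect heterogeneity without a prespecified product term. We simulate a randomized treatment whose true effect differs by a binary covariate (+1 versus +3, an assumption of ours) and estimate the group-specific effects by stratified differences in means:

```python
import random

random.seed(2)
n = 4000
g = [random.randint(0, 1) for _ in range(n)]   # binary covariate (e.g., group)
t = [random.randint(0, 1) for _ in range(n)]   # randomized treatment
# assumed truth: treatment effect is +1 in group 0 and +3 in group 1
y = [(1 + 2 * gi) * ti + random.gauss(0, 1) for gi, ti in zip(g, t)]

def cate(group):
    """Stratified difference in means: estimates E[Y(1) - Y(0) | G = group]."""
    yt = [yi for yi, gi, ti in zip(y, g, t) if gi == group and ti == 1]
    yc = [yi for yi, gi, ti in zip(y, g, t) if gi == group and ti == 0]
    return sum(yt) / len(yt) - sum(yc) / len(yc)

cate0, cate1 = cate(0), cate(1)   # roughly 1 and 3, respectively
```

With a single binary covariate the stratification is trivial; ML methods matter when X is high-dimensional and the relevant strata are unknown in advance.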

ML predicts propensity scores and similar metrics
In the second combination, scholars apply ML in the service of commonly used causal methods. Even if the statistical model of interest is a parametric model where β imprints the causal effect of interest, ML can serve this model in an initial estimation step. While many parametric approaches that rely on two or more estimation steps can benefit from such a service, instrumental-variable methods (Belloni et al., 2014, 2018; Carrasco, 2012) and propensity-score models are two prominent examples (Belloni et al., 2018; Hartford et al., 2016).
Similarly, instead of relying on logistic regression to estimate a propensity-score model and running the risk of overfitting, new methods utilize AMC-type procedures and algorithms to estimate propensity scores (Lee et al., 2010). These scores are then used in downstream causal-estimation methods, such as inverse probability weighting. Many current methods combine the best of both worlds by predicting the treatment propensity and evaluating an outcome (regression) model (van der Laan & Rubin, 2006; Nie & Wager, 2018; Schuler & Rose, 2017; Sverdrup et al., 2020).
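A self-contained sketch of this two-step recipe: fit a propensity model, then reweight. The data-generating process is assumed (a single confounder raising both treatment uptake and the outcome; true effect +2), and a hand-rolled logistic regression stands in for whatever ML propensity model a scholar might prefer:

```python
import math
import random

random.seed(3)
n = 2000
x = [random.gauss(0, 1) for _ in range(n)]   # confounder
# confounding: x raises both treatment uptake and the outcome; true effect = +2
t = [1 if random.random() < 1 / (1 + math.exp(-xi)) else 0 for xi in x]
y = [2.0 * ti + 1.5 * xi + random.gauss(0, 1) for ti, xi in zip(t, x)]

n1 = sum(t)
naive = (sum(yi for yi, ti in zip(y, t) if ti) / n1
         - sum(yi for yi, ti in zip(y, t) if not ti) / (n - n1))  # biased upward

# step 1: fit a logistic propensity model e(x) by gradient ascent
w0 = w1 = 0.0
for _ in range(500):
    g0 = g1 = 0.0
    for xi, ti in zip(x, t):
        p = 1 / (1 + math.exp(-(w0 + w1 * xi)))
        g0 += ti - p
        g1 += (ti - p) * xi
    w0 += 0.5 * g0 / n
    w1 += 0.5 * g1 / n

# step 2: inverse-probability weighting with the predicted propensity scores
total = 0.0
for xi, ti, yi in zip(x, t, y):
    e = 1 / (1 + math.exp(-(w0 + w1 * xi)))
    total += ti * yi / e - (1 - ti) * yi / (1 - e)
ate_ipw = total / n   # close to the true effect of 2; naive is not
```

Swapping step 1 for a forest or boosted model is exactly the AMC-for-DMC substitution the text describes; the downstream weighting is unchanged.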

ML facilitates interventional and sequential decision making
In the third method combination, HMC scholars use ML for policy optimization in dynamic sequential decision-making (Russo et al., 2018). By dynamic, we mean a situation where a treatment assigned at time t has a causal effect not only on the outcome at time t + 1, but also on subsequent treatment decisions (Hernan & Robins, 2020). For example, a doctor treating a cancer patient seeks to identify the optimal treatment with the least amount of pain (Gottesman et al., 2019; Murphy, 2003). What and how much dosage this doctor decides to inject into the bloodstream of her patient at time t will not only affect the patient's pain level in the next sequence but also the doctor's set of options in future sequences. Consequently, this sort of sequential decision-making problem translates to finding the optimal policy with the desired causal effect in as few treatment steps as possible (Murphy, 2003).
On the one hand, the problem of 'searching over potential treatments (or actions, policies) to find the optimal effect' aligns with DMC's ambitions of identifying a causal effect. On the other hand, this problem also has a predictive structure similar to those AMC problems of algorithms playing computer or board games (Russo et al., 2018). Consider DeepMind's reinforcement-learning algorithms AlphaStar and AlphaGo, which are able to select the decision at time t that predicts the best chance of eventually winning the game. These algorithms are purely predictive, lacking any causal component. Yet they work. Building on the predictive power of reinforcement-learning algorithms, HMC-influenced scholars combine these algorithms with causal inference (Zhang & Bareinboim, 2020). One way of achieving this combination is by estimating the causal effect of a particular decision and predicting the final outcome for a sequence of similar decisions. The algorithms achieve this complex task by using principles similar to the first way in which ML supports causal inference: imputing potential outcomes. Reinforcement algorithms impute potential outcomes for many possible sequences, and then, based on these synthetic data, they select optimal decisions. Scholars have also combined these algorithms with wearables and sensing technologies, thereby embedding sequential medical interventions directly into patients' daily lives. Other scholars explore how reinforcement algorithms can be used to find optimal economic policies (Kasy, 2018; Zheng et al., 2020).
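Far simpler than AlphaGo, a two-armed bandit with an epsilon-greedy policy still illustrates the sequential structure: each decision both exploits the current effect estimates and generates the data that refine them. The success probabilities below are invented for illustration:

```python
import random

random.seed(4)
true_p = [0.3, 0.6]     # assumed success probabilities of two treatments
counts = [0, 0]         # times each arm was chosen
successes = [0, 0]      # observed successes per arm

def choose(eps=0.1):
    """Epsilon-greedy policy: usually exploit the best estimate, sometimes explore."""
    if random.random() < eps or 0 in counts:
        return random.randint(0, 1)                    # explore
    rates = [successes[a] / counts[a] for a in (0, 1)]
    return rates.index(max(rates))                     # exploit

for _ in range(5000):
    arm = choose()
    reward = 1 if random.random() < true_p[arm] else 0
    counts[arm] += 1
    successes[arm] += reward

best_share = counts[1] / sum(counts)   # share of decisions on the better arm
```

After a short exploration phase, the policy concentrates its decisions on the better treatment, a toy version of finding the optimal policy in as few treatment steps as possible.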

ML discovers causal systems
In the fourth method combination, scholars apply ML for causal discovery. Many social-scientific domains lack robust theories about how events in that domain are causally connected (Swedberg, 2017). The lack of such theories leads to imprecise DAG representations of the causal system of interest. To remedy this lack and to fuel causal theorizing, causal-discovery algorithms suggest DAGs for further analysis (Jaber et al., 2018;Peters et al., 2017). Assuming the existence of observational data that represents all key variables of a causal system, these algorithms search over the covariate space to find and suggest DAGs in a data-driven way (Spirtes et al., 2001).
The causal-inference toolbox offers different algorithms for suggesting DAGs (Glymour et al., 2019). An independence-based algorithm tests for permutations of conditional and unconditional independence among variables X_1, X_2, …, X_k, using their joint distribution. In a data set containing only two variables, if that algorithm assesses that these variables are likely independent, then it assigns a low probability that these two variables are causally connected. Nonetheless, these algorithms face several limitations. First, testing conditional independence is a statistically hard problem (Shah & Peters, 2020). While conditional-independence tests are relevant for many statistical practices (Dawid, 1979), these tests form the foundation for many discovery algorithms, and the success of these algorithms depends on the capacity of such tests (Shah & Peters, 2020). Second, causal discovery assumes that all relevant variables of the causal system of interest are measured (Robins & Wasserman, 1999). That is a strong assumption, required because competing social theories will likely stipulate different data representations and measurements. These requirements will likely not be fulfilled for even a modestly complex causal system.
Third, even if all variables were measured and conditional-independence tests were unbiased, causal-discovery algorithms rarely find one optimal DAG, but many candidate DAGs that are equally well suited to represent the same causal system given the data (Peters et al., 2017). The observed data can only do so much.
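A sketch of the independence-testing idea behind discovery algorithms. For data simulated from an assumed chain X → Y → Z, X and Z are correlated marginally but (approximately) independent given Y; a partial correlation near zero is the statistical signature such an algorithm would exploit:

```python
import math
import random

random.seed(5)
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]   # X -> Y
z = [yi + random.gauss(0, 1) for yi in y]   # Y -> Z  (assumed chain X -> Y -> Z)

def corr(a, b):
    """Pearson correlation of two equal-length lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

r_xy, r_yz, r_xz = corr(x, y), corr(y, z), corr(x, z)
# partial correlation of X and Z given Y: near zero under the chain DAG
r_xz_given_y = (r_xz - r_xy * r_yz) / math.sqrt((1 - r_xy**2) * (1 - r_yz**2))
```

The same independence pattern is produced by other DAGs (e.g., Z → Y → X), which is the equivalence-class limitation the text describes: data alone rarely single out one DAG.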
Nonetheless, combined with domain knowledge or randomized controlled trials, scholars can continue filtering plausible from implausible DAGs. This combination of machine-suggested DAGs and human domain knowledge of the causal system has proven fruitful in, for example, genetics. Because of the complexity of how genes interact and regulate each other, scholars have yet to precisely determine the causal direction of any genetic system. Causal-discovery algorithms have proven useful in suggesting a causal representation of how genes regulate each other (Glymour et al., 2019). While causal discovery is a vibrant field of research with promising contributions, applied social science has yet to evaluate its usefulness.
In sum, causal inference is benefiting from ML in at least four different ways. As discussed, from a mixture of DMC and AMC practices, a new set of synthesized statistical practices has emerged. This new culture, HMC, weaves prediction and inference into synthesized procedures. 'ML for imputing potential outcomes' is perhaps the clearest confirmation of the existence of HMC. Here, the goal is to capture β, but through ML imputations the traditional distinction between ŷ and β has become superfluous.

ML for Data Acquisition
Ideally, scholars would be able to measure all the necessary variables that represent the causal system of interest. The primary and minimum variables for causal inference in observational studies are a treatment, T, an outcome, Y, and a confounder, X. Figure 3 shows a DAG containing the basic set of variables for causal analysis in observational settings. If either the treatment or the outcome is unobserved, quantifying their causal connection is impossible; if only the confounder is unobserved, statistical models will produce biased estimates of the causal effect. Because a confounder is a variable that affects both the treatment and the outcome, the magnitude of this bias depends on how strongly this confounder affects these two variables (Hernan & Robins, 2020).
For any scientific domain, therefore, measuring high-quality data about the causal system of interest remains a crucial task.
The second practice of HMC mobilizes AMC-type algorithms to support this task. Table 2 defines the key characteristics of these measurement practices. While in DMC-type practices scholars rely more on analog methods to measure data and less on digital ones, in AMC this emphasis is reversed (Salganik, 2017). These AMC-type algorithms extract data from structured or unstructured digitized sources. A digital source is a piece of information existing as '1s' and '0s' on a computer. By structured, we mean information that exists as a tidy data matrix with well-defined variables and values. Conversely, unstructured information has yet to be preprocessed into a meaningful structure. Digitized historical archives are one example of unstructured information (Salganik, 2017). Processing a large amount of structured and unstructured information is one of the tasks where ML excels (Blei et al., 2003).
Mobilizing research assistants to code up the political content of archival policy documents exemplifies an analog method; training natural language processing (NLP) algorithms to do the same thing is a digital method (Grimmer & Stewart, 2013). Likewise, in famine research, employing assistants to analyze geographical maps to code events of drought is an analog method; training image-recognition algorithms to detect drought in satellite images is a digital method (Mahecha et al., 2020). Each method has its strengths and weaknesses. Because analog methods rely on humans, these methods are better tuned to measure sensitive content, but they are slower and more costly. While digital methods still require human supervision for training data and for interpreting the content of unsupervised results, they are faster and cheaper when applied to large data sets (Salganik, 2017).
In the digital era, several additional petabytes of data are made available every year. Equipped with DMC logic for sampling, HMC scholars use AMC practices to efficiently sift through these data for measurement.
Although the role of digital methods is considerable in HMC, analog methods remain an essential part of HMC practices. While unsupervised algorithms reduce high-dimensional data (e.g., an archival document) to low-dimensional representations (e.g., a topic-model distribution), scholars still have to interpret these low-dimensional representations manually, what Chang et al. (2009) metaphorically compared to reading tea leaves. Unsupervised ML consists of algorithms that reduce dimensionality in a set of covariates, X, without any reference to a specific outcome, Y, thereby requiring more human supervision for interpretability. Because supervised ML consists of algorithms tailored to predict a prespecified outcome, Y, using X as input, these algorithms require high-quality labeled data prepared by scholars. Often this labeling of data relies on surveys or qualitative coding of digital sources, based on the scholars' expertise. The gain of supervised ML is that the algorithm is more effective in generalizing to unlabeled data (Hastie et al., 2009).
When applying ML to process digital sources, scholars are conducting a form of data measurement. Instead of asking people directly about their material living standards, scholars let a machine capture these standards from digital sources (Jean et al., 2016). A limitation of using ML to conduct such measurement is the added error in the data processing.
The total-survey-error approach helps characterize the sources of error in traditional surveying (Groves & Lyberg, 2010). Analog surveys pose specific challenges arising from formulating questionnaires, conducting interviews, and other disturbances affecting measurements. This approach provides a framework to characterize the sources of error when a scholar measures an event, X, and quantifies it as a variable, X*, in the data. Although different errors exist, they categorize either as bias (systematic error) or variance (random error). Systematic error, ε_b, is any information that shifts the sample estimate away from the true value in a consistent manner. Inaccurately calibrated instruments usually cause such shifts. An instrument is a means for acquiring information about the event of interest. For example, poorly phrased wordings in a questionnaire (the instrument) will lead to over- or underreporting of a respondent's behavior.
Systematic error is reduced by improved calibration of the instrument. Random error, ε_v, is the natural variation arising from sampling procedures that affects the accuracy of the instrument. Random errors cancel each other out as the sample size increases, because a negative deviation for one individual is eventually canceled out by a positive deviation for another. Putting these two sources of error together, a survey will always be an imperfect representation of any variable in a causal system, as described by the following formula: X* = X + ε_b + ε_v.
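A small Monte Carlo sketch of the formula's two error components (the instrument's bias and noise values are assumed): random error shrinks as the sample grows, while systematic error persists no matter the sample size:

```python
import random
import statistics

random.seed(6)
TRUE = 10.0   # the true value of the event X (assumed)
BIAS = 0.5    # systematic error from a mis-calibrated instrument (assumed)

def survey(n):
    """Each response = truth + systematic shift + random noise; return the mean."""
    return statistics.mean(TRUE + BIAS + random.gauss(0, 2) for _ in range(n))

small = survey(50)        # noisy: random error dominates
large = survey(200_000)   # random error has averaged away; the bias remains
```

The large-sample estimate converges to TRUE + BIAS, not TRUE, which is why only recalibrating the instrument, not collecting more data, removes ε_b.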
While scholars can improve their instruments and surveying execution to reduce these two sources of error, they will eventually hit a limit where they start trading bias for variance or vice versa. Both analog and digital methods suffer from the same limitations, yet digital methods inject additional errors (Salganik, 2017).
Because many supervised ML algorithms rely on the analog data source for training samples, these algorithms can only recreate imperfect representations of X* and not of the true event X. This imperfection arises from training and testing the algorithms (Hastie et al., 2009). Famine and poverty research serves as an illustration. Many scholars use analog methods to measure people's living conditions, X, by surveying people's income or material assets, X*. In designing and executing their surveys, scholars encounter both systematic and random errors, ε_b + ε_v, that will distort their sample estimate, X*. These errors are additive if there is no interaction between systematic and random errors. Nonetheless, X* is the best analog representation a scholar can produce to capture X.
Subsequently, suppose that another group of scholars is aiming to speed up the surveying of poverty by combining analog and digital sources such as satellite images (Blumenstock et al., 2015; Jean et al., 2016; Yeh et al., 2020). These images reveal the living conditions of people, as they appear from the sky. So, this group collects X* as a training sample (analog data) for their ML algorithm, together with satellite archives (digital data). Their goal is to train an algorithm to measure (predict) income from the pixel features of these satellite images. This procedure constitutes defining a second-order representation, X**, which carries an additional error composed of ε_b' + ε_v', assuming again additive errors. Although this ML-satellite approach to measuring poverty is faster than letting humans survey poverty, each new ML step added to represent X continues eroding the human-survey-produced X*. That X* is already an imperfection of X, and each ML step induces new systematic and random errors. This erosion is expressed in the following mathematical relationships for one analog and one digital measurement: X* = X + ε_b + ε_v and X** = X* + ε_b' + ε_v'. Systematic and random errors compound from the analog (i.e., ε_b, ε_v) and digital methods (i.e., ε_b', ε_v'). The more algorithmic transformations added on top of the first procedure, the more these errors are likely to propagate, continuing to corrode X, and thereby aggravating the measurement error.
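The compounding of analog and digital errors can be sketched numerically (all error magnitudes are assumed): each measurement layer adds its own systematic shift and noise, so the digital proxy X** sits farther from the truth X than the analog survey X*:

```python
import random

random.seed(7)
n = 20_000
x_true = [random.gauss(0, 1) for _ in range(n)]
# analog survey X*: assumed systematic shift 0.3 plus noise with sd 0.5
x_star = [xi + 0.3 + random.gauss(0, 0.5) for xi in x_true]
# digital ML step X** trained on X*: adds its own assumed shift and noise
x_2star = [xi + 0.2 + random.gauss(0, 0.5) for xi in x_star]

def mse(est):
    """Mean squared error against the true, unobserved X."""
    return sum((a - b) ** 2 for a, b in zip(est, x_true)) / n

mse_analog, mse_digital = mse(x_star), mse(x_2star)   # the digital MSE is larger
```

Under these assumptions the expected MSEs are bias² + variance: about 0.34 for X* and 0.75 for X**, illustrating why each added transformation aggravates total measurement error.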
Although measuring events digitally induces additional error, digital approaches are usually faster and less costly. These two advantages have prompted HMC-influenced scholars to use new sources of data to impute the missing data that populate their representation of the causal system of interest, a procedure that Bareinboim and Pearl (2016) formalized under the name of data-fusion problems. For example, scholars use topic models or other representations to summarize text to capture confounding (Åkerström et al., 2019; Blei et al., 2003; Blei, 2012; Egami et al., 2018; Mozer et al., 2020; Roberts et al., 2016); they process images to measure outcomes (Jean et al., 2016) or confounding (Jerzak et al., 2022, in press); they assemble corpora of health records to record patients' backgrounds (Hsu et al., 2020); and they use video and audio to capture other representations that facilitate scientific inquiry (Tarr et al., 2022; Knox & Lucas, 2019). These new measures are then used for downstream causal-inference tasks.
However, a critical challenge of using digital sources arises when the same source is used two or more times for measuring multiple variables, or when information about one variable leaks into another variable. This is called treatment leakage when treatment information leaks into the measurement of the confounding variable. This leakage leads to posttreatment bias. Thus, refining analog and digital measurement approaches will remain a crucial part of the scientific endeavor.

ML for Theory Prediction
A vital goal of the scientific endeavor is to explain not only past observations of how one event causes another but also to predict future instances of these causal events (Gelman & Imbens, 2013; Watts, 2014).
This goal translates to testing a theory's predictive power: theory prediction, for short (Freedman, 1991; Kleinberg et al., 2017; Marini & Singer, 1988; Peysakhovich & Naecker, 2017; Salganik, Lundberg, et al., 2020; Watts, 2014). One test of a scientific theory and its corresponding DAG is to evaluate the amount of statistical support it receives in data, focusing on causal effects. Another test of the usefulness of a scientific theory is to evaluate how predictive it is for other populations (Billheimer, 2019). Although Breiman (2001a) argues in favor of prediction as a tool for fueling the scientific endeavor, he remains unclear about how exactly prediction provides that support. We define a DMC prediction as a recreation of an event Y that is as similar as possible to the true event, yet not observed, generated by a validated statistical model specified under a theory and its DAG. In our famine example, the statistical model representing a Malthusian DAG constitutes one such validated model. A DMC prediction of famines is then Ŷ, for famine events with characteristics X that the statistical model has not observed before. In contrast, an AMC prediction is a recreation of an event that is as similar as possible to the true event, yet not observed, but it is not necessarily conditioned on any validated causal model. In its distilled form, an AMC prediction is a pure prediction problem that uses an algorithm and a data source (Kleinberg et al., 2015). For example, an AMC prediction of our famine example involves collecting any input variables that carry some association with the outcome (famines), training an ML model on these data, and then evaluating this model's predictive power on a held-out set (Okori & Obua, 2011). AMC predictions constitute a horserace among candidate algorithms and data competing for the best predictive performance: Kaggle-style competitions.
DMC predictions are also horseraces, but only among scientific theories and their statistical representations.
HMC predictions are DMC predictions, but because HMC predictions rely on ML, they submit their prediction practices to the principles dictated by AMC. The two essential principles of AMC prediction practices are the use of regularization and evaluation on held-out samples. These two principles minimize the overfitting that will likely come about when scholars attempt to squeeze more variables into their models or tweak their models' functional form to fit the data better in-sample. Additionally, as future observations are unobserved in the present, held-out samples stand in for these missing observations (Risi et al., 2019). A second way to test a theory's predictive power is to evaluate whether the causal effect identified in one population, based on a DAG, can be verified in another population, preferably a randomized controlled trial when feasible. In computer science, these verifications are called transportability of results (Pearl & Bareinboim, 2014), and they are also known as transfer learning, life-long learning, and domain adaptation (Chen & Liu, 2018;Johansson et al., 2019); in statistics, and elsewhere, they go under the name of generalizability or external validity (Deaton & Cartwright, 2016). Some scholars call these verifications forward-causal questions when the event is generalized to lie in the future (Gelman & Imbens, 2013). For example, finding that a social policy is working as intended for a population (e.g., the poor) in one county, scholars may ask how well this policy generalizes to the same county but for future populations. The more populations for which the effect exists and has a similar value, the more support a theory receives.
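The two AMC principles just described, regularization and evaluation on held-out samples, can be illustrated with a minimal sketch (synthetic data; scikit-learn is assumed, and all names are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))                     # input variables
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)   # outcome

# Principle 1: a held-out sample stands in for unobserved future data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Principle 2: regularization (here, an L2 penalty) guards against
# overfitting when many input variables are squeezed into the model
model = Ridge(alpha=1.0).fit(X_train, y_train)

mse_in = mean_squared_error(y_train, model.predict(X_train))
mse_out = mean_squared_error(y_test, model.predict(X_test))
print(f"in-sample MSE: {mse_in:.2f}, held-out MSE: {mse_out:.2f}")
```

The held-out score, not the in-sample fit, is what an AMC practitioner reports, since only it approximates performance on future observations.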
To systematically evaluate the predictive performance of different theories' predictions and causal claims, scholars require a common task framework (Donoho, 2017). This framework is a set of principles defining how predictive performance is scored and evaluated. Its purpose is to ensure the comparability of results. Such a framework follows at least three principles. First, scholars require a publicly available training data set with which they may operationalize their respective scientific theories, articulate their causal assumptions, and formulate their statistical models. This data set must be sufficiently rich to accommodate many different plausible theories. Second, as in any competition, scholars have to make themselves known to each other and agree to the rules of the competition. These rules must at least specify one estimand (quantity of interest) and how scholars' estimators will be evaluated (e.g., by minimizing mean squared error). Third, the competing scholars have to designate a scoring referee to which they can submit their estimators. This referee automatically and objectively checks each scholar's estimator against a held-out data set that has been kept secure behind a firewall during the competition. The referee presents the results and estimators transparently, enabling the competing scholars to reproduce each other's results, thereby enhancing scientific learning.
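The referee's role reduces to a simple, transparent scoring rule. A minimal sketch of such a referee, assuming mean squared error as the agreed evaluation metric (the team names and data are hypothetical):

```python
import numpy as np

def score_submissions(submissions, y_heldout):
    """Referee: score each competitor's predictions against the
    firewalled held-out outcomes by mean squared error."""
    return {name: float(np.mean((pred - y_heldout) ** 2))
            for name, pred in submissions.items()}

# Hypothetical competition: two competitors, one held-out outcome vector
y_heldout = np.array([1.0, 2.0, 3.0])
submissions = {
    "team_a": np.array([1.1, 1.9, 3.2]),
    "team_b": np.array([0.0, 0.0, 0.0]),
}
scores = score_submissions(submissions, y_heldout)
winner = min(scores, key=scores.get)   # lowest MSE wins
print(scores, winner)
```

Because the held-out outcomes stay behind the firewall, competitors can neither tune against them nor dispute the ranking, which is what makes the results comparable across theories.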
Much of the success of AMC is supported by well-functioning common task frameworks (Donoho, 2017).
Often there are clearly defined outcomes and transparent procedures for scoring each competitor's predictions. For example, much of the development of image-recognition algorithms owes its success to the publicly available data source ImageNet (Deng et al., 2009). It gave deep-learning scholars a common benchmark against which new algorithms could be tested. In DMC, setting up a similar infrastructure is challenging because the estimand is often a causal effect, which is, by definition, unobserved in the data. As neither the scoring referee nor the scholars know the true effect, there is no way to score contending estimators. This implies that a common task framework for causal problems has to be complemented with additional assumptions for the competition to work. One way of handling this insufficiency is to combine observational data with randomized controlled-trial data (Lin et al., 2019). As RCTs require the fewest assumptions, one target estimand is the average treatment effect in the RCT. However, this target will only work if the RCT has high internal and external validity. Another way of handling this insufficiency is to rely on simulations in which the causal effect is known exactly. Although that solves the problem of the target estimand, it introduces the problem that the data are merely an artificial representation of a causal system. While establishing a common task framework to evaluate causality remains a challenge in many disciplines, especially in the social sciences, several common task frameworks focus on an observable estimand: predicting outcomes. For example, the Fragile Families Challenge is a scholarly mass collaboration tailored to predict six life outcomes for children at age 15 (Salganik et al., 2019;Salganik, Lundberg, et al., 2020;Salganik, Maffeo, et al., 2020). These outcomes are child grade point average (GPA), child grit, household eviction, household material hardship, caregiver layoff, and caregiver participation in job training.
This challenge attracted 437 scholarly competitors (some of whom worked in teams), resulting in 160 valid submissions. All submissions were evaluated using mean squared error on held-out data. In the social sciences, this challenge is among the first to devise a common task framework for the advancement of science (Meng, 2020).
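The simulation route mentioned above can also be sketched: generate data in which the true causal effect is known, then score competing estimators directly against it (a toy example with synthetic data; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
x = rng.normal(size=n)            # confounder
p = 1 / (1 + np.exp(-x))          # treatment probability depends on x
d = rng.binomial(1, p)            # treatment indicator
tau_true = 2.0                    # the causal effect, known by construction
y = tau_true * d + 3 * x + rng.normal(size=n)

# Estimator A: naive difference in means (ignores the confounder)
tau_naive = y[d == 1].mean() - y[d == 0].mean()

# Estimator B: regression adjustment for the confounder
X = np.column_stack([np.ones(n), d, x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
tau_adjusted = beta[1]

# Because tau_true is known, a referee can score both estimators exactly
print(abs(tau_naive - tau_true), abs(tau_adjusted - tau_true))
```

This is precisely the trade-off noted in the text: the target estimand becomes scoreable, but only within an artificial representation of the causal system.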

Conclusion
To recap, social scientists regard predictive statements and causal inferences as two distinct research problems (Watts, 2014). To a large extent, this distinction follows the fault line of the two cultures of statistical modeling (Breiman, 2001a). The data modeling culture (DMC) is the modus operandi in applied social science; the algorithmic modeling culture (AMC) dominates computer science, engineering, and many policy practices.
DMC research tends to favor causal inquiry but is limited by being caught up in model validation; AMC embraces predictive inference but is not well tailored to the scientific method. Nonetheless, while pure causal and predictive inferences have their place in the social sciences, our article has shown that there is a third kind of problem that synthesizes predictive and causal practices to the extent that it is hard to tell them apart. Given the new scientific opportunities and challenges arising in the digital era, methodologists have developed various techniques fueling the scientific endeavor (Athey & Imbens, 2019;Imai, 2018;Kino et al., 2021;Lazer et al., 2020;Pearl, 2019). Through these developments, a new modeling culture has evolved that has mutated from DMC and AMC: the hybrid modeling culture (HMC). This article has identified the main characteristics of HMC and shown how it synthesizes components of DMC and AMC under the umbrella of explanatory (causal) social-scientific research (Boudon, 2005;Lundberg et al., 2021;Marini & Singer, 1988;Merton, 1968;Risi et al., 2019;Watts, 2014). First, the overarching aim of HMC is to further the production of knowledge. HMC copies this aim from DMC, where the overarching goal is to explain how X causally affects Y. Scholars achieve this goal by assuming a causal system, stipulated by a substantive theory. Under these assumptions, they test competing explanations against data and refute, revise, or update theories depending on the results of these tests.
Second, HMC does not restrict itself to the commonly used statistical models offered by DMC for statistical and causal inference but incorporates the range of powerful algorithms offered by AMC (Bail, 2017;Lazer et al., 2020;Nelson, 2020;Turco & Zuckerman, 2017;Watts, 2017). As scholars combine AMC-type algorithms with DMC-inspired inference (Lundberg et al., 2021;Molina & Garip, 2019), they erode the traditional distinction between prediction and statistical inference. This erosion recasts the scientific problem into prediction problems: identifying ways to impute (predict) the potential outcomes under treatment and under control.
A possible criticism of HMC is that because it has the same goal as DMC, it is not all that separate a culture from DMC. While these two cultures share the same goal, they use different practices to achieve it. Another way to appreciate the differences among the statistical cultures (HMC, DMC, and also AMC) is to recast the issue of statistical culture as an issue of methodological paradigms (Kuhn, 2012). A methodological paradigm is a way of looking at the world of data and models. Within the paradigm of DMC, scholars would be violating fundamental principles of statistical modeling if they relied on ML for causal inference. Often, the causal estimand of DMC is a parameter in a prespecified model; the causal estimand of HMC is the causal effect itself, defined without commitment to a particular model. Some of the debates that raged between scientific camps within the field of causal inference, for example, those between the perspectives of Judea Pearl and Donald Rubin, are likely to have their roots in the paradigmatic differences between DMC and HMC (Imbens, 2020). Some of those debates pertain to the use of DAGs, but several of them are about what estimands are scientifically valuable to target in the first place. This targeting also implies that DMC is more stringent about using traditional causal estimation methods, while HMC is more permissive. Even if HMC is not associated with a particular method, we have shown that it tends to use AMC-type methods, not because these ML algorithms are fancy, but because they are suitable for the age of data science and for scientific inquiry.
It is a vindication for HMC, but perhaps also a historical irony, that Breiman proposed the random forest as a hallmark of AMC (Breiman, 2001a, 2001b), yet two decades later that same algorithm has been adapted for HMC-style causal inference. At the time of writing, there are several derivatives of the random forest, such as the generalized random forest, Bayesian Additive Regression Trees (Hahn et al., 2020;Hill, 2011), and other tree-based methods for effect heterogeneity (Brand et al., 2021).
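A minimal sketch of tree-based effect heterogeneity follows. It uses a simple T-learner with ordinary random forests as a stand-in for the causal-forest machinery cited above; the data are synthetic and all names are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 4000
X = rng.uniform(-1, 1, size=(n, 3))
d = rng.binomial(1, 0.5, size=n)           # randomized treatment
tau = np.where(X[:, 0] > 0, 2.0, 0.0)      # effect differs across subgroups
y = tau * d + X[:, 1] + rng.normal(scale=0.5, size=n)

# T-learner: fit separate outcome models for treated and control units
m1 = RandomForestRegressor(random_state=0).fit(X[d == 1], y[d == 1])
m0 = RandomForestRegressor(random_state=0).fit(X[d == 0], y[d == 0])

# Conditional average treatment effect: difference of predicted outcomes
cate = m1.predict(X) - m0.predict(X)
print(cate[X[:, 0] > 0].mean(), cate[X[:, 0] <= 0].mean())
```

The estimated effect recovered for each subgroup tracks the built-in heterogeneity, which is the kind of question the generalized random forest and its relatives address with proper honest splitting and inference.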
Hybrid modeling culture informs a better social science than AMC or DMC alone. Embracing the logic of HMC will likely enable scholars to venture beyond current frontiers with deeper confidence. First, loosening the dependence on DMC-type linear models will likely produce more robust research. Models will be less dependent on researchers' discretion in specifying, tweaking, and tinkering with statistical models. Machine-learning algorithms are not entirely immune to tinkering, but the processes of cross-validation and regularization provide some safeguard against cherry-picking results. Second, and relatedly, the use of ML for causal inference may help alleviate the replication crisis in the social sciences. Successfully replicating a study has several dimensions, but what HMC can assist with is precisely the use of more robust models via not only cross-validation but also regularizing ML algorithms. As previously mentioned, regularization pushes an algorithm to generalize to other data better than DMC-type models typically do. Third, embracing the logic of HMC implies that scholars will be better equipped to navigate the use of analog and digital sources. When sufficient data (e.g., a survey) are available in analog form, scholars can rely on those data alone; but when such analog data are insufficient for the causal estimand of interest, an HMC-informed scholar will be able to navigate the statistical issues better than a DMC scholar, combining analog data with digital sources and using ML to measure what is missing.
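How cross-validation and regularization jointly remove the researcher's discretion can be sketched with scikit-learn's LassoCV, where the penalty strength is chosen by cross-validation rather than by hand (synthetic data; names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 300, 50
X = rng.normal(size=(n, p))
# Only 2 of the 50 candidate inputs actually matter
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=n)

# The penalty strength (alpha) is selected by 5-fold cross-validation,
# not by the researcher's tweaking, guarding against cherry-picking
model = LassoCV(cv=5, random_state=1).fit(X, y)

n_selected = int(np.sum(model.coef_ != 0))
print(f"alpha={model.alpha_:.3f}, nonzero coefficients: {n_selected}")
```

The L1 penalty shrinks most of the spurious coefficients to exactly zero while retaining the genuinely predictive inputs, which is the robustness-to-tinkering property the paragraph above appeals to.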
Fourth, HMC will likely inspire social scientists to increase the precision of their theories. Currently, many theories abstain from defining all the structural relationships in the causal system because of the inherent causal complexity of the social world. Nonetheless, human-assisted causal discovery algorithms will enable researchers to process large data sets and to start translating their theories into well-defined structural relationships. Defining structural relationships means characterizing not only the relationship of interest between the cause and the effect but also how the context variables are related to each other and to the outcome. For example, defining exactly how the social environment interacts with human genes in explaining students' university grades is a complex task because scientists do not have strong theories about all the causal structural relationships between genes and the social environment (Beauchamp, 2016;Courtiol et al., 2016). Using causal discovery algorithms is a viable way of starting to create some order in these complex data and thereby refining social theories.
Hybrid modeling culture gives rise to new methodological challenges and thus inspires at least three directions for future methodological research. Standard errors are the foundation of DMC-type inference, but they measure only sampling variability. Quantifying sampling variability is insufficient for HMC, as additional uncertainties threaten HMC-type inference. Because HMC scholars use multiple data sources and models, and in different phases, they need to capture the additional uncertainties reflecting this multitude. They need to handle a trinity of inference: multisource, multiphase, and multiresolution (Blocker & Meng, 2013;Li & Meng, 2021;Meng, 2014).
Multisource refers to the increasingly common situation where a single study uses data sets from different sources of varying quality (Meng, 2018). Because of this variation, big does not imply better: a large amount of biased data can do far more damage than a smaller data set because it can lead us to be overly confident in erroneous results. In the case of using satellite images to measure poverty, the data comprise household surveys, satellite images, nightlight data, and other sources (e.g., ImageNet for transfer learning).
These data are collected from different continents and years and are thus plagued by sampling variation and by biases arising from different satellite technologies, seasonality, and changing survey definitions (Burke et al., 2021). A critical question is then how scholars may account for these multisource uncertainties beyond merely producing the standard errors of DMC-type statistical inference.
Multiphase refers to the common practice whereby data are collected, preprocessed, and analyzed sequentially by parties with different goals, different access to information, and limited communication among them (Blocker & Meng, 2013). As a result, scholars may encounter the multiphase inference paradox: every party engages in a statistically valid process, yet the ultimate output of the collective processes can be statistically invalid due to the uncongeniality among the processes. For example, household surveys are sampled with the statistical aim of representing a country, and satellite images are collected for monitoring the planet, yet surveys and images are combined for training remote survey methods to measure health and living conditions. But survey data typically suffer from nonresponse, which is often imputed by the survey collectors, and satellite images need to be preprocessed before analysis. Another source of uncertainty is therefore that of data preprocessing, and the question is how scholars may account for these different phases in their analysis.
Multiresolution is about the unit of analysis (aggregation) and the resolution of these units, such as measurement frequency and the granularity of the features (Li & Meng, 2021). While big data encourage finer-resolution analyses (e.g., individualized treatments), there is typically a trade-off between data availability and resolution level: the higher the target resolution, the fewer the relevant data points. Social-scientific inquiry concerns societies, economies, and ecologies, and therefore analyses need to be at the appropriate resolution to be relevant for informing and evaluating local and global policies, yet social-science data often do not have the desired resolution. Thus, another set of future questions is how one can reliably learn from low-resolution data and infer conclusions about the high-resolution target. The mismatch of resolution leads to yet another source of uncertainty.
Even though the trinity of inference exposes HMC research to several sources of uncertainty, HMC offers a compass for the scientific endeavor in the digital era. The social-scientific endeavor is likely to gain more by moving beyond AMC prediction and DMC inference. HMC supplies a statistical culture that capitalizes on the benefits of ML algorithms while keeping an eye on the scientific goal: explaining social reality.