Additional discussions and rejoinder forthcoming.
Bailey (2023) convincingly argues that random sampling in polling is dead and new approaches informed by the Meng (2018) equation (Box 1) are needed to restore confidence in election forecasting. In many medical applications, random sampling has never been feasible or even considered desirable (Ebrahim & Davey Smith, 2013; Msaouel, 2021; Richiardi et al., 2013; Rothman et al., 2013b). However, we will consider here how the Meng equation can provide broader insights beyond conventional sampling theory to inform the design and interpretation of randomized controlled trials (RCTs) in medicine.
Standard sampling theory is used in health surveys and other medical applications where the primary goal is to obtain representative samples and make generalizable inferences regarding larger populations (Woolsey et al., 1954). But this framework is not applicable to most medical research, which instead has adopted the methodological tools developed within the distinct field of experimental design (Fisher, 1949). Rather than using probability samples, clinical trials enroll convenience samples consisting of patients often accrued sequentially over time, who have logistical access to the trial, fulfill prespecified eligibility criteria, and are willing to consent to trial entry. In standard two-arm medical RCTs, the accrued convenience samples of patients provide the basis for comparative inferences regarding the differences in the effect of one intervention versus a control within the sample (Rosenberger et al., 2019; Senn, 2021). Because random sampling is not used, textbooks typically note that statistical logic cannot be used to extrapolate the knowledge generated from RCTs to broader populations (Edgington & Onghena, 2007). A potential resolution to this impasse could be to adapt methodological innovations emerging in the polling field (Bailey, 2023; Meng, 2018) to also tackle the growing challenges associated with nonprobability samples in RCTs.
Box 1. The Meng equation using the notation by Bailey (2023)

Ȳn − ȲN = ρR,Y × √((N − n)/n) × σY

Here Ȳn is the mean of the outcome Y in a sample of size n, ȲN is its mean in the population of size N, ρR,Y is the data defect correlation between Y and selection into the sample, √((N − n)/n) is the data quantity term, and σY is the data difficulty term (the standard deviation of Y).
Strong arguments have been made against the use of representative samples in experiments (Rothman et al., 2013a, 2013b; Senn, 2021). Our research group often follows this advice when we design our studies to develop new therapies and understand how and why renal medullary carcinoma (RMC) occurs in certain populations (Shapiro et al., 2021; Soeung et al., 2023). RMC is a deadly kidney cancer that mainly afflicts young individuals carrying the sickle cell trait, an inherited blood disorder that otherwise typically has minimal health impact (Msaouel et al., 2019). When we conduct our animal experiments to model RMC, we do not sample representative members of the mouse species, but instead use genetically homogeneous mice bred in the laboratory to carry the human sickle cell trait under carefully controlled conditions.
Similarly, physician-scientists are taught that biomarker-based patient selection in clinical trials is key for the success of new targeted therapies (Dienstmann et al., 2013; Garralda et al., 2019; Melamud et al., 2020). A commonly cited example in modern oncology is the development of trastuzumab, the drug that almost single-handedly kickstarted the modern era of precision oncology (Hayes, 2019). Trastuzumab is designed to target the human epidermal growth factor receptor 2 (HER2), which is highly expressed in 25–30% of breast cancers (Msaouel, Lee et al., 2022). The RCTs that led to the regulatory approval of trastuzumab were specifically designed to enroll patients whose breast cancers expressed HER2 (Eisenhauer, 2001). A major challenge when exploring causal interactions between patient covariates such as HER2 expression levels and interventions tested in RCTs is that much larger sample sizes are typically needed to draw reliable conclusions (Brookes et al., 2004; Gelman et al., 2020). If the pivotal trastuzumab RCTs had enrolled representative samples from the whole population of patients with breast cancer, irrespective of HER2 expression, they would have required substantially larger sample sizes to detect the efficacy of trastuzumab and would also have needlessly exposed patients whose breast cancers did not express the drug target to its side effects (Msaouel, Lee et al., 2022). But even beyond such anecdotal examples within precision oncology, there is a broader recommendation toward reducing outcome heterogeneity in patients enrolled in RCTs to increase the precision of our comparative inferences (Senn, 2004, 2021).
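The sample-size penalty for interaction tests noted by Brookes et al. (2004) can be sketched with a standard power approximation. The effect size (0.5 standard deviations), significance level, and power below are illustrative assumptions, not values taken from any of the trials discussed:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf  # standard normal quantile function


def n_per_group(delta, sigma=1.0, alpha=0.05, power=0.80):
    """Approximate patients per group to detect a mean difference
    `delta` with a two-sided, two-sample z-test."""
    return 2 * (z(1 - alpha / 2) + z(power)) ** 2 * sigma ** 2 / delta ** 2


n_main = n_per_group(0.5)   # per arm for a 0.5 SD main effect (~63)
total_main = 2 * n_main     # two-arm trial total (~126 patients)

# An interaction contrast (A - B) - (C - D) of the same magnitude has
# twice the variance per cell, so each of the four cells needs 2 * n_main
# patients: roughly 4x the total sample size of the main-effect trial.
total_interaction = 4 * 2 * n_main  # ~502 patients
```

Under these assumptions, an interaction of the same magnitude as the main effect requires about four times the total enrollment, which is why trials powered only for the main comparison rarely support reliable subgroup-interaction inferences.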
In a related discussion in the datamethods forum, Sander Greenland pointed out that random sampling and the randomization used in RCTs may be viewed as isomorphic, in that random sampling equates to randomly allocating who is included or excluded from the sample, while randomization is like randomly sampling from the total selected experimental group to determine who gets the treatment. Accordingly, the Meng equation may also be used to elucidate key advantages of using biologically identical animal models in preclinical studies and carefully selected unrepresentative patient samples in clinical trials, even in the absence of randomization (Box 1). More specifically, the ‘data difficulty’ term σY in the Meng equation corresponds to the square root of the variance of the outcome Y. Therefore, a major benefit of reducing the heterogeneity of Y is that it drives data difficulty toward zero. This reduces sampling error and thus allows us to more reliably determine the effects of each experimental intervention, irrespective of the data defect correlation ρR,Y defined as the association between Y and selection into the experimental sample. An additional benefit suggested by Figure 1 in Bailey (2023) is that homogeneous populations are far smaller than heterogeneous ones, and therefore, per the ‘data quantity’ term in the Meng equation, the same experimental sample size will produce smaller error. The trade-off is that the population from which the sample is drawn becomes narrower, thus hindering generalizability. Because the primary goal of experiments is to estimate the comparative differences in Y between the studied interventions within the sample, the loss in generalizability is often deemed an acceptable and pragmatic sacrifice.
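The Meng identity in Box 1 can be checked numerically. The sketch below uses a hypothetical population and a selection mechanism that favors high values of Y; the error of the unweighted sample mean factors exactly into the data defect correlation, data quantity, and data difficulty terms:

```python
import numpy as np

rng = np.random.default_rng(0)

N = 100_000                    # hypothetical population size
Y = rng.normal(50, 10, N)      # hypothetical outcome in the population

# Nonrandom selection: inclusion probability increases with Y
p = 1 / (1 + np.exp(-(Y - Y.mean()) / 20))
R = rng.random(N) < p          # inclusion indicator
n = R.sum()

# Left-hand side: actual error of the sample mean
lhs = Y[R].mean() - Y.mean()

# Right-hand side: Meng's three terms (population covariance and SDs)
rho = np.corrcoef(R.astype(float), Y)[0, 1]  # data defect correlation
quantity = np.sqrt((N - n) / n)              # data quantity
difficulty = Y.std()                         # data difficulty (sigma_Y)
rhs = rho * quantity * difficulty

assert np.isclose(lhs, rhs)  # the identity holds exactly
```

Shrinking the spread of Y in this simulation (e.g., drawing Y with a standard deviation of 1 instead of 10) shrinks the error proportionally, which is the advantage of homogeneous experimental material described above.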
Generalizability here refers to the extension of inferences from the RCT sample to a population of patients that either matches or is a subset of the population of patients eligible for the trial. Conversely, the term transportability is used to describe the extension of knowledge gained from the RCT to populations that were not necessarily represented in the enrolled trial sample (Degtiar & Rose, 2023). Diverse methodological approaches are being developed to model RCT generalizability and transportability (Bareinboim & Pearl, 2021; Dahabreh et al., 2020). A common strategy is to assume transportability from RCT samples to future patients who share the biological causal properties that mediated the comparative treatment effect. For example, inferences from an RCT that compared therapies targeting HER2 in breast cancer can be transported to patients with HER2-positive breast cancer who are otherwise entirely distinct from the sample used in the RCT (Msaouel, Lee et al., 2022). Conversely, these RCT results would not be applicable to patients whose breast cancer does not express HER2, even if they otherwise closely resemble the sample enrolled in the trial.
While representativeness may not be necessary for drawing comparative conclusions in RCTs, this does not negate the ethical and societal value of inclusiveness, which aims to increase the participation of minority and underserved groups in RCTs to mitigate health care inequalities (Behring et al., 2019). An additional scientific concern driving the need for both inclusiveness and representativeness is whether the magnitude of the estimated comparative effect in an RCT might vary between subgroups, defined by covariates like race or gender. If biological knowledge or prior clinical data suggest an interaction between the intervention and such covariates, and if the difference in effect size is substantial enough to alter conclusions about the comparative impact of the intervention, then failing to include enough minority patients in an RCT would be a disservice to society (Msaouel, 2022).
Epistemic humility is another argument for obtaining more representative patient samples in RCTs. Medical interventions can influence efficacy and adverse event outcomes by interacting with patient covariates in unanticipated ways. In these scenarios, excluding patient populations from RCT enrollment may deny them the unexpected benefit of the intervention under investigation. For example, phase 3 RCT testing of the new HER2-targeting antibody-drug conjugate trastuzumab deruxtecan recently revealed significant efficacy in patients with breast cancer whose HER2 levels were previously considered too low to justify treatment with conventional trastuzumab (Modi et al., 2022). Furthermore, knowledge generated by unrepresentative homogeneous laboratory models may not extrapolate to heterogeneous patient populations treated in real-life clinical settings (Pound & Ritskes-Hoitinga, 2018).
Statistical estimates of our uncertainty are theoretically easier to quantify if the results were yielded by a mechanism that included a random component (Greenland, 1990). Random sampling accordingly allows valid estimation of the uncertainty for inferences derived from the sample and extrapolated to the larger population. Because patients enrolled in medical RCTs are typically convenience samples, it is far harder to generate statistical estimates from each sampled group that would apply to populations. For example, the phase 3 RCT that tested trastuzumab deruxtecan versus control in patients with breast cancer reported the group-specific median overall survival estimate for each of the treatment and control groups (Modi et al., 2022). More specifically, the median overall survival for the trastuzumab deruxtecan group was reported as 23.4 months with a 95% confidence interval (CI) of 20.0 to 24.8, whereas the median overall survival for the control group was 16.8 months (95% CI 14.5 to 20.0). These reported 95% CIs imply inferences about some larger population, yet such inferences are exceedingly hard to justify because they are not based on probability sampling. However, RCT reports all too often implicitly assume away how convenience sampling influences the estimation of these group-specific outcomes. Bailey (2023) provides a roadmap on how to identify and account for these sampling biases in order to generate potentially generalizable group-specific estimates. It is possible that these ideas first emerged within political forecasting because such estimates can often be calibrated against actual election outcomes (Meng, 2022). Conversely, it is difficult to verify how much confidence we should have placed in the reported group-specific estimates from medical RCTs.
Instead of random sampling, the random component in most medical RCTs is introduced via randomization, that is, the random allocation of interventions. Typical randomized trial designs ensure that each patient who enrolls in a two-arm RCT has an equal probability of receiving either of the two interventions. This allows reliable estimation of the uncertainty for comparative differences in the outcomes between the two arms. Thus, even though convenience sampling makes estimating group-specific 95% CIs very challenging, randomization licenses the estimation of 95% CIs for outcome differences between the groups. The hazard ratio (HR) is a metric that is commonly used to quantify survival differences between groups in medical RCTs (Msaouel, Jimenez-Fonseca et al., 2022). Accordingly, the reported HR for overall survival in the trastuzumab deruxtecan RCT was 0.64 with a 95% CI of 0.49 to 0.84 (Modi et al., 2022). Accurately estimating such comparative 95% CIs is the primary task of RCTs. However, the frequent reporting of group-specific estimates in many trials suggests that there is clinical interest in reliably estimating how these outcomes can be extrapolated to external populations.
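The contrast between untrustworthy group-specific estimates and trustworthy comparative estimates can be illustrated with a small simulation. All quantities below are hypothetical: enrollment is skewed toward less frail patients, so the control-group mean is biased relative to the full patient population, yet randomization within the enrolled sample still recovers the constant treatment effect:

```python
import numpy as np

rng = np.random.default_rng(1)

N = 200_000
frailty = rng.normal(0, 1, N)  # hypothetical prognostic covariate
# Convenience sampling: less frail patients are more likely to enroll
enrolled = rng.random(N) < 1 / (1 + np.exp(2 * frailty))
x = frailty[enrolled]

# 1:1 randomization of the enrolled sample
treat = rng.random(x.size) < 0.5
true_effect = 2.0              # constant additive benefit, by assumption
outcome = 10 - x + true_effect * treat + rng.normal(0, 1, x.size)

# Group-specific estimate: biased upward relative to the population
# control mean of 10, because enrollees are healthier than average
control_mean = outcome[~treat].mean()

# Comparative estimate: randomization recovers the true effect of 2.0
effect_hat = outcome[treat].mean() - outcome[~treat].mean()
```

In this sketch the selection bias cancels out of the between-arm difference because randomization balances frailty across arms, while each arm's absolute outcome level still reflects who managed to enroll.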
While estimating comparative differences in outcomes between interventions in RCTs is an important first step, generalizing and transporting inferences to external populations is the underlying reason why these studies are being conducted. These tasks are undeniably hard, but we need to face them head-on. Recent examples include the use of RCT data to inform vaccine policy during the COVID-19 pandemic. The BNT162b2 mRNA COVID-19 vaccine, also known as the ‘Pfizer’ vaccine, was compared with placebo in a pivotal phase 3 RCT that enrolled 43,548 individuals, of whom 43,448 actually received injections: 21,720 with BNT162b2 and 21,728 with placebo (Polack et al., 2020). The trial did not misleadingly provide unadjusted group-specific uncertainty estimates for the primary endpoint of vaccine efficacy. Instead, it reported the comparative estimate of relative vaccine efficacy, which was 95% with a 95% credible interval of 90.3 to 97.6%. But the medical community was less well-prepared to extrapolate these comparative findings across different populations and then rigorously incorporate risk-benefit trade-off considerations to inform health policy and clinical practice.
Similar to the contemporary polling challenges (Bailey, 2023; Meng, 2018), the convenience sampling used in the BNT162b2 vaccine RCT will result in a data defect correlation ρR,Y that is very different from zero. This is because it is very plausible that those who have access to and would be willing to enroll in this RCT are likely to have different COVID-19 outcomes than other populations even after controlling for demographics or other covariates within the enrolled sample (Chaudhary et al., 2021). Nevertheless, even standard weighting approaches accounting for these demographic variables may be preferable to the commonly used approach of naïvely presenting unadjusted group-specific estimates or subgroup analyses of comparative effects. Bailey (2023) also shows the value of other strategies such as random contact, which produce nonrandom samples but have inferential advantages compared with standard convenience sampling. Random contact for RCT enrollment will be challenging, particularly in countries without uniform access to health care, but the benefits may be worthwhile in certain contexts. As suggested by Bailey (2023), observational ‘real-world’ studies may also be used to account for the data defect correlation. An interesting question here would be how the expected value of the outcome Y differs, conditional on covariates, between RCTs and rigorously conducted observational studies (Hernan et al., 2022; Hulme et al., 2023). In some scenarios, theoretical rigor may be increased by randomly selecting from a population of interest the observational sample of patients to be studied. Such efforts may also be facilitated by a vibrant, independently emerging literature on how to combine knowledge from different studies to inform generalizability and transportability (Bareinboim & Pearl, 2016; Breskin et al., 2021).
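As a minimal sketch of what such standard weighting can look like, the toy poststratification below reweights stratum-specific outcome estimates from a trial sample that overenrolled younger patients to an assumed population age distribution. All strata, shares, and outcome values are hypothetical:

```python
import numpy as np

# Hypothetical age strata: <40, 40-64, >=65
pop_share = np.array([0.30, 0.40, 0.30])    # assumed population distribution
trial_share = np.array([0.50, 0.40, 0.10])  # younger patients overenrolled
stratum_mean = np.array([0.95, 0.90, 0.80])  # outcome observed per stratum

# Unweighted estimate reflects the skewed enrollment ...
naive = float(trial_share @ stratum_mean)         # 0.915
# ... while poststratification reweights each stratum to its
# population share before averaging
poststratified = float(pop_share @ stratum_mean)  # 0.885

assert naive > poststratified  # the healthier-looking skew is corrected
```

This only adjusts for the covariates included in the weighting, so it cannot remove the component of ρR,Y driven by unmeasured differences between enrollees and the target population; it is a first step rather than a full correction.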
In summary, the new paradigms described by Bailey (2023) are general enough to have potential clinical applications. These fertile ideas may help us address challenges that are deeply rooted and overlooked in medicine. I look forward to further research in this direction.
I would like to thank Bora Lim and the discussants in the datamethods forum, including Erin Smith, Robert Ryley, Christopher Tong, Sander Greenland, and Frank Harrell, for helpful conversations. All errors herein are solely mine.
Pavlos Msaouel has received honoraria for service on scientific advisory boards for Mirati Therapeutics, Bristol Myers Squibb, and Exelixis; consulting for Axiom Healthcare Strategies; nonbranded educational programs supported by Exelixis and Pfizer; and research funding for clinical trials from Takeda, Bristol Myers Squibb, Mirati Therapeutics, Gateway for Cancer Research, and the University of Texas MD Anderson Cancer Center.
Bailey, M. (2023). A new paradigm for polling. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.9898eede
Bareinboim, E., & Pearl, J. (2016). Causal inference and the data-fusion problem. PNAS, 113(27), 7345–7352. https://doi.org/10.1073/pnas.1510507113
Bareinboim, E., & Pearl, J. (2021). Transportability of causal effects: Completeness results. Proceedings of the AAAI Conference on Artificial Intelligence, 26(1), 698–704. https://doi.org/10.1609/aaai.v26i1.8232
Behring, M., Hale, K., Ozaydin, B., Grizzle, W. E., Sodeke, S. O., & Manne, U. (2019). Inclusiveness and ethical considerations for observational, translational, and clinical cancer health disparity research. Cancer, 125(24), 4452–4461. https://doi.org/10.1002/cncr.32495
Breskin, A., Cole, S. R., Edwards, J. K., Brookmeyer, R., Eron, J. J., & Adimora, A. A. (2021). Fusion designs and estimators for treatment effects. Statistics in Medicine, 40(13), 3124–3137. https://doi.org/10.1002/sim.8963
Brookes, S. T., Whitely, E., Egger, M., Smith, G. D., Mulheran, P. A., & Peters, T. J. (2004). Subgroup analyses in randomized trials: Risks of subgroup-specific analyses; power and sample size for the interaction test. Journal of Clinical Epidemiology, 57(3), 229–236. https://doi.org/10.1016/j.jclinepi.2003.08.009
Chaudhary, S., Benzaquen, S., Woo, J. G., Rubinstein, J., Matta, A., Albano, J., De Joy, R. 3rd, Lo, K. B., & Patarroyo-Aponte, G. (2021). Clinical characteristics, respiratory mechanics, and outcomes in critically ill individuals with COVID-19 infection in an underserved urban population. Respiratory Care, 66(6), 897–908. https://doi.org/10.4187/respcare.08319
Dahabreh, I. J., Robertson, S. E., Steingrimsson, J. A., Stuart, E. A., & Hernan, M. A. (2020). Extending inferences from a randomized trial to a new target population. Statistics in Medicine, 39(14), 1999–2014. https://doi.org/10.1002/sim.8426
Degtiar, I., & Rose, S. (2023). A review of generalizability and transportability. Annual Review of Statistics and Its Application, 10(1), 501−524. https://doi.org/10.1146/annurev-statistics-042522-103837
Dienstmann, R., Rodon, J., & Tabernero, J. (2013). Biomarker-driven patient selection for early clinical trials. Current Opinion in Oncology, 25(3), 305–312. https://doi.org/10.1097/CCO.0b013e32835ff3cb
Ebrahim, S., & Davey Smith, G. (2013). Commentary: Should we always deliberately be non-representative? International Journal of Epidemiology, 42(4), 1022–1026. https://doi.org/10.1093/ije/dyt105
Edgington, E., & Onghena, P. (2007). Randomization tests (4th ed.). Chapman and Hall/CRC. https://doi.org/10.1201/9781420011814
Eisenhauer, E. A. (2001). From the molecule to the clinic—inhibiting HER2 to treat breast cancer. New England Journal of Medicine, 344(11), 841–842. https://doi.org/10.1056/NEJM200103153441110
Fisher, R. A. (1949). The design of experiments (5th ed.). Oliver & Boyd.
Garralda, E., Dienstmann, R., Piris-Gimenez, A., Brana, I., Rodon, J., & Tabernero, J. (2019). New clinical trial designs in the era of precision medicine. Molecular Oncology, 13(3), 549–557. https://doi.org/10.1002/1878-0261.12465
Gelman, A., Hill, J., & Vehtari, A. (2020). Regression and other stories. Cambridge University Press. https://doi.org/10.1017/9781139161879
Greenland, S. (1990). Randomization, statistics, and causal inference. Epidemiology, 1(6), 421–429. https://doi.org/10.1097/00001648-199011000-00003
Hayes, D. F. (2019). HER2 and breast cancer — A phenomenal success story. New England Journal of Medicine, 381(13), 1284–1286. https://doi.org/10.1056/NEJMcibr1909386
Hernan, M. A., Wang, W., & Leaf, D. E. (2022). Target trial emulation: A framework for causal inference from observational data. JAMA, 328(24), 2446–2447. https://doi.org/10.1001/jama.2022.21383
Hulme, W. J., Williamson, E., Horne, E. M. F., Green, A., McDonald, H. I., Walker, A. J., Curtis, H. J., Morton, C. E., MacKenna, B., Croker, R., Mehrkar, A., Bacon, S., Evans, D., Inglesby, P., Davy, S., Bhaskaran, K., Schultze, A., Rentsch, C. T., Tomlinson, L., ... Sterne, J. A. C. (2023). Challenges in estimating the effectiveness of COVID-19 vaccination using observational data. Annals of Internal Medicine, 176(5), 685–693. https://doi.org/10.7326/M21-4269
Melamud, E., Taylor, D. L., Sethi, A., Cule, M., Baryshnikova, A., Saleheen, D., Van Bruggen, N., & Fitzgerald, G. A. (2020). The promise and reality of therapeutic discovery from large cohorts. Journal of Clinical Investigation, 130(2), 575–581. https://doi.org/10.1172/JCI129196
Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2), 685–726. https://doi.org/10.1214/18-aoas1161sf
Meng, X.-L. (2022). Double your variance, dirtify your Bayes, devour your pufferfish, and draw your kidstrogram. The New England Journal of Statistics in Data Science, 1(1), 4–23. https://doi.org/10.51387/22-NEJSDS6
Modi, S., Jacot, W., Yamashita, T., Sohn, J., Vidal, M., Tokunaga, E., Tsurutani, J., Ueno, N. T., Prat, A., Chae, Y. S., Lee, K. S., Niikura, N., Park, Y. H., Xu, B., Wang, X., Gil-Gil, M., Li, W., Pierga, J. Y., Im, S. A., ... Cameron, D. A. (2022). Trastuzumab deruxtecan in previously treated HER2-low advanced breast cancer. New England Journal of Medicine, 387(1), 9–20. https://doi.org/10.1056/NEJMoa2203690
Msaouel, P. (2021). Impervious to randomness: Confounding and selection biases in randomized clinical trials. Cancer Investigation, 39(10), 783–788. https://doi.org/10.1080/07357907.2021.1974030
Msaouel, P. (2022). The big data paradox in clinical practice. Cancer Investigation, 40(7), 567–576. https://doi.org/10.1080/07357907.2022.2084621
Msaouel, P., Hong, A. L., Mullen, E. A., Atkins, M. B., Walker, C. L., Lee, C. H., Carden, M. A., Genovese, G., Linehan, W. M., Rao, P., Merino, M. J., Grodman, H., Dome, J. S., Fernandez, C. V., Geller, J. I., Apolo, A. B., Daw, N. C., Hodges, H. C., Moxey-Mims, M., ... Tannir, N. M. (2019). Updated recommendations on the diagnosis, management, and clinical trial eligibility criteria for patients with renal medullary carcinoma. Clinical Genitourinary Cancer, 17(1), 1–6. https://doi.org/10.1016/j.clgc.2018.09.005
Msaouel, P., Jimenez-Fonseca, P., Lim, B., Carmona-Bayonas, A., & Agnelli, G. (2022). Medicine before and after David Cox. European Journal of Internal Medicine, 98, 1–3. https://doi.org/10.1016/j.ejim.2022.02.022
Msaouel, P., Lee, J., Karam, J. A., & Thall, P. F. (2022). A causal framework for making individualized treatment decisions in oncology. Cancers, 14(16), Article 3923. https://doi.org/10.3390/cancers14163923
Polack, F. P., Thomas, S. J., Kitchin, N., Absalon, J., Gurtman, A., Lockhart, S., Perez, J. L., Perez Marc, G., Moreira, E. D., Zerbini, C., Bailey, R., Swanson, K. A., Roychoudhury, S., Koury, K., Li, P., Kalina, W. V., Cooper, D., Frenck, R. W., Jr., Hammitt, L. L., ... Gruber, W. C. (2020). Safety and efficacy of the BNT162b2 mRNA Covid-19 vaccine. New England Journal of Medicine, 383(27), 2603–2615. https://doi.org/10.1056/NEJMoa2034577
Pound, P., & Ritskes-Hoitinga, M. (2018). Is it possible to overcome issues of external validity in preclinical animal research? Why most animal models are bound to fail. Journal of Translational Medicine, 16(1), Article 304. https://doi.org/10.1186/s12967-018-1678-1
Richiardi, L., Pizzi, C., & Pearce, N. (2013). Commentary: Representativeness is usually not necessary and often should be avoided. International Journal of Epidemiology, 42(4), 1018–1022. https://doi.org/10.1093/ije/dyt103
Rosenberger, W. F., Uschner, D., & Wang, Y. (2019). Randomization: The forgotten component of the randomized clinical trial. Statistics in Medicine, 38(1), 1–12. https://doi.org/10.1002/sim.7901
Rothman, K. J., Gallacher, J. E., & Hatch, E. E. (2013a). Rebuttal: When it comes to scientific inference, sometimes a cigar is just a cigar. International Journal of Epidemiology, 42(4), 1026–1028. https://doi.org/10.1093/ije/dyt124
Rothman, K. J., Gallacher, J. E., & Hatch, E. E. (2013b). Why representativeness should be avoided. International Journal of Epidemiology, 42(4), 1012–1014. https://doi.org/10.1093/ije/dys223
Senn, S. (2004). Controversies concerning randomization and additivity in clinical trials. Statistics in Medicine, 23(24), 3729–3753. https://doi.org/10.1002/sim.2074
Senn, S. (2021). Statistical issues in drug development. John Wiley and Sons. https://doi.org/10.1002/9781119238614
Shapiro, D. D., Soeung, M., Perelli, L., Dondossola, E., Surasi, D. S., Tripathi, D. N., Bertocchio, J. P., Carbone, F., Starbuck, M. W., Van Alstine, M. L., Rao, P., Katz, M. H. G., Parker, N. H., Shah, A. Y., Carugo, A., Heffernan, T. P., Schadler, K. L., Logothetis, C., Walker, ... Msaouel, P. (2021). Association of high-intensity exercise with renal medullary carcinoma in individuals with sickle cell trait: Clinical observations and experimental animal studies. Cancers, 13(23), Article 6022. https://doi.org/10.3390/cancers13236022
Soeung, M., Perelli, L., Chen, Z., Dondossola, E., Ho, I. L., Carbone, F., Zhang, L., Khan, H., Le, C. N., Zhu, C., Peoples, M. D., Feng, N., Jiang, S., Zacharias, N. M., Minelli, R., Shapiro, D. D., Deem, A. K., Gao, S., ... Genovese, G. (2023). SMARCB1 regulates the hypoxic stress response in sickle cell trait. PNAS, 120(21), Article e2209639120. https://doi.org/10.1073/pnas.2209639120
Woolsey, T. D., Cochran, W. G., Mainland, D., Martin, M. P., Moore, F. E. Jr., & Patton, R. E. (1954). On the use of sampling in the field of public health. American Journal of Public Health, 44(6), 719–740. https://doi.org/10.2105/ajph.44.6.719
©2023 Pavlos Msaouel. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.