
Decoding Randomized Controlled Trials: An Information Science Perspective

Published on Apr 30, 2024

Column Editor’s Note: Randomized controlled trials (RCTs) are regarded as a form of ‘gold-standard’ for causal inferences in real-world settings. Real-world RCTs involve key design and analysis decisions that can strongly affect the meaning and validity of conclusions drawn. Dr. Msaouel explores some of these issues and questions from an information science perspective, shedding light on, among other aspects, random sampling and randomization, refutational information, and Bayesian shrinkage.

Keywords: randomized controlled trials, information theory, s values, hypothesis testing

Data Detective: What Mechanisms Generated the Observed Results?

Randomized controlled trials (RCTs) are experiments comparing two or more interventions. Originally invented by Ronald A. Fisher in the 1920s as a strategy to make reliable inferences in agricultural experiments (Fisher, 1928), RCTs were then brought to medicine in 1946 to establish the validity of streptomycin as the first antibiotic cure for tuberculosis (Armitage, 2003; Rosenberger et al., 2019). Their success in determining the effectiveness of interventions ultimately led to the adoption of RCTs across many other fields such as psychology, economics, education, marketing, and business (Bojinov & Gupta, 2022; Dominici et al., 2021; Hoynes, 2023; Medical Research Council, 1948). In medicine alone, approximately 140 RCTs are published daily, comprising a vast collection of data that need to be efficiently interpreted (Marshall et al., 2020).

The increasing complexity of RCT designs further requires readers to concentrate on the most informative signals, while steering clear of noisy outputs. The first question to ask is, how were the data generated? This includes considering whether the trial used the proper concurrent controls and specified meaningful endpoints to estimate, and whether there are unaccounted sources of systematic errors such as improperly selected patients participating in the study, nonadherence to the assigned intervention, or performance bias due to the participants or investigators being aware of the assigned intervention, which can inadvertently or deliberately influence their behavior (Mansournia et al., 2017; Msaouel et al., 2023; Senn, 2021). The entire data life cycle also needs to be considered, including whether data are missing and how they were captured, processed, and linked into the final RCT data set (Christen & Schnell, 2023; Meng, 2021). All these can influence the internal validity of an RCT, that is, the extent to which the observed results can be attributed to the experimental interventions as opposed to other factors. No amount of statistical tinkering can salvage a poorly designed and conducted experiment.

RCTs use randomness to generate information. The type of random procedure used guides the statistical inferences we can make based on the observed data (Greenland, 1990; Msaouel, 2023; Msaouel et al., 2023). Random sampling allows us to generalize our inferences from the sample to the broader population. For example, if we want to determine the weight loss at 68 weeks from treatment initiation in the population of individuals in the United States who have been prescribed the drug semaglutide, we can obtain a random sample from that population and generate an estimate, along with uncertainty measures such as 95% confidence intervals (CIs), of the weight loss at 68 weeks for this treatment group. The practical challenges of performing such random sampling and strategies to address them are discussed in Bailey (2023, 2024).
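The sampling logic can be sketched in a few lines of R. This is only an illustration: the population below is simulated, and its size, mean, and spread are hypothetical values chosen for the example rather than data from any study.

```r
# Hypothetical population of percentage weight changes at 68 weeks
# (all numbers are illustrative assumptions, not real data)
set.seed(2024)
population <- rnorm(100000, mean = -10, sd = 8)

# Random sampling: every individual has the same chance of being selected
random_sample <- sample(population, size = 500)

# Group-specific estimate with a 95% confidence interval
estimate <- mean(random_sample)
std_error <- sd(random_sample) / sqrt(length(random_sample))
ci_95 <- estimate + c(-1, 1) * qt(0.975, df = length(random_sample) - 1) * std_error

round(c(estimate = estimate, lower = ci_95[1], upper = ci_95[2]), 2)
```

Because the sample was drawn at random, the interval can be read as a statement about the whole simulated population and not only about the 500 sampled individuals.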

Randomization is distinct from random sampling and refers to the random allocation of the interventions that are being compared in an RCT. Whereas random sampling is used for group-specific inferences, randomization allows us to perform comparative inferences. For example, a double-blind RCT found that semaglutide demonstrated greater weight reduction at 68 weeks compared with placebo control (treatment difference of −12.4 percentage points with 95% CI of −13.4 to −11.5) (Wilding et al., 2021).

It is possible in certain scenarios to perform both random sampling and randomization in an RCT. For example, we can randomly sample a list of voters from a population of interest and then randomly assign them to receive or not receive voter turnout encouragement mails. However, medical RCTs almost always use convenience samples enrolled based on the trial entry criteria and other considerations such as the participants’ ability to access the trial and their willingness to consent to trial entry. Therefore, group-specific results such as the mean survival time for a treatment arm in a medical RCT offer limited insight as estimates of parameters for broader patient populations.
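The two-step voter example could look roughly like the following R sketch. The voter list, its size, and the arm labels are hypothetical placeholders introduced only to show the order of operations.

```r
set.seed(42)

# Hypothetical voter list standing in for the population of interest
voter_list <- data.frame(voter_id = 1:50000)

# Step 1, random sampling: supports generalizing results to the voter list
sampled <- voter_list[sample(nrow(voter_list), size = 2000), , drop = FALSE]

# Step 2, randomization: supports comparing mail versus no mail
sampled$arm <- sample(rep(c("mail", "no_mail"), each = nrow(sampled) / 2))

table(sampled$arm)
```

A medical RCT that enrolls a convenience sample performs only the second step, which is why its comparative results are more informative than its group-specific ones.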

Conversely, comparative estimates represented by differences, for example, in mean survival probability, or by ratios such as risk ratios (RR), odds ratios (OR), or hazard ratios (HR) are the most informative outputs of RCTs that use convenience samples. Transporting this comparative knowledge to other populations, such as the patients a clinician sees in their clinic, often requires contextual knowledge of these target populations beyond the information generated by RCTs, and is an evolving topic of methodological research in statistics, epidemiology, and computer science (Bareinboim & Pearl, 2016; Dahabreh & Hernan, 2019; Dahabreh et al., 2020; Goldstein et al., 2020; Msaouel, 2023; Msaouel et al., 2022). Herein, we will focus exclusively on what we can learn from RCTs alone using practical approaches that help separate informative signals from noise.
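For readers less familiar with these comparative measures, the short R sketch below computes a risk difference, a risk ratio, and an odds ratio from a two-arm table of event counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow; hazard ratios additionally require time-to-event data and are not shown here.

```r
# Hypothetical event counts (illustrative only, not from any trial)
events_treatment <- 30
n_treatment <- 200
events_control <- 50
n_control <- 200

risk_treatment <- events_treatment / n_treatment  # 0.15
risk_control <- events_control / n_control        # 0.25

# Comparative estimates: one difference and two ratios
risk_difference <- risk_treatment - risk_control
risk_ratio <- risk_treatment / risk_control
odds_ratio <- (risk_treatment / (1 - risk_treatment)) /
  (risk_control / (1 - risk_control))

round(c(RD = risk_difference, RR = risk_ratio, OR = odds_ratio), 3)
```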

Converting RCT Outputs Into Information

The philosopher of science Karl Popper noted that the information produced by research studies can only support a hypothesis in relation to other competing hypotheses, whereas it can refute a hypothesis even in the absence of an alternative explanation (Greenland, 1998; Popper, 1963). Therefore, refuting a hypothesis requires fewer assumptions than confirming it. The information provided by RCTs is accordingly often used to refute tested hypotheses of interest, such as the null hypothesis of no difference between the randomized interventions.

The frequentist p values often reported in RCTs are a purely refutational output defined as the probability that the chosen test statistic would be equal to or more extreme than what was observed if the tested hypothesis holds and all the assumptions used to compute it are correct. This arcane definition has long baffled readers of RCTs, leading to gross misinterpretations (Greenland et al., 2016). A more intuitive way to interpret p values is to convert them into binary digits (bits) of refutational information, also known as s values, Shannon information values, or surprise values, and defined as s = −log2(p) (Mansournia et al., 2022; Rafi & Greenland, 2020). Online calculators are available to transform p values to s values, and statistical software packages such as GraphPad Prism have started to include s values in their outputs. Converting probabilities into bits of information offers essential insights for a wide range of activities, from interpreting RCTs to optimizing strategies in games like Wordle (see example

The obtained p value from an RCT is no more surprising than fairly flipping a coin s times and observing that every toss resulted in tails. Note that s values measure information against all underlying statistical model assumptions, including the tested hypothesis. For simplicity, we will assume here that all other statistical model assumptions are correct and thus the s values will measure our surprise exclusively against the tested hypothesis, typically the null hypothesis of no difference.

Larger s values correspond to greater surprise and thus stronger evidence to refute the assumption that the coin is tossed fairly, which corresponds to the assumption that the null hypothesis is correct. For example, a phase 3 RCT in 4,304 patients with chronic kidney disease reported that the drug dapagliflozin improved survival compared with placebo control (HR = 0.69, 95% CI 0.53 to 0.88, p = .004) (Heerspink et al., 2020). This p value corresponds to s = −log2(.004) ≈ 8 bits of refutational information against the null hypothesis that HR = 1.0. This is equivalent to the degree of surprise that should result from observing all tails in eight fair coin tosses.
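The conversion is a one-line calculation. In the R sketch below, the helper name s_value is our own choice for illustration and does not come from any package.

```r
# Convert a p value into bits of refutational information (s value)
s_value <- function(p) -log2(p)

# p value reported for dapagliflozin versus placebo (Heerspink et al., 2020)
s_dapa <- s_value(0.004)
s_dapa  # about 7.97, which rounds to 8 bits

# Coin-toss reading: the chance of k tails in k fair tosses is 0.5^k,
# so converting that probability back returns k bits
s_value(0.5^8)  # 8
```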

Because bits are the smallest discrete measure of information, s values are typically rounded to the nearest integer. Therefore p = .049, .051, and .06 all supply approximately four bits of surprise and thus carry roughly similar information even though only p = .049 is below the conventional threshold of .05 often chosen arbitrarily to denote ‘statistical significance.’ Conversely, p = .25 supplies only two bits of surprise, which is only half of the information provided by p = .06.

The commonly used p value threshold of .05 only equates to approximately four bits of refutational information, equivalent to our surprise after obtaining four tails in four presumably fair coin tosses. By contrast, in particle physics, the standard for significance is much higher, at 22 bits of refutational information, equivalent to a p value of ≤ 2.87 × 10⁻⁷ (Junk & Lyons, 2020). This is akin to fairly flipping a coin 22 times and getting tails on every toss.
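The rounding comparisons above are easy to check in R. As before, s_value is an illustrative helper name of our own.

```r
s_value <- function(p) -log2(p)

# p values on either side of the conventional .05 threshold
p_values <- c(0.049, 0.051, 0.06, 0.25)
bits <- round(s_value(p_values))  # 4, 4, 4, 2
data.frame(p = p_values, s = round(s_value(p_values), 2), bits = bits)

# Particle physics threshold quoted above
s_physics <- s_value(2.87e-7)
round(s_physics)  # 22

# The same p value is the one-sided normal tail area beyond
# five standard deviations
pnorm(5, lower.tail = FALSE)
```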

s values can also be used to intuitively interpret frequentist CIs, another RCT output that often perplexes readers (Greenland et al., 2016). The common misinterpretation is to infer that the true value of the comparative effect has a 95% probability of being contained within the observed 95% CI. In fact, an observed frequentist 95% CI will either contain or not contain the true value. The 95% only refers to the frequency with which the true effect would be contained in unobserved 95% CIs computed across many RCTs if all the assumptions used to compute the CI are correct. This definition can become clearer and more pertinent to the actual RCT results, without having to imagine a series of hypothetical RCTs, if we consider that a frequentist 95% CI corresponds to a p value threshold of 1 − .95 = .05, which can be transformed to s = −log2(.05) ≈ 4 bits. Therefore, if the assumptions used to generate a frequentist 95% CI are correct, this interval contains values against which there are no more than four bits of refutational information based on the RCT data.

For example, in the aforementioned RCT of dapagliflozin versus placebo in patients with chronic kidney disease (Heerspink et al., 2020), the 95% CI of 0.53 to 0.88 suggests that HR values within that interval are at most as surprising as observing all tails in four fair coin tosses. HR values outside this interval have more than four bits of refutational information against them, whereas the point estimate value HR = 0.69 has the least amount of information against it. Likewise, frequentist 99% CIs, which correspond to a p value cutoff of 1 − .99 = .01, contain values against which there are no more than s = −log2(.01) ≈ 7 bits of refutational information. Several recent reviews offer further guidance on how to transform statistical outputs into intuitive information measures (Amrhein et al., 2019; Cole et al., 2021; Greenland et al., 2022; Mansournia et al., 2022; Rafi & Greenland, 2020).
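This reading of CIs can be approximated from published summary numbers alone. The R sketch below assumes a normal approximation on the log HR scale and recovers the standard error from the reported interval; the helper s_against is our own illustrative name. Because the published HR and CI limits are rounded, the bits computed at the interval limits will only be close to four, not exactly four.

```r
# Reported results (Heerspink et al., 2020)
hr_hat <- 0.69
ci_lower <- 0.53
ci_upper <- 0.88

# Approximate standard error of log(HR), recovered from the 95% CI width
# (assumes a normal approximation on the log scale)
se_log_hr <- (log(ci_upper) - log(ci_lower)) / (2 * qnorm(0.975))

# Bits of refutational information against a hypothesized HR value
s_against <- function(hr0) {
  z <- (log(hr_hat) - log(hr0)) / se_log_hr
  p <- 2 * pnorm(-abs(z))
  -log2(p)
}

# Interval limits, point estimate, and the null value
hr_grid <- c(0.53, 0.69, 0.88, 1.00)
data.frame(HR = hr_grid, bits = round(s_against(hr_grid), 1))

# Bit thresholds implied by 95% and 99% CIs
-log2(1 - c(0.95, 0.99))
```

The point estimate receives zero bits against it, values near the interval limits receive roughly four to five bits, and the null value HR = 1.0 receives close to eight bits, in line with the reported p value.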

Trinary Hypothesis Testing

Figure 1 illustrates three distinct RCT scenarios (the simulation code is available in the Appendix) that can easily be discerned by focusing on the comparative differences between the treatment and control groups. The survival plots were generated using the survplotp function from the rms package in R, version 4.3.0 (R Core Team, 2023). (See the tutorial video at These plots visualize the time points at which the p value for differences between groups is <.05, as indicated by the shaded gray polygon not crossing the survival curves. This gray polygon represents the half-width confidence bands for the cumulative event curve difference under the assumption that there is no treatment effect difference and is therefore a more appropriate visualization of RCT uncertainty than showing the confidence bands for each survival curve (Boers, 2004). Instances where the confidence bands intersect with the survival curves correspond to a p value > .05.

Narrow confidence bands suggest lower uncertainty, while wider areas imply greater uncertainty in estimating the differences between the groups. Careful inspection of the data can inform decisions on whether to reject (Figure 1A) or accept (Figure 1B) the null hypothesis of no difference between the treatment and control groups, or to deem the RCT data as inconclusive (Figure 1C). This trinary framework is more comprehensive and informative than the traditional convention of either accepting or rejecting the null hypothesis, which cannot distinguish between the two very different scenarios shown in Figures 1B and 1C.

Figure 1. Kaplan-Meier survival plots from three simulated randomized controlled trial (RCT) scenarios. Survival differences are represented by the half-width confidence bands (shaded gray polygon) corresponding to the midpoint of the survival estimates of the treatment and control group ± the half-width of the 95% confidence interval (CI) for the difference in each group’s Kaplan-Meier probability estimates. The half-width confidence bands are positioned equidistant from the two groups’ midpoints, ensuring that if a confidence band intersects one survival curve, it will intersect the other as well. At the time points where the half-width confidence bands intersect the survival curves, the p value is greater than .05 (without adjustment for multiplicity) for the null hypothesis of no treatment group difference. A univariable Cox proportional hazards model was used to estimate hazard ratios (HR) and their CIs, p values, and s values. (A) This RCT demonstrates a consistently strong signal of survival difference between the treatment and control groups. The very narrow confidence bands indicate that the information yielded is high. (B) This RCT starts with the same sample size, but there is substantial censoring, resulting in wider confidence bands due to loss of information. Nevertheless, the confidence bands remain narrow enough to suggest an adequate signal-to-noise ratio, and they consistently cross the survival curves, suggesting that the data do not refute the null hypothesis of no treatment difference. (C) The signal-to-noise ratio in this RCT is very low, as evidenced by the large width of the confidence bands throughout the study. The low sample size and considerable censoring hinder our ability to make any reliable inferences from this inconclusive RCT.
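To make the geometry of these bands concrete, the R sketch below computes them by hand with the survival package. It is not the survplotp implementation. It rests on our own reading of the legend: the band is centered at the midpoint of the two Kaplan-Meier curves and its total height equals the half-width (margin of error) of the pointwise 95% CI for the difference, so that a curve falls outside the band exactly when the pointwise p value is below .05. The simulated trial, the time grid, and all parameter values are hypothetical.

```r
library(survival)

# Simulated two-arm trial (all parameters are hypothetical)
set.seed(123)
n <- 400
arm <- rep(c("control", "treatment"), each = n / 2)
time <- rexp(n, rate = ifelse(arm == "control", 0.10, 0.06))
trial <- data.frame(arm, time, status = 1)  # no censoring, for simplicity

# Kaplan-Meier estimate and standard error for one arm on a time grid
grid <- 1:20
km_at <- function(a) {
  fit <- survfit(Surv(time, status) ~ 1, data = trial[trial$arm == a, ])
  summary(fit, times = grid, extend = TRUE)
}
km_control <- km_at("control")
km_treatment <- km_at("treatment")

# Pointwise difference between arms and its standard error
difference <- km_treatment$surv - km_control$surv
se_difference <- sqrt(km_control$std.err^2 + km_treatment$std.err^2)

# Margin of error (half-width) of the 95% CI for the difference
margin <- qnorm(0.975) * se_difference

# Band centered at the midpoint, with total height equal to the margin
midpoint <- (km_treatment$surv + km_control$surv) / 2
band_lower <- midpoint - margin / 2
band_upper <- midpoint + margin / 2

# TRUE where both curves lie outside the band (pointwise p < .05)
outside_band <- abs(difference) > margin
data.frame(time = grid, band_lower, band_upper, outside_band)
```

Each curve sits half the difference away from the midpoint, which is why the band either crosses both curves or neither.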

Bayesian Shrinkage

RCT results can be analyzed using either frequentist or Bayesian models. We focused here on frequentist outputs because they are more commonly used (Hahn et al., 2022; Tidwell et al., 2019). An advantage of Bayesian approaches is that they can use prior distributions to constrain the observed effects of RCTs. Such shrinkage of published RCT results can be useful because studies that yield negative results are less likely to be accepted by medical journals. Furthermore, many published RCTs have a low signal-to-noise ratio, defined as the ratio of the true treatment effect to the standard error of its estimate (van Zwet et al., 2023). Consequently, the effect sizes reported in published RCTs are often biased toward higher values, leading to an overestimation of the actual differences in treatment efficacy (Ioannidis, 2008).
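The exaggeration effect can be reproduced with a small simulation. The R sketch below makes two deliberately crude assumptions that are ours and not the article's: every trial estimates the same hypothetical true effect with the same standard error, and only trials with p < .05 are "published."

```r
set.seed(2024)

true_effect <- 0.2  # hypothetical true treatment effect
std_error <- 0.2    # hypothetical standard error of each trial's estimate
n_trials <- 100000

# Signal-to-noise ratio as defined above: true effect / standard error
snr <- true_effect / std_error

# Each simulated trial reports a noisy estimate of the same true effect
estimates <- rnorm(n_trials, mean = true_effect, sd = std_error)

# Crude publication filter: keep only results with two-sided p < .05
significant <- abs(estimates / std_error) > qnorm(0.975)

# Average reported effect size among the "published" trials
mean_significant <- mean(abs(estimates[significant]))
exaggeration <- mean_significant / true_effect

c(snr = snr, share_published = mean(significant), exaggeration = exaggeration)
```

With a signal-to-noise ratio this low, any estimate that clears the significance threshold must be almost twice the true effect or larger, so the published average overstates it. Raising the signal-to-noise ratio in the sketch shrinks the exaggeration toward 1.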

A recent analysis of 23,551 medical RCTs from the Cochrane Database of Systematic Reviews (CDSR) offered an empirical basis to create a prior distribution that can account for the expected exaggeration effect observed in published RCTs (van Zwet & Cator, 2021; van Zwet, Schwab, & Greenland, 2021; van Zwet, Schwab, & Senn, 2021). If a published RCT meets the quality standards used for inclusion in the CDSR, then this prior distribution can be plausibly used to shrink the reported frequentist estimate and thus mitigate the exaggeration effect. A free web-based application can be used to perform this Bayesian shrinkage of frequentist data reported from RCTs.

In the phase 3 RCT comparing dapagliflozin versus placebo in patients with chronic kidney disease (Heerspink et al., 2020), the Bayesian shrinkage yields a posterior mean HR of 0.76 with a 95% posterior credible interval of 0.59 to 0.97 (Figure 2). This credible interval is more intuitive than frequentist CIs and indicates that, given this prior, the assumed statistical model, and the data, the probability that the HR falls between 0.59 and 0.97 is 0.95. The posterior probability that the HR is larger than 1.0 is 0.011. This can be converted to a posterior surprise of −log2(0.011) ≈ 7 bits of information. This represents the surprise we would have at finding out that HR > 1.0 after having seen the data.
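The mechanics of shrinkage can be illustrated with a simple conjugate normal model in R. This sketch does not use the CDSR-based prior described above. It assumes a single zero-centered normal prior on log(HR) with a made-up standard deviation, so its output will differ from the posterior values quoted in the text. It is meant only to show how a prior pulls the estimate toward the null and how a posterior probability converts to bits.

```r
# Reported frequentist results (Heerspink et al., 2020)
hr_hat <- 0.69
ci_lower <- 0.53
ci_upper <- 0.88
log_hr <- log(hr_hat)
se <- (log(ci_upper) - log(ci_lower)) / (2 * qnorm(0.975))

# ASSUMPTION: zero-centered normal prior on log(HR) with a hypothetical SD.
# This is NOT the CDSR-based prior of van Zwet and colleagues.
prior_sd <- 0.25

# Conjugate normal-normal update
shrinkage <- prior_sd^2 / (prior_sd^2 + se^2)
post_mean <- shrinkage * log_hr
post_sd <- sqrt(shrinkage) * se

# Posterior summaries on the HR scale
post_hr <- exp(post_mean)
cred_int <- exp(post_mean + c(-1, 1) * qnorm(0.975) * post_sd)

# Posterior probability that HR > 1, and the corresponding surprise in bits
p_hr_above_1 <- pnorm(0, mean = post_mean, sd = post_sd, lower.tail = FALSE)
surprise_bits <- -log2(p_hr_above_1)

round(c(HR = post_hr, lower = cred_int[1], upper = cred_int[2],
        p_above_1 = p_hr_above_1, bits = surprise_bits), 3)

# Conversion of the posterior probability quoted in the text
-log2(0.011)  # about 6.5, which rounds to 7 bits
```

A smaller prior_sd expresses stronger skepticism and pulls the posterior HR closer to 1.0, while a very large prior_sd returns essentially the original frequentist estimate.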

Next Steps

Comparative inferences are the main focus of RCTs that randomize the interventions but do not randomly select the enrolled sample. This knowledge will then need to be transported to populations that were not necessarily represented in the RCT sample. Data science frameworks are being developed to inform this task (Bareinboim & Pearl, 2016; Dahabreh et al., 2020; Msaouel, 2023; Msaouel et al., 2022). A key assumption is that the target populations will share with the RCT sample relevant biological and other causal properties that influence the comparative effect. This assumption is commonly used throughout science to transport inferences yielded by experiments.

Disclosure Statement

Pavlos Msaouel has received honoraria for service on scientific advisory boards for Mirati Therapeutics, Bristol Myers Squibb, and Exelixis; consulting for Axiom Healthcare Strategies; nonbranded educational programs supported by DAVA Oncology, Exelixis, and Pfizer; and research funding for clinical trials from Takeda, Bristol Myers Squibb, Mirati Therapeutics, Gateway for Cancer Research, and the University of Texas MD Anderson Cancer Center.

References


Amrhein, V., Trafimow, D., & Greenland, S. (2019). Inferential statistics as descriptive statistics: There is no replication crisis if we don’t expect replication. The American Statistician, 73(sup1), 262–270.

Armitage, P. (2003). Fisher, Bradford Hill, and randomization. International Journal of Epidemiology, 32(6), 925–928.

Bailey, M. (2023). A new paradigm for polling. Harvard Data Science Review, 5(3).

Bailey, M. A. (2024). Polling at a crossroads: Rethinking modern survey research. Cambridge University Press.

Bareinboim, E., & Pearl, J. (2016). Causal inference and the data-fusion problem. PNAS, 113(27), 7345–7352.

Boers, M. (2004). Null bar and null zone are better than the error bar to compare group means in graphs. Journal of Clinical Epidemiology, 57(7), 712–715.

Bojinov, I., & Gupta, S. (2022). Online experimentation: Benefits, operational and methodological challenges, and scaling guide. Harvard Data Science Review, 4(2).

Christen, P., & Schnell, R. (2023). Thirty-three myths and misconceptions about population data: From data capture and processing to linkage. International Journal for Population Data Science, 8(1), Article 2115.

Cole, S. R., Edwards, J. K., & Greenland, S. (2021). Surprise! American Journal of Epidemiology, 190(2), 191–193.

Dahabreh, I. J., & Hernan, M. A. (2019). Extending inferences from a randomized trial to a target population. European Journal of Epidemiology, 34(8), 719–722.

Dahabreh, I. J., Robertson, S. E., Steingrimsson, J. A., Stuart, E. A., & Hernan, M. A. (2020). Extending inferences from a randomized trial to a new target population. Statistics in Medicine, 39(14), 1999–2014.

Dominici, F., Bargagli-Stoffi, F. J., & Mealli, F. (2021). From controlled to undisciplined data: Estimating causal effects in the era of data science using a potential outcome framework. Harvard Data Science Review, 3(3).

Fisher, R. A. (1928). Statistical methods for research workers. Oliver and Boyd.

Goldstein, N. D., LeVasseur, M. T., & McClure, L. A. (2020). On the convergence of epidemiology, biostatistics, and data science. Harvard Data Science Review, 2(2).

Greenland, S. (1990). Randomization, statistics, and causal inference. Epidemiology, 1(6), 421–429.

Greenland, S. (1998). Induction versus Popper: Substance versus semantics. International Journal of Epidemiology, 27(4), 543–548.

Greenland, S., Mansournia, M. A., & Joffe, M. (2022). To curb research misreporting, replace significance and confidence by compatibility: A Preventive Medicine Golden Jubilee article. Preventive Medicine, 164, Article 107127.

Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: A guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350.

Hahn, A. W., Dizman, N., & Msaouel, P. (2022). Missing the trees for the forest: Most subgroup analyses using forest plots at the ASCO annual meeting are inconclusive. Therapeutic Advances in Medical Oncology, 14.

Heerspink, H. J. L., Stefansson, B. V., Correa-Rotter, R., Chertow, G. M., Greene, T., Hou, F. F., Mann, J. F. E., McMurray, J. J. V., Lindberg, M., Rossing, P., Sjostrom, C. D., Toto, R. D., Langkilde, A. M., & Wheeler, D. C., for the DAPA-CKD Trial Committees and Investigators. (2020). Dapagliflozin in patients with chronic kidney disease. New England Journal of Medicine, 383(15), 1436–1446.

Hoynes, H. (2023). Reproducibility in economics: Status and update. Harvard Data Science Review, 5(3).

Ioannidis, J. P. (2008). Why most discovered true associations are inflated. Epidemiology, 19(5), 640–648.

Junk, T. R., & Lyons, L. (2020). Reproducibility and replication of experimental particle physics results. Harvard Data Science Review, 2(4).

Mansournia, M. A., Higgins, J. P., Sterne, J. A., & Hernan, M. A. (2017). Biases in randomized trials: A conversation between trialists and epidemiologists. Epidemiology, 28(1), 54–59.

Mansournia, M. A., Nazemipour, M., & Etminan, M. (2022). P-value, compatibility, and S-value. Global Epidemiology, 4, Article 100085.

Marshall, I. J., Nye, B., Kuiper, J., Noel-Storr, A., Marshall, R., Maclean, R., Soboczenski, F., Nenkova, A., Thomas, J., & Wallace, B. C. (2020). Trialstreamer: A living, automatically updated database of clinical trial reports. Journal of the American Medical Informatics Association, 27(12), 1903–1912.

Medical Research Council. (1948). Streptomycin treatment of pulmonary tuberculosis. British Medical Journal, 2(4582), 769–782.

Meng, X.-L. (2021). Enhancing (publications on) data quality: Deeper data minding and fuller data confession. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4), 1161–1175.

Msaouel, P. (2023). The role of sampling in medicine. Harvard Data Science Review, 5(3).

Msaouel, P., Lee, J., Karam, J. A., & Thall, P. F. (2022). A causal framework for making individualized treatment decisions in oncology. Cancers, 14(16), Article 3923.

Msaouel, P., Lee, J., & Thall, P. F. (2023). Interpreting randomized controlled trials. Cancers, 15(19), Article 4674.

Popper, K. R. (1963). Conjectures and refutations: The growth of scientific knowledge. Routledge & Kegan Paul.

R Core Team. (2023). R: A language and environment for statistical computing. R Foundation for Statistical Computing.

Rafi, Z., & Greenland, S. (2020). Semantic and cognitive tools to aid statistical science: Replace confidence and significance by compatibility and surprise. BMC Medical Research Methodology, 20(1), Article 244.

Rosenberger, W. F., Uschner, D., & Wang, Y. (2019). Randomization: The forgotten component of the randomized clinical trial. Statistics in Medicine, 38(1), 1–12.

Senn, S. (2021). Statistical issues in drug development. John Wiley and Sons.

Tidwell, R. S. S., Peng, S. A., Chen, M., Liu, D. D., Yuan, Y., & Lee, J. J. (2019). Bayesian clinical trials at The University of Texas MD Anderson Cancer Center: An update. Clinical Trials, 16(6), 645–656.

van Zwet, E. W., & Cator, E. A. (2021). The significance filter, the winner's curse and the need to shrink. Statistica Neerlandica, 75(4), 437–452.

van Zwet, E., Schwab, S., & Greenland, S. (2021). Addressing exaggeration of effects from single RCTs. Significance, 18(6), 16–21.

van Zwet, E., Schwab, S., & Senn, S. (2021). The statistical properties of RCTs and a proposal for shrinkage. Statistics in Medicine, 40(27), 6107–6117.

van Zwet, E. W., Tian, L., & Tibshirani, R. (2023). Evaluating a shrinkage estimator for the treatment effect in clinical trials. Statistics in Medicine, 43(5), 855–868.

Wilding, J. P. H., Batterham, R. L., Calanna, S., Davies, M., Van Gaal, L. F., Lingvay, I., McGowan, B. M., Rosenstock, J., Tran, M. T. D., Wadden, T. A., Wharton, S., Yokote, K., Zeuthen, N., & Kushner, R. F., for the STEP1 Study Group. (2021). Once-weekly semaglutide in adults with overweight or obesity. New England Journal of Medicine, 384(11), 989–1002.

Appendix


R code for generating the survival data used in Figure 1A:

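# Install and Load Necessary Packages

# install.packages("survival") # Run once if the package is not installed

library(survival) # Provides Surv() and coxph()
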
# Set Parameters for Simulation

n <- 1000 # Total number of patients

hr <- 0.6 # Hazard ratio for treatment B relative to A

median_survival_A <- 365 # Median survival time for group A, e.g., 365 days

# Generate Survival Times and Treatment Groups

set.seed(123) # Setting seed for reproducibility

survival_times_A <- rexp(n/2, rate = log(2) / median_survival_A)

survival_times_B <- survival_times_A / hr # Adjusting for hazard ratio

# Create a Data Frame

treatment <- rep(c("A", "B"), each = n/2)

times <- c(survival_times_A, survival_times_B)

status <- rep(1, n) # Assuming all events are deaths

data <- data.frame(treatment, times, status)

# Perform Survival Analysis

fit <- coxph(Surv(times, status) ~ treatment, data = data)


R code for generating the survival data used in Figure 1B:

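# Install and Load Necessary Packages

# install.packages("survival") # Run once if the package is not installed

library(survival) # Provides Surv() and coxph()
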
# Set Parameters for Simulation

n <- 1000 # Total number of patients

hr <- 0.95 # Hazard scaling factor: the rate for treatment 2 is set to lambda / hr below, about 1.05 times the rate for treatment 1

lambda <- 0.1 # Baseline hazard rate

# Generate Survival Times and Treatment Groups

set.seed(123) # Setting seed for reproducibility

group <- sample(c(1, 2), n, replace = TRUE) # Randomly assign treatments

rate <- ifelse(group == 1, lambda, lambda / hr)

survival_times <- rexp(n, rate)

# Create a Data Frame

status <- sample(0:1, n, replace = TRUE) # Randomly generate event status (0 for censored, 1 for event)

data <- data.frame(patient_id = 1:n, treatment = group, time = survival_times, status = status)

# Perform Survival Analysis

fit <- coxph(Surv(time, status) ~ treatment, data = data)


R code for generating the survival data used in Figure 1C:

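# Install and Load Necessary Packages

# install.packages("survival") # Run once if the package is not installed

library(survival) # Provides Surv() and coxph()
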
# Set Parameters for Simulation

n <- 100 # Total number of patients

lambda <- 0.1 # Baseline hazard rate

# Generate Treatment Groups

set.seed(123) # Seed for reproducibility

group <- sample(c(1, 2), n, replace = TRUE) # Treatment assignment

# Introduce random noise to the survival times

noise_factor <- runif(n, min = 0.1, max = 10)

survival_times <- rexp(n, rate = lambda) * noise_factor

# Generate event status

status <- sample(0:1, n, replace = TRUE)

# Create a Data Frame

data <- data.frame(patient_id = 1:n, treatment = group, time = survival_times, status = status)

# Perform Survival Analysis

fit <- coxph(Surv(time, status) ~ as.factor(treatment), data = data)


©2024 Pavlos Msaouel. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
