Column Editor’s Note: For decades, statisticians and scientific practitioners have written about the problems of using a strict cutoff threshold, such as having a p-value less than .05, for declaring ‘statistical significance’ of research results. The term ‘statistical significance’ is itself often misunderstood, as are p-values. Recent years have seen a renewed effort among some in the statistics community to bring these questions to the forefront and to propose new approaches for inference and for communicating statistical results. In this issue's Minding the Future column, Catherine Case of the University of Georgia discusses the changes she has implemented in her introductory statistics courses to help students move beyond ‘bright-line, p < .05’ thinking. Teachers of AP Statistics and college introductory statistics courses: Have you likewise tried to introduce your students to this current discussion in the literature? If so, consider writing a column about your experiences!
Keywords: statistics education; introductory statistics; K-12 education; significance; p-values
If you have taken an introductory statistics class in college or an equivalent high school class like AP (Advanced Placement) Statistics, you have likely spent many hours learning how to calculate p-values and make decisions about whether to ‘reject’ or ‘fail to reject’ the null hypothesis. As a statistics instructor who has taught at both the college and high school levels, I have spent many hours teaching students these skills. In particular I have reminded my AP Statistics students that their conclusion about a hypothesis test would be considered essentially correct if they provide justification based on whether the p-value is less than
Criticisms of the role of p-values and significance are certainly nothing new. For example, Berkson (1942) questioned the dominance of significance testing in statistics education and experimental science more than 80 years ago. Nickerson (2000) exhaustively reviewed decades' worth of criticisms and described commonly held misconceptions, like the belief that a small p-value indicates replicability (or reliability) and a practically important effect. Castro Sotos et al. (2007) catalogued misconceptions of statistical inference, specifically in the population of university students. However, in 2019, The American Statistician published a landmark special issue calling on the profession to discard the dichotomous view of ‘statistical significance’ altogether and replace it with a more thoughtful approach to learning from data in light of uncertainty. In the opening editorial of the issue, Wasserstein et al. (2019) acknowledge, “Statistics education will require major changes at all levels to move to a post ‘p < 0.05’ world” (p. 8).
In this column, I will share an approach I have used to implement those changes in my introductory statistics classes. Of course, decisions about what is feasible in a particular class are highly context-dependent, but given that the curriculum of intro stats is notoriously crowded, my goal was to reframe interpretation of statistical evidence in the existing curriculum without adding much new content. I believe it is possible to move beyond p < .05 in introductory statistics, even working within the constraints of a defined curriculum (like AP Statistics or classes coordinated across multiple instructors).
The approach I describe in this column is based on backward design (Wiggins & McTighe, 1998)—a method of curriculum planning with three main stages: identifying desired results, determining acceptable evidence, and planning learning experiences and instruction.
When I set out to improve my classes, I started by reviewing the editorial of the special issue of The American Statistician, translating goals set forth by reformers into outcomes that are achievable in an introductory statistics class. The full list of outcomes (provided in the Appendix) includes 18 outcomes divided into five categories. Here I will quickly summarize each category:
P-values. I am in full agreement with Wasserstein et al. (2019, p. 2) when they say that “a reasonable prerequisite for reporting any p-value is the ability to interpret it appropriately.” I hope students will leave my class with an understanding of what a p-value measures and what it does not. Although we are moving away from dichotomous interpretations in class, I do want my students to understand what others mean when they use the term statistically significant and to recognize common misunderstandings associated with the term. However, other instructors may choose to avoid the term entirely, perhaps replacing it with something like statistically discernible (Witmer, 2023).
Errors and power. In scenarios where dichotomous decisions are necessary, I want students to make an informed choice of significance level based on understanding of Type I and Type II errors and their consequences. We also discuss factors that affect Type I and Type II error rates and appropriate strategies to increase statistical power.
Effect sizes. Depending on the course, I may or may not introduce formal measures like Cohen’s d, but students can still learn to use basic summary statistics as measures of effect size. Further, calculating a p-value to test a minimal important effect size (instead of the null hypothesis) is a reasonable extension in an introductory statistics class.
The research process. Students learn that classifying results as ‘significant’ or ‘not significant’ can lead to selective reporting and distortions in the research literature. We discuss the potential benefits of approaches like public preregistration of methods and results-blind reviewing. We also discuss the benefits of open science practices and replication.
Note that this is not a comprehensive set of outcomes for an introductory statistics class. The main goal of this article is to outline an approach to teaching statistical inference that goes beyond dichotomous decision-making. However, the framework of backward design can be useful for curriculum design more generally, especially if you need to consider leaving out some topics to make space for others.
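To make the effect-size outcomes concrete, here is a minimal sketch of the two ideas mentioned above: computing Cohen's d from two samples, and calculating a p-value against a minimal important effect size instead of a null of zero. The function names and all numbers are hypothetical illustrations, not material from my classes, and the p-value uses a normal approximation that assumes a reasonably large sample.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    s_a, s_b = stdev(group_a), stdev(group_b)
    pooled_sd = sqrt(((n_a - 1) * s_a**2 + (n_b - 1) * s_b**2) / (n_a + n_b - 2))
    return (mean(group_a) - mean(group_b)) / pooled_sd


def p_value_vs_minimal_effect(diff, se, minimal_effect=0.0):
    """One-sided p-value for H0: true difference <= minimal_effect.

    Uses a normal approximation, so it is only a rough guide for small samples.
    """
    z = (diff - minimal_effect) / se
    return 1 - NormalDist().cdf(z)


# Hypothetical summary: observed difference of 3.0 units, standard error 1.0.
p_against_zero = p_value_vs_minimal_effect(3.0, 1.0, minimal_effect=0.0)
p_against_mie = p_value_vs_minimal_effect(3.0, 1.0, minimal_effect=2.0)
print(f"p-value vs. no effect:               {p_against_zero:.4f}")
print(f"p-value vs. minimal important effect: {p_against_mie:.4f}")
```

The same data can give strong evidence against "no effect at all" and only weak evidence that the effect exceeds a threshold that matters in practice, which is the distinction I want students to notice.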
Changes to the outcomes necessitate changes in assessment. Consider a few assessment items I have written for my classes that require a more nuanced understanding of inference. These include open-ended questions to assess students’ statistical communication as well as questions that can be graded more quickly.
Question 1 shares the results of a real study (Boers, Afzali, Newton, & Conrod, 2019) that have been reported to be ‘statistically significant.’ The multiselect question requires students to recognize and avoid common misunderstandings of this term.
A recent study published in JAMA Pediatrics found an association between screen time (time spent on social media and/or watching television) and depressive symptoms in adolescents. This conclusion was based on an observational study of nearly 4,000 adolescents.
The association between screen time and depressive symptoms was reported to be ‘statistically significant.’ Which of the following is an appropriate interpretation of this term? Select all that apply.
The hypothesis test for the association between screen time and depressive symptoms resulted in a small p-value.
The association observed in this sample would have been unlikely to occur by chance if there were really no association between screen time and depressive symptoms.
We have strong evidence that increasing screen time causes an increase in depressive symptoms.
The association between screen time and depressive symptoms is strong enough to be practically important.
Question 2, part (a) is a standard intro stat question that asks students to distinguish between Type I and Type II error. However, the question goes further in part (b) by asking students to make an informed choice of a significance level based on the consequences of those errors. In part (c), students also have an opportunity to explain why/how to supplement a p-value with a measure of effect size.
The manager of a fast food restaurant is planning a study to determine whether scheduling an additional employee reduces average wait time for customers in the drive-thru.
a. Because of budget concerns, they are worried about the possibility of hiring another employee when it does not actually reduce the average waiting time. In other words, they want to reduce the likelihood of ____ (Type I error / Type II error).
b. Based on the priorities identified above, should the manager use a significance level of
c. Write 1–2 sentences to explain to the manager why they should calculate a measure of effect size in addition to the p-value. Suggest a reasonable measure of effect size for this type of data.
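The tradeoff behind part (b) can be sketched numerically. The example below approximates the power of a one-sided, two-sample z-test for a reduction in mean wait time. The effect size, standard deviation, and sample sizes are hypothetical values chosen for illustration, and the normal approximation with a known common standard deviation is a simplifying assumption.

```python
from math import sqrt
from statistics import NormalDist


def power_one_sided(effect, sd, n_per_group, alpha):
    """Approximate power of a one-sided two-sample z-test.

    effect: true reduction in mean wait time (same units as sd)
    sd: common standard deviation of wait times (assumed known)
    """
    se = sd * sqrt(2 / n_per_group)
    z_crit = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(effect / se - z_crit)


# Hypothetical scenario: a true reduction of 0.5 minutes, SD of 2 minutes.
EFFECT, SD = 0.5, 2.0
for alpha in (0.01, 0.05, 0.10):
    for n in (50, 100, 200):
        power = power_one_sided(EFFECT, SD, n, alpha)
        print(f"alpha={alpha:.2f}  n per group={n:3d}  power={power:.2f}")
```

For a fixed sample size, a smaller significance level lowers the Type I error rate at the cost of lower power (a higher Type II error rate), while a larger sample raises power at any significance level.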
Question 3, part (a) is another standard intro stat question. It asks students to calculate and interpret a confidence interval. However, part (b) goes further by asking students to recognize that the confidence interval itself is an estimate subject to error and to consider the difference between random and nonrandom errors.
A survey question asked, “Roughly how much money do you think you personally will spend on Christmas gifts this year?” and 1,019 randomly selected American adults responded to the survey. The mean response was $885, and the standard deviation was $324.
a. Calculate and interpret a 95% confidence interval for the mean amount Americans plan to spend on Christmas gifts this year.
b. Many people originally chosen for the sample could not be contacted or refused to participate. Does this threaten the validity of the confidence interval?
Yes, this may lead to nonresponse bias if the people who did not respond have different spending plans than those who responded.
Yes, because the confidence level is fairly low. To account for nonresponse bias, you should increase the confidence level to 99%.
No, because the margin of error accounts for nonresponse bias, which allows you to estimate the true population mean.
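For reference, the calculation in part (a) can be sketched as follows, using the summary statistics given in the question. The sketch uses a normal critical value instead of a t critical value, which is an assumption that makes little practical difference with a sample of this size.

```python
from math import sqrt
from statistics import NormalDist

# Summary statistics from the survey question above.
n = 1019
sample_mean = 885.0
sample_sd = 324.0
confidence = 0.95

# Critical value for a two-sided interval (about 1.96 for 95% confidence).
z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
standard_error = sample_sd / sqrt(n)
margin = z_star * standard_error
lower, upper = sample_mean - margin, sample_mean + margin

print(f"standard error: {standard_error:.2f}")
print(f"margin of error: {margin:.2f}")
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The margin of error here reflects only random sampling variability. Nothing in the formula adjusts for nonresponse, which is why the concern raised in part (b) cannot be fixed by changing the confidence level.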
The final step is to plan learning experiences aligned with the target outcomes and revised assessments. For example, I have asked students to read the article “Science Isn’t Broken” from FiveThirtyEight (Aschwanden, 2015). The article includes a link to a web app that invites you to “hack your way to scientific glory” by tweaking terms and other analysis options until you find publishable evidence (p-value < .05) that the U.S. economy is affected by whether Republicans or Democrats are in office. The article itself describes the incentives that encourage p-hacking and the ways in which the scientific method is more complicated than people tend to imagine. I have also asked students to read just the first 10 pages of “Moving to a World Beyond p < 0.05” (Wasserstein et al., 2019). We follow up with a group discussion about why they recommend avoiding the phrase ‘statistically significant,’ what they recommend instead, and the institutional changes that are necessary for moving beyond statistical significance.
The activities described above were designed for the singular purpose of ‘moving beyond p < 0.05,’ but more often, I try to achieve the target outcomes through small modifications to the existing activities in an intro stats class. For example, as students are learning about inference, they can explore scenarios where common misconceptions about ‘significance’ break down. The scatterplot in Figure 1 displays measurements of body mass index and total cholesterol collected from 2,889 participants in the Framingham Heart Study, a long-running longitudinal study that investigates the factors that contribute to cardiovascular disease (https://www.framinghamheartstudy.org/). Because of the large sample size, the p-value for testing the association between BMI and cholesterol is very small (p-value < .0001). However, BMI only explains a small percentage of the variability in cholesterol (R² = 0.017).
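The role of sample size in this example can be checked with a short calculation. The sketch below takes the reported R² and sample size, converts them to the usual test statistic for a correlation, and approximates the two-sided p-value with a normal distribution. The comparison sample size of 50 is a hypothetical value added for contrast, and the normal approximation is rougher for a sample that small.

```python
from math import erfc, sqrt


def correlation_p_value(r_squared, n):
    """Approximate two-sided p-value for testing zero correlation.

    Computes t = r * sqrt(n - 2) / sqrt(1 - r^2) and uses a normal
    approximation to the t distribution (reasonable for large n).
    """
    r = sqrt(r_squared)
    t = r * sqrt(n - 2) / sqrt(1 - r_squared)
    return erfc(t / sqrt(2))  # equals 2 * P(Z > t) for a standard normal Z


# Values reported above: R^2 = 0.017 with 2,889 participants.
p_large_sample = correlation_p_value(0.017, 2889)
# Hypothetical contrast: the same R^2 observed in a sample of only 50.
p_small_sample = correlation_p_value(0.017, 50)

print(f"n = 2889: p-value is approximately {p_large_sample:.2e}")
print(f"n = 50:   p-value is approximately {p_small_sample:.2f}")
```

The strength of the association is identical in both cases. Only the sample size changes, which is why a small p-value by itself says little about practical importance.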
Since I expect students to provide nuanced inferential interpretations as part of their assessments, they need more opportunities to practice their statistical communication and incorporate feedback. In my class, students often work in groups to analyze data and submit their interpretations through a classroom response system. (Specifically, I use Socrative, which is available online at https://www.socrative.com/, but there are plenty of options that allow open-ended responses, unlike old-fashioned clickers.) I select a few interpretations that are incorrect or incomplete and we work as a large group to improve them. Students become familiar with the high expectations that will be applied to their writing while building their statistical communication skills in a low-stakes environment.
To allow students to see the relevance of what they are learning, my classes have always included readings that contain statistical reports from research journals or popular media. I now go a little further by asking students to critique how ‘significant’ results are presented. For example, they may notice the tendency to mention small p-values in the abstract of a paper without mentioning the number of tests conducted. Or they may notice that pop science articles reporting on the latest attention-grabbing study often fail to mention effect size.
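The number of tests matters because the chance of at least one small p-value grows quickly as more tests are run. A minimal sketch, assuming the tests are independent and every null hypothesis is true (a simplification that rarely holds exactly in real papers):

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one Type I error across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    rate = familywise_error_rate(0.05, n_tests)
    print(f"{n_tests:3d} tests at alpha = 0.05: "
          f"P(at least one p < .05) = {rate:.3f}")
```

With 20 such tests, the chance of at least one p-value below .05 is roughly 64%, so a lone small p-value reported without the number of tests is hard to interpret.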
To echo Wasserstein et al. (2019, p. 1), “don’t is not enough.” Just as it is not enough to tell users of statistics what not to do with p-values, it is not enough to tell teachers to stop using the term ‘statistically significant’ in class. In many statistics classes, dichotomous interpretations of p-values are a central part of instruction and assessment, so teaching for a world beyond ‘p < .05’ may represent an intimidating change. However, I do believe that there are steps we can take to avoid establishing a black-and-white view of statistical significance in the next generation of data producers and consumers, even within the constraints of the introductory statistics curriculum.
Personally, I have been working on these changes in my classes over the past four years, and it is still a work in progress. The biggest driver of change has not been a once-per-semester conversation about the issues with significance but a restructuring of in-class data analysis on a daily basis. If the analysis always ends with calculating a p-value and stating a conclusion, it is natural for students to think that is the most important part! Following up with a measure of effect size and discussion of practical implications is a small change with a big impact. Using targeted assessments has taught me not to assume that understanding of the larger statistical process will come automatically. Even if students are proficient with hypothesis tests and confidence intervals, they may not understand the different purposes of each tool in an analysis. Even if they can write down a correct p-value interpretation, they may hold other misleading conceptions at the same time. Transitioning my classes away from dichotomous decision-making has not always been easy, but it is rewarding to see my students develop a more thoughtful and realistic approach to learning from data.
Thank you to my colleagues in the “Teaching Beyond p < 0.05” discussion group at the University of Georgia for challenging me to bring a more nuanced understanding of inference to my students.
The author does not have any relevant funding, grants, sponsorships, or potential conflicts of interest to disclose.
This is not a comprehensive set of outcomes for an introductory statistics class. These outcomes, inspired by the editorial “Moving to a World Beyond ‘p < 0.05’” (Wasserstein et al., 2019), describe a broader approach to statistical inference that goes beyond dichotomous decision-making.
Interpretation of P-values
A. Interpret p-values as continuous probabilities.
Avoid common misinterpretations of p-values. For example, a p-value does not provide the probability that the null hypothesis is true.
B. Use p-values to describe the strength of evidence against a stated hypothesis.
Explain why a small p-value provides strong evidence against a stated hypothesis.
C. Predict how changes to the effect size or sample size will impact the p-value.
D. Understand what others mean when they use the term statistically significant.
Recognize common misunderstandings associated with this term.
Errors and Power
A. Define Type I error, Type II error, and power, and apply these definitions in a given context.
B. In scenarios where binary decisions are necessary, make an informed choice of significance level based on the consequences of Type I and Type II errors.
C. Discuss how conducting multiple tests can increase the risk of Type I errors.
D. Discuss strategies to reduce the risk of Type II errors and increase power.
Describe the relationship between sample size and statistical power.
Given a study design, identify sources of unexplained variation (e.g., imprecise measurements, inadequate controls) that may negatively affect power.
Interpretation of Confidence Intervals
A. Explain why it is important to include a measure of uncertainty with each point estimate.
B. Describe how the width of a confidence interval is related to sample size.
C. Interpret a confidence interval in context, and consider whether the upper and lower limits have different practical implications.
D. Recognize a confidence interval as an estimate subject to error, and distinguish between random and nonrandom errors in a given context.
Effect Sizes
A. Use numerical summaries taught in the introductory course as measures of effect size.
B. Define a meaningful effect size with justification based on context.
C. Calculate a p-value from a test of a prespecified alternative, such as a minimal important effect size.
Understanding the Research Process and Evaluating Published Reports
A. Define the term selective reporting and discuss the potential benefits of public preregistration of methods and results-blind reviewing.
Critique selection bias within a given article. For example, recognize the tendency to focus on subgroups that result in small p-values when overall evidence for a claim is weak.
B. Discuss the benefits of open science practices and explain why one study is rarely enough to establish full understanding.
C. Explain why different p-values in replication studies do not always imply inconsistent results.
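The last outcome above can be illustrated with a small simulation. The sketch below repeats the same hypothetical study ten times, each with an identical true effect, and reports the p-value from each replication. The true mean, standard deviation, sample size, and random seed are arbitrary choices made for illustration, and the p-values use a normal approximation to the one-sample t-test.

```python
import random
from math import erfc, sqrt
from statistics import mean, stdev


def two_sided_p(sample):
    """Two-sided p-value for H0: mean = 0 (normal approximation)."""
    z = mean(sample) / (stdev(sample) / sqrt(len(sample)))
    return erfc(abs(z) / sqrt(2))


# Hypothetical study: true mean 0.3, SD 1.0, 50 observations per replication.
TRUE_MEAN, SD, N = 0.3, 1.0, 50
rng = random.Random(2024)  # fixed seed so the run is reproducible

p_values = [
    two_sided_p([rng.gauss(TRUE_MEAN, SD) for _ in range(N)])
    for _ in range(10)
]

for i, p in enumerate(p_values, start=1):
    print(f"replication {i:2d}: p = {p:.4f}")
```

Every replication samples from the same population, so any spread in the p-values comes from sampling variability alone and not from inconsistent underlying results.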
Aschwanden, C. (2015, August 19). Science isn’t broken. FiveThirtyEight. https://fivethirtyeight.com/features/science-isnt-broken/
Berkson, J. (1942). Tests of significance considered as evidence. Journal of the American Statistical Association, 37(219), 325–335. https://doi.org/10.1080/01621459.1942.10501760
Boers, E., Afzali, M. H., Newton, N., & Conrod, P. (2019). Association of screen time and depression in adolescence. JAMA Pediatrics, 173(9), 853–859. https://doi.org/10.1001/jamapediatrics.2019.1759
Castro Sotos, A. E., Vanhoof, S., Van den Noortgate, W., & Onghena, P. (2007). Students’ misconceptions of statistical inference: A review of empirical evidence from research on statistics education. Educational Research Review, 2(2), 98–113. https://doi.org/10.1016/j.edurev.2007.04.001
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5(2), 241–301. https://doi.org/10.1037/1082-989x.5.2.241
Socrative [online app]. (2023). Showbie, Inc. Retrieved from https://www.socrative.com/
Wasserstein, R., Schirm, A., & Lazar, N. (2019). Moving to a world beyond “p < 0.05.” The American Statistician, 73(sup1), 1–19. https://doi.org/10.1080/00031305.2019.1583913
Wasserstein, R., Schirm, A., & Lazar, N. (2019). Statistical inference in the 21st century: A world beyond p < 0.05. The American Statistician, 73(sup1).
Wiggins, G., & McTighe, J. (1998). Understanding by design. Alexandria, VA: Association for Supervision and Curriculum Development.
Witmer, J. (2023). What should we do differently in STAT 101? Journal of Statistics and Data Science Education. Published online. https://doi.org/10.1080/26939169.2023.2205905
©2024 Catherine Case. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.