Keywords: replication, errors, error control, testing, p-values
I would like to thank Junk and Lyons (2020) for beginning a discussion about replication in high-energy physics (HEP). Junk and Lyons ultimately argue that HEP learned its lessons the hard way through past failures and that other fields could learn from our procedures. They emphasize that experimental collaborations would risk their legacies were they to make a type-1 error in a search for new physics and outline the vigilance taken to avoid one, such as data blinding and a strict threshold.
The discussion, however, ignores an elephant in the room: there are regularly anomalies in searches for new physics that result in substantial scientific activity but don’t replicate with more data. For example, in 2015 ATLAS and CMS showed evidence for a new particle with a mass of about 750 GeV that decayed into two photons (CERN, 2015). Whilst the statistical significance was never greater than $5\sigma$ (Aaboud et al., 2016; Khachatryan et al., 2016), the results motivated about 500 publications about the new particle, and countless special seminars and talks (Garisto, 2016). The effect did not replicate when the experimental teams analyzed a larger dataset about six months later (Aaboud et al., 2017; Khachatryan et al., 2017). Although this was a particularly egregious example, experimental anomalies that garner considerable interest before vanishing are annual events (Garisto, 2020).
We are motivated to attempt to control the type-1 error rate because type-1 errors damage our credibility and lead to us squandering our time and resources on spurious effects. Whilst these non-replications aren’t strictly type-1 errors, as the statistical significance didn’t reach the threshold and no discoveries were announced, we incur similar damaging consequences, so they cannot be ignored. I shall refer to these errors (substantial scientific activity, including publicly doubting the null and speculating about new effects, when the null was in fact true) as type-1′ errors. Whilst type-1 errors appear to be under control in HEP, type-1′ errors are rampant. In the following sections, I discuss these errors in the context of statistical practices at the Large Hadron Collider (LHC).
Searches for new physics at the LHC are performed by comparing a p-value, $p$, against a pre-specified threshold, $\alpha$.
There are two common interpretations of this procedure (Hubbard & Bayarri, 2003):
Error theoretic (Neyman & Pearson, 1933): By rejecting the null if $p < \alpha$ we ensure a long-run type-1 error rate of $\alpha$. The threshold specifies the desired type-1 error rate and the p-value is a means to achieving it.
Evidential (Fisher, 1925): The p-value is a measure of evidence against the null hypothesis. The threshold specifies a desired level of evidence.
Even among adherents of p-values, the latter interpretation is considered unwarranted (Lakens, 2021), and it is almost never accompanied by a theoretical framework or justification, or a discussion of the desired and actual properties of $p$ as a measure of evidence.
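The error-theoretic reading can be checked numerically. The following is a minimal sketch, not the procedure used by any LHC collaboration: it assumes a one-sided test with a standard normal test statistic under the null, and the function names `one_sided_p_value` and `simulate_type1_rate` are illustrative choices. It simulates many null-only experiments and counts how often $p < \alpha$.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: assumed null distribution of the test statistic


def one_sided_p_value(z):
    """p-value for an observed excess of z standard deviations (one-sided)."""
    return STD_NORMAL.cdf(-z)


def simulate_type1_rate(alpha, n_experiments=100_000, seed=1):
    """Fraction of null-only experiments that reject the null at threshold alpha."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_experiments):
        z = rng.gauss(0.0, 1.0)  # test statistic drawn with the null true
        if one_sided_p_value(z) < alpha:
            rejections += 1
    return rejections / n_experiments


if __name__ == "__main__":
    for alpha in (0.05, 0.01):
        print(f"alpha = {alpha}: simulated type-1 rate = {simulate_type1_rate(alpha):.4f}")
    # Conventional HEP significance levels expressed as one-sided p-values
    for z in (3.0, 5.0):
        print(f"{z} sigma -> p = {one_sided_p_value(z):.3e}")
```

Under these assumptions the simulated rejection rate settles near $\alpha$, which is all the error-theoretic interpretation promises. It says nothing about how strongly any single p-value speaks against the null.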
Unfortunately, Junk and Lyons repeatedly and implicitly switch from one interpretation to the other. Indeed, the authors (2020) interpret $p$ as a measure of evidence and $\alpha$ as a threshold in evidence, e.g., justifying $5\sigma$ by “extraordinary claims require extraordinary evidence” and stating that “[$3\sigma$] or greater constitutes ‘evidence’.” We know, however, that $p$, interpreted as a measure of evidence, is incoherent (Schervish, 1996; Wagenmakers, 2007) and usually overstates the evidence against the null (Berger & Sellke, 1987; Sellke et al., 2001). For example, there exists a famous bound (Sellke et al., 2001; Vovk, 1993) implying that under mild assumptions $p = 0.05$ corresponds to about $30\%$ posterior probability of the null. This was in fact the primary criticism in Benjamin et al. (2017). Consequently, one factor in the prevalence of type-1′ errors (substantial activity on effects that later vanish) may be that:
physicists interpret p-values as evidence (as do Junk and Lyons);
based on p-values, physicists overestimate the evidence for new effects;
this results in substantial scientific activity on what turn out to be spurious effects.
Unfortunately, p-values simply can’t give researchers (including Junk and Lyons) what they want, a measure of evidence, leading to wishful and misleading interpretations of $p$ as evidence (Cohen, 1994). This cannot be overcome by better statistical training; it is an inherent deficiency of p-values, and no amount of education about them will imbue them with a coherent evidential meaning.
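To make the overstatement concrete, the bound of Sellke et al. (2001) mentioned earlier can be evaluated directly. For $p < 1/e$ it states that the Bayes factor in favor of the null is at least $-e\,p\ln p$. The sketch below assumes equal prior probabilities for the null and the alternative, which is an illustrative choice rather than part of the bound; the function names are likewise illustrative.

```python
import math


def bayes_factor_bound(p):
    """Lower bound -e * p * ln(p) on the Bayes factor in favor of the null (valid for p < 1/e)."""
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    return -math.e * p * math.log(p)


def min_posterior_null(p, prior_null=0.5):
    """Smallest posterior probability of the null compatible with the bound.

    prior_null = 0.5 (equal prior odds) is an assumption made for illustration.
    """
    prior_odds = prior_null / (1.0 - prior_null)
    posterior_odds = bayes_factor_bound(p) * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    for p in (0.05, 0.005, 1.35e-3, 2.87e-7):
        print(f"p = {p:.3g}: posterior probability of the null >= {min_posterior_null(p):.3g}")
```

With equal prior odds, $p = 0.05$ leaves the null with a posterior probability of at least roughly 0.29, far above what a naive reading of $p$ as evidence suggests.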
Controlling error rates depends critically on knowing the data collection and analysis plan (the intentions of the researchers and what statistical tests would be performed under what circumstances) and adjusting the p-value to reflect that. There are, however, an extraordinary number of tests performed by ATLAS, CMS and LHCb at the LHC and elsewhere. This already makes it challenging to interpret a p-value at all and undoubtedly contributes to the prevalence of anomalies that fail to replicate.
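The effect of performing many tests can be illustrated with a simple calculation. The sketch below assumes the tests are independent, which real searches are not, and the count of 1000 tests is a purely hypothetical number chosen for illustration. It computes the chance that at least one test shows a local excess when the null is true everywhere, and the per-test threshold (a Šidák-style correction) needed to hold a chosen overall error rate.

```python
import math
from statistics import NormalDist


def local_p_value(z):
    """One-sided p-value for a local excess of z standard deviations."""
    return NormalDist().cdf(-z)


def prob_at_least_one_excess(p_local, n_tests):
    """Chance that at least one of n_tests independent null tests has p < p_local."""
    # 1 - (1 - p)^n, written with log1p/expm1 to stay accurate for tiny p
    return -math.expm1(n_tests * math.log1p(-p_local))


def local_threshold_for_global(alpha_global, n_tests):
    """Per-test threshold that keeps the overall error rate at alpha_global."""
    # Solves 1 - (1 - alpha_local)^n = alpha_global for alpha_local
    return -math.expm1(math.log1p(-alpha_global) / n_tests)


if __name__ == "__main__":
    N_TESTS = 1000  # hypothetical number of independent tests
    for z in (3.0, 5.0):
        p = local_p_value(z)
        chance = prob_at_least_one_excess(p, N_TESTS)
        print(f"{z} sigma local: P(at least one excess in {N_TESTS} tests) = {chance:.3g}")
    print(f"local threshold for a global 5% rate: {local_threshold_for_global(0.05, N_TESTS):.3g}")
```

Under these assumptions, a local $3\sigma$ excess somewhere among 1000 tests is more likely than not even when there is no new physics at all, whereas a local $5\sigma$ excess remains rare.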
Junk and Lyons rightly celebrate the trend in HEP to publicly release datasets and tools for analyzing them. This, however, raises the specter of data dredging. Massive public datasets (CERN, 2020) combined with recent developments in machine learning (Kasieczka et al., 2021) could enable dredging at an unprecedented scale. We must think about what precautions we need to prevent misleading inferences being drawn in the future, e.g., pre-registration of planned analyses as a prerequisite for accessing otherwise open data. Other, more radical proposals for addressing these problems, here and elsewhere, include moving away from an error-theoretic approach, or from any approach based on p-values.
Aaboud, M., et al. (2016). Search for resonances in diphoton events at $\sqrt{s} = 13$ TeV with the ATLAS detector. Journal of High Energy Physics, 9. https://doi.org/10.1007/JHEP09(2016)001
Aaboud, M., et al. (2017). Search for new phenomena in high-mass diphoton final states using 37 fb$^{-1}$ of proton–proton collisions collected at $\sqrt{s} = 13$ TeV with the ATLAS detector. Physics Letters B, 775, 105–125. https://doi.org/10.1016/j.physletb.2017.10.039
Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E., et al. (2017). Redefine statistical significance. Nature Human Behaviour, 2, 6–10. https://doi.org/10.1038/s41562-017-0189-z
Berger, J. O., & Sellke, T. (1987). Testing a point null hypothesis: The irreconcilability of p values and evidence. Journal of the American Statistical Association, 82(397), 112–122. https://doi.org/10.1080/01621459.1987.10478397
CERN. (2015). [ATLAS and CMS physics results from Run 2]. https://indico.cern.ch/event/442432/
CERN. (2020). [CERN announces new open data policy in support of open science]. https://home.cern/news/press-release/knowledge-sharing/cern-announces-new-open-data-policy-support-open-science
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49(12), 997–1003. https://doi.org/10.1037/0003-066X.49.12.997
Fisher, R. A. (1925). Statistical methods for research workers. Oliver & Boyd.
Garisto, D. (2020). [The era of anomalies]. Physics, 13, 79. https://physics.aps.org/articles/v13/79
Garisto, R. (2016). Editorial: Theorists React to the CERN 750 GeV Diphoton Data. Physical Review Letters, 116(15). https://doi.org/10.1103/PhysRevLett.116.150001
Hubbard, R., & Bayarri, M. J. (2003). Confusion over measures of evidence (p’s) versus errors (α’s) in classical statistical testing. The American Statistician, 57(3), 171–178. https://doi.org/10.1198/0003130031856
Junk, T. R., & Lyons, L. (2020). Reproducibility and Replication of Experimental Particle Physics Results. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.250f995b
Kasieczka, G., et al. (2021). The LHC Olympics 2020: A Community Challenge for Anomaly Detection in High Energy Physics. http://arxiv.org/abs/2101.08320
Khachatryan, V., et al. (2016). Search for Resonant Production of High-Mass Photon Pairs in Proton-Proton Collisions at $\sqrt{s} = 8$ and 13 TeV. Physical Review Letters, 117(5), 051802. https://doi.org/10.1103/PhysRevLett.117.051802
Khachatryan, V., et al. (2017). Search for high-mass diphoton resonances in proton–proton collisions at 13 TeV and combination with 8 TeV search. Physics Letters B, 767, 147–170. https://doi.org/10.1016/j.physletb.2017.01.027
Lakens, D. (2021). The practical alternative to the p value is the correctly used p value. Perspectives on Psychological Science. https://doi.org/10.1177/1745691620958012
Neyman, J., & Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 231, 289–337. https://doi.org/10.1098/rsta.1933.0009
Schervish, M. J. (1996). P values: What they are and what they are not. The American Statistician, 50(3), 203–206. https://doi.org/10.1080/00031305.1996.10474380
Sellke, T., Bayarri, M. J., & Berger, J. O. (2001). Calibration of p values for testing precise null hypotheses. The American Statistician, 55(1), 62–71. https://doi.org/10.1198/000313001300339950
Vovk, V. G. (1993). A logic of probability, with application to the foundations of statistics. Journal of the Royal Statistical Society: Series B, 55(2), 317–341. https://doi.org/10.1111/j.2517-6161.1993.tb01904.x
Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804. https://doi.org/10.3758/BF03194105
This article is © 2021 by the author(s). The editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the authors identified above.