Mathematical aspects of polling originate from statistics, a branch of mathematics that permeates the natural and social sciences. Inference from large population surveys—the subject of the article “A New Paradigm for Polling” by Michael Bailey (2023)—raises profound, cross-cutting questions in multivariate statistics that have long been familiar in the natural sciences. In our own research field of particle and nuclear physics, various techniques rely on multidimensional sampling, both in experiments and in theory. A typical particle experiment counts the elementary particles produced in scattering events, thereby sampling the scattering probabilities in the form of the data recorded by the detector. The extraordinary sensitivity of such experiments to physical processes at the tiniest accessible distances comes at the high cost of extreme complexity of the detecting equipment and data analysis. Numerous objective and experimental parameters influence the reconstruction of the true particle scattering probabilities from the recorded data.
Many experiments in our field scatter composite particles, such as the beams of fast-moving protons and atomic nuclei in the Large Hadron Collider in Switzerland. To understand the fundamental forces of nature, particle physicists strive to measure the probabilities of scattering of truly elementary particles, such as the gluons and quarks found inside the incoming protons. Among many insights, these probabilities reveal properties of Higgs bosons and of the strong nuclear forces responsible for the formation of stable matter in the Universe. On the theory side, the distributions of quarks and gluons in the protons depend on many theoretical parameters that must be learned from the scattering data. Such information on the proton structure can be extracted from big data samples by performing global analyses within a settled theory. Extraction of models of the proton structure is naturally viewed as an inverse problem in a high-dimensional parameter space; uncertainty quantification for these models can be phrased in the language of sampling, which must be representative to arrive at reliable estimates (Courtoy et al., 2023). The quality of the samples across the various settings of the analyses directly relates to the data defect correlation, the central factor of Meng’s trio identity (Meng, 2018). In contrast to other applications, the true model of the proton structure is not yet available, and ultimately the data defect correlation (here specific to the probabilities describing the proton) can only be guessed. Nevertheless, for typical applications, the uncertainty associated with the data defect can be estimated by sampling over a sufficiently large ensemble of functional forms for the proton structure.
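Meng’s trio identity factorizes the error of a sample mean into three factors: the data defect correlation, a data quantity factor, and the problem difficulty. The following minimal sketch illustrates the identity on a synthetic population with a hypothetical biased response mechanism (our own illustrative numbers, not proton-structure data):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000
G = rng.normal(size=N)                       # values of an observable over the population
# Biased "response": units with larger G are more likely to be sampled
R = (rng.random(N) < 0.25 + 0.10 * (G > 0)).astype(float)
n = int(R.sum())

sample_mean = (R * G).sum() / n
error = sample_mean - G.mean()               # sample mean minus population mean

# The three factors of the trio identity (finite-population moments, ddof=0)
rho = np.corrcoef(R, G)[0, 1]                # data defect correlation
quantity = np.sqrt((N - n) / n)              # data quantity (dropout odds)
difficulty = G.std()                         # problem difficulty

# The identity holds exactly for the finite population
assert np.isclose(error, rho * quantity * difficulty)
```

Because the response probability depends on the observable itself, the data defect correlation is nonzero and the sample mean is biased, exactly as the identity predicts.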
The rationale for this approach is to reduce the number of effective dimensions in multidimensional sampling, following the general recommendations for quasi-Monte Carlo integration of multivariate distributions, which also employ the trio identity (Hickernell, 2018). A related question is the efficient estimation of the multiple nuisance parameters contributed by the experimental data in such an analysis, as briefly mentioned above. Representative sampling of all relevant experimental and theoretical factors entering the uncertainties on the proton structure demands sophisticated statistical techniques, many of which have already been developed.
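As a brief illustration of why a low effective dimension matters for quasi-Monte Carlo sampling, the sketch below integrates a nominally 8-dimensional function whose variation lives in only two coordinates, comparing a scrambled Sobol’ sequence (via `scipy.stats.qmc`, assumed available) with plain Monte Carlo; the test function and sample sizes are our own choices:

```python
import numpy as np
from scipy.stats import qmc

d = 8                                        # nominal dimension
# f varies only in the first two coordinates: effective dimension 2
f = lambda X: np.prod(1.0 + 0.5 * (X[:, :2] - 0.5), axis=1)
exact = 1.0                                  # each factor integrates to 1 on [0, 1]

n = 2**12                                    # power of 2, as Sobol' sampling prefers
mc_est = f(np.random.default_rng(1).random((n, d))).mean()
qmc_est = f(qmc.Sobol(d, scramble=True, seed=1).random(n)).mean()

# Both estimates converge to 1; the low-discrepancy sequence typically
# converges much faster when the effective dimension is small
```

For functions of this kind, the quasi-Monte Carlo error decays nearly as fast as if the integral were genuinely two-dimensional, which is the practical payoff of reducing effective dimensions.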
In his article, Michael Bailey addresses challenges in polling due to unrepresentative sampling of large populations—a problem closely akin to the physics issue we have just described. We read this article with interest as a pedagogical illustration of a fundamental statistical phenomenon that arises in many fields, not just sociology. Bailey observes that the conventional paradigm relying on random sampling has been jeopardized by the widespread problem of nonresponse in large-scale polls. To address the problem, the article advocates a new practice that employs the data defect correlation as a universal indicator of the responsiveness of the population, together with an improved data quantity factor. Bailey’s narrative reflects the difficulty of finding what we call the effective dimensions in topics that involve multifactorial thinking bodies with backgrounds and agendas. This idea is expressed through the concept of non-ignorable nonresponse.
In our studies of the proton structure, we have interpreted the bias in the sampling as coming from ad hoc choices that restrict the space of possible solutions. This echoes the rationale described by Bailey when introducing non-ignorable nonresponse. The new paradigm for polling resonates with the general view in physics that observations often stand in a complicated relation to the underlying phenomena; this relation is quantified by response (measurement) functions of varying complexity. The data defect correlation captures this relation for the expectation value of a given observable. When this is not sufficient, full response functions are employed.
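A common form of such a response function is detector smearing: the observed spectrum is the true spectrum convolved with a resolution function. The sketch below shows this with a Gaussian resolution kernel on a synthetic spectrum (all numbers are our own illustrative choices):

```python
import numpy as np

# True spectrum: a narrow peak over a smooth background (synthetic)
x = np.linspace(0.0, 10.0, 1_001)
true = np.exp(-0.5 * ((x - 5.0) / 0.1)**2) + 0.1

# Detector response: Gaussian smearing with resolution sigma = 0.4,
# sampled on the same grid spacing as the spectrum
kernel_x = np.linspace(-2.0, 2.0, 401)
kernel = np.exp(-0.5 * (kernel_x / 0.4)**2)
kernel /= kernel.sum()                       # normalize to preserve total counts

observed = np.convolve(true, kernel, mode="same")

# Smearing broadens the peak and lowers its height; reconstructing the
# true spectrum from the observed one is the corresponding inverse problem
```

The peak in `observed` is visibly wider and lower than in `true`; recovering `true` from `observed` and the kernel is precisely the kind of inverse problem discussed above.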
Physicists have long accepted that the accuracy of a measurement generally depends on the values of the observable and on confounding factors. In many experimental studies in which all contributing factors can be accessed, well-known procedures are applied to evaluate this relation. For example, the energy of an elementary particle reported by the detecting equipment is generally a mathematical function of several parameters of the measurement, including the particle’s energy itself. To account for the response function, physicists estimate the ‘acceptance’ and ‘efficiency’ of the measurement using either a simulation of their experiment or another, calibrating measurement. This can be compared to the ellipses of Bailey’s article and book (in press), for example, in Figure 4. When the vector of factors is known, a transformation matrix can rotate a tilted ellipse to a version that shows no trace of bias, that is, one for which the sample average equals the population average. Such transformations are multidimensional and may require adequate techniques for their optimization, from simulation of the effects at the experimental level, mentioned above, to the robust reconstruction of the high-dimensional ellipse (Anwar et al., 2019).
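A minimal two-dimensional sketch of such a rotation (on synthetic data, not the ellipsoidal fitting of Anwar et al., 2019): the eigenvectors of the sample covariance matrix supply the rotation that aligns a tilted ellipse with the coordinate axes, removing the correlation between the two factors.

```python
import numpy as np

rng = np.random.default_rng(2)
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])                 # tilted ellipse: correlated factors
X = rng.multivariate_normal([0.0, 0.0], cov, size=50_000)

S = np.cov(X, rowvar=False)                  # sample covariance matrix
eigvals, Q = np.linalg.eigh(S)               # columns of Q: principal axes
Y = X @ Q                                    # rotate the sample onto those axes

# After the rotation the covariance is diagonal: no residual tilt
assert abs(np.cov(Y, rowvar=False)[0, 1]) < 1e-8
```

In a realistic analysis the same eigendecomposition is carried out in many dimensions, which is where robust reconstruction of the high-dimensional ellipse becomes nontrivial.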
On the other hand, just as in the nonresponse analyses, there exist physics problems in which only a part of the full parameter space is accessible to sampling. The parameter space may be truncated by theoretically motivated constraints (which in physics we would call first principles, such as symmetry considerations) or by the design of the experimental apparatus. Whatever the underlying motivation, such constraints may enter as priors in a Bayesian framework, as integration limits in sampling, or as penalty-like terms in the probability density function. The role of the data defect correlation then consists in quantifying the unfolding of the reduced space into the full parameter space. Technically, such unfolding realizes a diffeomorphism, a mathematical operation that can nowadays be implemented using machine learning techniques such as normalizing flows. The data defect correlation summarizes the cumulative effect of the unfolding technique on the observable of interest. Perhaps through applying such detailed techniques we will arrive at the paradise lost.
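The essence of a normalizing flow is the change-of-variables formula for probability densities under a diffeomorphism. A one-dimensional sketch, with a simple affine map standing in for a trained flow (the names `phi` and `p_Y` are ours, not from any flow library):

```python
import numpy as np

# Standard normal density, the "base" distribution of the flow
phi = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)

# The simplest diffeomorphism: an affine map y = a*x + b
a, b = 2.0, 1.0
f_inv = lambda y: (y - b) / a

# Change of variables: p_Y(y) = p_X(f_inv(y)) * |d f_inv / dy|
p_Y = lambda y: phi(f_inv(y)) / abs(a)

# The pushed-forward density remains normalized and is centered at b
y, dy = np.linspace(-15.0, 17.0, 20_001, retstep=True)
total = (p_Y(y) * dy).sum()
mean = (y * p_Y(y) * dy).sum()
```

A trained flow composes many such invertible maps with learned parameters and tracks the Jacobian factor of each one; the principle, however, is exactly this formula.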
The world was all before them, where to choose
Their place of rest, and Providence their guide;
They, hand in hand, with wandering steps and slow,
Through Eden took their solitary way.
The authors’ research in particle physics is supported in part by CONACyT (Ciencia de Frontera 2019, No. 51244) and DGAPA-PAPIIT IN111222, and by the U.S. Department of Energy under Grant No. DE-SC0010129.
Anwar, R., Hamilton, M., Nadolsky, P. (2019). Direct ellipsoidal fitting of discrete multi-dimensional data. ArXiv. https://doi.org/10.48550/arXiv.1901.05511
Bailey, M. A. (2023). A new paradigm for polling. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.9898eede
Bailey, M. A. (in press). Polling at a crossroads: Rethinking modern survey research. Cambridge University Press. https://www.cambridge.org/core/books/polling-at-a-crossroads/796BAA4A248EA3B11F2B8CAA1CD9E079
Courtoy, A., Huston, J., Nadolsky, P., Xie, K., Yan, M., & Yuan, C.-P. (2023). Parton distributions need representative sampling. Physical Review D, 107(3), Article 034008. https://doi.org/10.1103/PhysRevD.107.034008
Hamilton, M. (2020). Direct ellipsoidal fitting of discrete multi-dimensional data. SMU Journal of Undergraduate Research, 5(1), Article 4. https://doi.org/10.25172/jour5.1.4
Hickernell, F. J. (2018). The trio identity for quasi-Monte Carlo error. In A. B. Owen & P. W. Glynn (Eds.), Monte Carlo and quasi-Monte Carlo methods (MCQMC 2016), Springer Proceedings in Mathematics & Statistics (Vol. 241, pp. 3–27). Springer International Publishing. https://doi.org/10.1007/978-3-319-91436-7_1
Meng, X.-L. (2018). Statistical paradises and paradoxes in big data (I): Law of large populations, big data paradox, and the 2016 US presidential election. The Annals of Applied Statistics, 12(2), 685–726. https://doi.org/10.1214/18-AOAS1161SF
©2023 Aurore Courtoy and Pavel Nadolsky. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.