Editor-in-Chief’s Note: Sir David Cox was a towering figure in the field of statistics and in science more broadly. This column showcases Cox's deep care for students' learning as he provides detailed answers to their questions in a course on "Reading Cox," which I co-taught with my colleague Joe Blitzstein in 2012. His responses, annotated by Joe, provide a testament to his exceptional knowledge and insights, as well as his kind and nurturing approach to teaching. Sir David Cox's legacy lives on as a beautiful mind with a beautiful heart, inspiring future generations of scholars to come.
In the fall semester of 2012, just over a decade ago, Xiao-Li Meng and I taught a seminar course called “Reading Cox.” Not surprisingly, the course focused on reading the works of Sir David R. Cox. The students (and instructors) enjoyed reading and discussing a small sample of Cox’s oeuvre, which spanned a wide range of topics, including survival analysis, partial likelihood, partially Bayes inference, spatial modeling, transformations, and approximations. We also read the book Principles of Applied Statistics (Cox & Donnelly, 2011), which is layered with meaning and packed with valuable wisdom for statistical novices and experts alike.
We invited Cox himself to visit at the end of the semester. He initially accepted, but for medical reasons it turned out that he was unable to make the trip from the United Kingdom. Of course, he owed us nothing in return, and indeed he had already provided the entire set of materials for the course! But he graciously offered to answer a list of questions from the students. We sent him a list of questions, thinking he would choose just a few to answer, but instead he sent back a six-page, single-spaced document brimming with insights. In this piece, we include this document from Cox, with some annotations for additional context. As a preface to his replies, he graciously wrote: “Here at least is a set of preliminary replies. I thought the questions thoughtful and searching and am very grateful for having had them.”
Cox’s text is presented here verbatim, as it may be of historical interest (and it would not be fitting to edit his words without his approval). For the various references that Cox mentioned, we added citations (to our best estimates of which articles he was referring to). The questions we sent to Cox are indicated by Question. Cox’s responses are included in block quotation, preceded by Cox for clarity. Our annotations for a response come in regular text after the response, labeled with Note (JKB).
Cox: This is neither a theorem nor a universal truth but, I think, a broadly correct guideline. Consider two examples.
Competing risks. We have individuals who are at risk of two types of event. The times to these are represented by two continuous random variables U, V. We observe for each individual T = min(U, V) and D = 1 if U ≤ V, D = 0 otherwise. Given only iid observations on (T, D) we might be interested in possible dependence between U and V; for example U and V might be time to death from cancer and that from heart disease (obviously very oversimplified). In a nonparametric version it may be shown that any conceivable data can be fitted exactly by a representation with U and V independent. If we assumed bivariate log normal, maximum likelihood will allow estimation of corr(log U, log V). But unless we had either a very solid piece of background theory supporting the log normal or external data doing so would we really want to do anything like that. It is saying that the only possible explanation of discrepancy from the independent log normal model is correlation. Of course sensitivity analysis to other departures might help. Papers appear in the literature from time to time claiming that such and such a parametric or semiparametric assumption does resolve the issue.

Informative non-response. Key papers are by Heckman (1979) and by Diggle and Gratton (1984). The simplest example concerns linear regression. For n independent individuals constants x_1, …, x_n are observed and random variables Y_1, …, Y_n are independently normally distributed with E(Y_j) = α + βx_j and constant variance. Given the Y_j, the indicators R_j are binary variables with P(R_j = 1) depending on Y_j. If R_j = 1 the corresponding Y_j is observed and otherwise not. Under these parametric assumptions, estimation of β is possible, although the estimation problem is quite delicate (Robins et al., 1994), but its success hinges crucially on the parametric assumptions. On the other hand if the problem does have a sensible nonparametric solution it may be better to proceed parametrically in that the assumptions then made represent a smoothness in the real world that is helpful for interpretation and data summary. There may also be gains of efficiency but these may be less important.
The essential question: if the parametric assumption resolves the nonidentifiability issue how is this achieved?
Note (JKB): The class was especially intrigued to learn more about Cox’s principles for deciding whether to take a parametric approach or a nonparametric approach to a particular problem. Many books cover only one of these two broad approaches, or introduce both but without much guidance to practitioners on which to choose, or suggest that the choice should be based only on the sample size, rather than emphasizing the assumptions that go into the methods and interpretations that can come out of the methods.
The quotation we asked about stimulated a lot of discussion in class, not only since it is pithy and unlike the advice any of us had seen elsewhere but also since it could be misinterpreted as saying a nonparametric approach is never better than a parametric approach (the first clause suggests improving a nonparametric method to a parametric method, while in the scenario of the second clause neither is satisfactory). Cox’s reply clarifies the fundamental importance of identifiability, sensitivity analysis, and interpretability in choosing between parametric and nonparametric methods.
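To make the second of Cox’s examples concrete, here is a minimal simulation sketch. It is entirely our own illustration rather than Cox’s or Heckman’s formulation: it assumes a probit selection mechanism acting on the response itself, with made-up parameter values, and shows how a fully parametric likelihood can recover the regression slope when a naive complete-case fit cannot.

```python
# Minimal sketch (ours, not Cox's or Heckman's exact formulation) of informative
# non-response: the response Y is missing with probability depending on Y itself,
# through an assumed probit selection mechanism with made-up parameter values.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n = 2000
alpha, beta, sigma = 1.0, 2.0, 1.0   # true regression parameters (assumed)
g0, g1 = 0.5, -1.0                   # true selection parameters (assumed)

x = rng.uniform(-1.0, 1.0, n)
y = alpha + beta * x + sigma * rng.standard_normal(n)
r = rng.random(n) < norm.cdf(g0 + g1 * y)   # y is observed only when r is True

def neg_loglik(theta):
    a, b, log_s, c0, c1 = theta
    s = np.exp(log_s)
    mu = a + b * x
    # observed cases: density of y times probability of being observed given y
    ll_obs = norm.logpdf(y[r], mu[r], s) + norm.logcdf(c0 + c1 * y[r])
    # missing cases: marginal probability of non-observation given x, using
    # E[Phi(c0 + c1*Y)] = Phi((c0 + c1*mu) / sqrt(1 + c1^2 * s^2)) for Y ~ N(mu, s^2)
    p_obs = norm.cdf((c0 + c1 * mu[~r]) / np.sqrt(1.0 + (c1 * s) ** 2))
    p_obs = np.clip(p_obs, 1e-12, 1 - 1e-12)   # numerical safeguard
    return -(ll_obs.sum() + np.log1p(-p_obs).sum())

fit = minimize(neg_loglik, x0=np.zeros(5), method="BFGS")
beta_mle = fit.x[1]
beta_cc = np.polyfit(x[r], y[r], 1)[0]   # naive complete-case least squares slope

print(f"true beta = {beta}, parametric MLE = {beta_mle:.2f}, complete-case = {beta_cc:.2f}")
```

As Cox emphasizes, the success of such an estimate hinges on the assumed selection form; the sketch shows what is bought by the parametric assumption, not that the assumption is credible in any given application.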
Cox: For very specific questions for which a limited number of answers are required quickly and securely, often simulation will for most people be the safest and sensible answer. For understanding, analytical approaches will typically be preferable, but of course require skills not so easily acquired (as does simulation in more delicate situations). I’m influenced personally with a lack of computational fluency but also by the fact that like many UK statisticians of my age my education was largely in techniques of mathematical physics. As for such impressive programs as BUGS, note that they have the disadvantage, which probably could be remedied, of being black boxes. The path-way between the data and the conclusion is obscure; it often helps understanding of complicated analyses to see pathways of dependence between the data and the answer.
Note (JKB): Markov chain Monte Carlo (MCMC) has become much more efficient both to run and to implement since this question was posed to Cox in 2012; for example, the first release of the probabilistic programming language and platform Stan (Carpenter et al., 2017) was also in 2012. It is getting easier and easier to tackle problems via simulation, making it more and more relevant to think about the advantages and disadvantages of simulation compared with analytical approaches.
Cox points out that simulation is very powerful for getting quick answers to specific questions, but that an analytical approach is often better for gaining a general understanding of a phenomenon. He may have had in mind a contrast between, say, providing 50 pages of simulation results (under various possible distributions for the data and values for the parameters) vs. a beautiful, interpretable, general result such as the asymptotic Normality of the maximum likelihood estimator (MLE).
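As a tiny illustration of that contrast, the sketch below (our own toy example with arbitrary numbers) compares the simulated sampling distribution of the MLE of an Exponential rate with the single analytical statement that it is approximately Normal with standard deviation λ/√n.

```python
# Simulation versus a simple analytical result (our toy example): the MLE of an
# Exponential rate is 1 / (sample mean) and is asymptotically Normal(lambda, lambda^2 / n).
import numpy as np

rng = np.random.default_rng(1)
lam, n, reps = 2.0, 50, 10_000

# simulate the sampling distribution of the MLE
mles = 1.0 / rng.exponential(scale=1.0 / lam, size=(reps, n)).mean(axis=1)

print(f"simulated mean of MLE = {mles.mean():.3f}   (lambda = {lam})")
print(f"simulated sd of MLE   = {mles.std():.3f}   (asymptotic sd = {lam / np.sqrt(n):.3f})")
```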
The last sentence of Cox’s reply goes far beyond the question of simulation vs. mathematical approximation. Situations where the “path-way between the data and the conclusion is obscure” are becoming increasingly prevalent, for example, with the advent of large language models such as GPT (generative pre-trained transformer), LaMDA (Language Model for Dialogue Applications), and LLaMA (Large Language Model Meta AI). We can only speculate (or ask ChatGPT to speculate) about what thoughts Cox would have had about large language models.
Cox: This raises very important issues.
We should look for stability of interpretation.
A synthesis similar to that of Box’s and mine may be helpful; see Aranda-Ordaz’s two papers (Aranda-Ordaz, 1981) for example. See Cox (1961); Atkinson (1970).
In studies of dependence on a linear predictor, ratios of coefficients (even in studentized form) tend to be quite stable. (The ratio of the regression coefficient on x_1 to that on x_2 tends to be stable as outcome is transformed nonlinearly; it measures roughly what change in x_2 induces the same change as unit change in x_1. There are quite a few papers on this.)

Representations should be consistent with known constraints, for example if necessary on physical grounds forced to pass through the origin even if the data give no information on that aspect. Where appropriate they should satisfy the requirements of dimensional analysis.
Even if models consistent with simple subject-matter theoretical models do not give better fits than totally empirical models they may give more stable interpretations.
Note (JKB): This question was motivated by having seen too many instances of analyses with no consideration given to checking the model or to assessing whether the results are robust to model misspecification. This comes up both in frequentist approaches (with the issue of sensitivity of the results to whatever modeling assumptions are made) and in Bayesian approaches (with the additional issue of sensitivity to the choice of prior).
For example, a Bayesian analysis may be conducted with a particular choice of prior, with little or no assessment of how sensitive the conclusions are to that choice.
A full answer to that kind of question is clearly beyond the scope of a reply to an email, and indeed leads to various open problems. Stability has emerged as a core principle in data science (see, e.g., Yu & Kumbier, 2020), so it is helpful to have guidance on how to assess and attain it. Cox’s comments highlight some key considerations when thinking about stability of interpretation, such as the importance of the choice of transformation and parameterization.
Cox’s response also emphasizes the idea that the statistical model and any relevant domain scientific model should go hand-in-hand; this is also a recurring theme in the Cox and Donnelly (2011) book. For example, dimensional analysis is ubiquitous in physics but, unfortunately, tends to be ignored in statistics even when the data are physical measurements with units (e.g., kilograms). See Shen and Lin (2019) and Lee et al. (2020) for further discussion of the importance of dimensional analysis in statistics.
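Cox’s remark about ratios of regression coefficients can be seen in a small numerical experiment. The sketch below is our own illustration under assumed parameter values: the individual least-squares coefficients change substantially as the outcome is transformed nonlinearly, while their ratio stays roughly stable.

```python
# Our illustration of the stability of coefficient ratios under nonlinear
# transformation of the outcome; all parameter values are made up.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
x1, x2 = rng.standard_normal(n), rng.standard_normal(n)
y = np.exp(1.0 + 0.6 * x1 + 0.3 * x2 + 0.2 * rng.standard_normal(n))

def slopes(response):
    # least-squares coefficients on x1 and x2 (intercept dropped)
    X = np.column_stack([np.ones(n), x1, x2])
    return np.linalg.lstsq(X, response, rcond=None)[0][1:]

for name, resp in [("y", y), ("log y", np.log(y)), ("sqrt y", np.sqrt(y))]:
    b1, b2 = slopes(resp)
    print(f"{name:7s} b1 = {b1:7.3f}  b2 = {b2:7.3f}  ratio b1/b2 = {b1 / b2:.2f}")
```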
Cox: I’m the wrong person to ask this because I’m at the age where there is a tendency to overestimate what was known in the remote past. Thus design of experiments was at the core of the subject by 1940, although of course there was substantial development later arising from the process industries in particular. Bartlett discussed shrinkage in the late 1930s to 1940s and Yates in the 1930s gave principled missing value formulae for important simple cases, limited by what was computationally feasible. Now of course there have been massive and very important developments, especially on the missing value theme, since.
There are surely big issues that need general discussion now. Just a few possibilities are possibly:
- Different ways of approaching problems aimed at approaching secure interpretations (as contrasted with apparently accurate predictions)
- More generally strategic as contrasted with tactical solutions
- Devices that will prevent complex methods being black boxes
Note (JKB): The class could not resist the opportunity to ask Cox for some suggestions on open areas that would be ripe for important investigation. Cox modestly claimed to be the wrong person to ask due to his age, even though he had such a broad and deep vision of the field. About ‘black boxes,’ Cox and Donnelly write that “Black-box methods, in which data are fed into a complex computer program which emits answers, may be unavoidable but should be subject to informal checks” (2011, p. 185). As a chess player, I was pleasantly surprised to see Cox’s mention of strategic and tactical solutions. The distinction between strategic (which has to do with big picture, long-term planning) and tactical (which has to do with short-term actions undertaken to achieve a goal) is omnipresent in chess, but I had not seen it discussed in statistics.
Cox: To answer a very specific question simulation is likely in most cases to be the quickest and most secure way to an answer. To get a broad understanding of a field maybe not! Also in infectious disease epidemiology, for example, one can build computer models of impressive realism but then, in the situations I tend to come across, one needs numerical values to large numbers of unknown parameters about which there is very little solid information.
In principle one good approach is to develop a theoretical model leading, for example, to a good low-intensity response and then to express the outcome of simulations as a ratio (simulated to simple theory) in the hope of a smooth and understandable output approaching one in limiting situations. There is always too the issue of what do the investigators involved actually know especially mathematically. It must be a relative advantage of computationally-based skills that their mastery does not demand a very elaborate technical background. (I’m not being dismissive about such fields.)
Note (JKB): Here Cox gives more thoughts about theory vs. simulation, as well as simplicity vs. complexity. His comment on infectious disease modeling in epidemiology is interesting to revisit in the context of COVID-19 models.
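A toy version of the ratio-to-simple-theory device Cox describes, using an example of our own choosing (the expected maximum of n Exponential variables against its large-n approximation log n + γ), might look like the sketch below; the ratio approaches one as the approximation becomes accurate.

```python
# Our toy version of presenting simulation output as a ratio to simple theory:
# expected maximum of n Exponential(1) variables versus log(n) + gamma.
import numpy as np

rng = np.random.default_rng(3)
euler_gamma = 0.5772156649

for n in [5, 20, 100, 1000]:
    simulated = rng.exponential(size=(5_000, n)).max(axis=1).mean()
    theory = np.log(n) + euler_gamma    # large-n approximation of the exact value H_n
    print(f"n = {n:5d}   simulated / theory = {simulated / theory:.3f}")
```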
Cox: Well: not necessarily must, but done informally out of force of circumstances mostly. If the information comes on a broadly comparable scale (for example log odds or kg per sq m) then synthesis will often be a good idea but if to take the recent studies of bovine TB that I’ve been involved in, the information in many different forms comes from lab pathogenesis studies, from randomized field trials, from historical routine data, from genetic studies, from case-control studies of farm behaviour and from mathematical models of infectious disease epi and from wild-life ecology, how can putting them into say a single posterior be at all insightful?
If by coherence you mean the studies should be more like one another (but I don’t think you do mean this!) then no: but some way of linking different parts of a topic as part of design then yes, very much so. Recall Fisher’s aphorism that to make causality more achievable from observational studies, make your theories elaborate. My understanding is that by elaborate he meant wide-ranging, not complicated. There is a fine paper of 1959 by Cornfield and others reproduced recently (Cornfield et al., 1959): it is referred to in Donnelly’s and my book.
While perhaps a Bayesian synthesis is often involved in an ideal solution note that the Ramsey-de Finetti-Savage formulation is a theory of personal decision making not of presentation of evidence for public discussion action or interpretation. This makes that theory interesting (surely personal issues and considerations do have some role) but very much does not make it a suitable base for main-stream statistics; I am of course not excluding Bayesian arguments, only that particular base for them. Again as in Cornfield et al’s paper it may be best to see separate sources of information summarized separately and critically compared.
Note (JKB): Something the class found striking while reading Cox’s work was that Cox often recommended informal deliberation and diagnostics rather than unnecessary mathematical formalism (despite, or perhaps because of, Cox’s powerful mathematical abilities). For example, he might recommend an informal graphical method for assessing an assumption that data follow an Exponential distribution, rather than a formal test of the hypothesis that the data are Exponential. In Table 2 of Cox (1978), Cox gives ten (!) graphical methods for assessing the fit of an Exponential.
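In that spirit, one of the simplest informal graphical checks of an Exponential assumption is a probability plot against Exponential quantiles. The sketch below is our own minimal version with simulated placeholder data, not code from Cox (1978): an approximately straight line through the origin is consistent with an Exponential fit.

```python
# Our minimal version of an informal graphical check of an Exponential
# assumption: plot the ordered data against Exponential(1) quantiles.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(4)
data = rng.exponential(scale=2.0, size=200)   # placeholder data

n = len(data)
theoretical = -np.log(1.0 - (np.arange(1, n + 1) - 0.5) / n)   # Exp(1) quantiles

plt.scatter(theoretical, np.sort(data), s=10)
plt.xlabel("Exponential(1) quantiles")
plt.ylabel("ordered data")
plt.title("A roughly straight line through the origin suggests an Exponential fit")
plt.show()
```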
The problem of how to synthesize disparate data sources is fascinating and fundamental, as illustrated by Cox’s bovine tuberculosis example. Meng (2014) spotlights the importance of this kind of problem, coining the term multi-source inference. There is also increasing attention to such problems in the causal inference literature; see, e.g., Athey et al. (2020) and Yang and Ding (2020).
Cox: Big questions. First in so far as nonparametric is concerned with smoothing, one should tend to undersmooth rather than do theoretically optimal smoothing because it is always easy to smooth a bit more (by eye for instance) and not easy to unsmooth. Next there is a traditional part of nonpar that was solely about cautious testing of very null hypotheses, making usually very strong independence assumptions. I’m not too keen on that.
In any case the critical assumptions about tests etc are more likely to be ones of independence affecting the solidity of estimates of error. Over parametric distributions, so long as the parametric form is used to express and summarize a general smoothness in a feature not of detailed interest I prefer a parametric to a semi-parametric formulation; the popularity of the semiparametric PH model is surprising. A key point is that assessing the broad relative importance of different explanatory variables is pretty independent of the detailed model used. Note though that even in the initial paper on the subject the possibility of time-dependent effects was covered. If such effects are intrinsically important then they should be looked for and choice of formulation becomes important.
In terms of summarizing what has been learned from some data, it seems often more helpful to say that some distribution is approximately normal, log normal or exponential or whatever rather than to say it is arbitrary (or specified by a smoothed pdf) provided that the normality etc is not critical to interpretation.
Models serve various roles and one is to be the basis implicitly of presentation of conclusions and that is where nonparametric is least likely to be effective.
Note (JKB): As is evident from this response, Cox was not overly invested in the survival analysis model that he himself pioneered (the Cox proportional hazards model, which he modestly calls the “semiparametric PH model”). Nor does he dogmatically suggest that a semiparametric approach here is inherently better than a parametric approach due to not requiring parametric assumptions about the baseline hazard function. Instead, Cox points out some subtle, underappreciated considerations about stability, interpretability, and the roles of models.
Sometimes nonparametric methods are described as ‘assumption-free,’ when in fact there may be a strong assumption, such as that the data are independent and identically distributed! Cox emphasizes the importance of independence assumptions, rather than letting the focus in a discussion of assumptions be only about whether a parametric model is used. Implicitly, this then calls for a nuanced approach to choosing between parametric, semiparametric, and nonparametric methods, with detailed consideration of the tradeoffs.
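Cox’s advice to undersmooth rather than aim for theoretically optimal smoothing can be visualized with a small kernel density example. The sketch below is our own, with arbitrary bandwidth factors: the undersmoothed estimate is noisy but preserves the two modes, and can always be smoothed further by eye, whereas the oversmoothed estimate has already erased structure that cannot be recovered.

```python
# Our toy illustration of undersmoothing versus oversmoothing a kernel density
# estimate of bimodal data; the bandwidth factors are arbitrary.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
grid = np.linspace(-5, 5, 400)

for factor, label in [(0.3, "undersmoothed (modes visible)"),
                      (3.0, "oversmoothed (modes lost)")]:
    kde = gaussian_kde(data, bw_method=factor * len(data) ** (-1 / 5))
    plt.plot(grid, kde(grid), label=label)
plt.legend()
plt.show()
```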
We thank Xiao-Li Meng for co-teaching the Reading Cox course and for various useful suggestions, and the students in Reading Cox for their excellent questions and participation. Most importantly, of course, we thank Sir David R. Cox for his towering contributions to statistics and generosity in writing such thoughtful replies to the long list of questions we sent him when he was age 88.
Joseph K. Blitzstein has no financial or non-financial disclosures to share for this article.
Aranda-Ordaz, F. J. (1981). On two families of transformations to additivity for binary response data. Biometrika, 68(2), 357–363. https://doi.org/10.1093/biomet/68.2.357
Athey, S., Chetty, R., & Imbens, G. (2020). Combining experimental and observational data to estimate treatment effects on long term outcomes. ArXiv. https://doi.org/10.48550/arXiv.2006.09676
Atkinson, A. C. (1970). A method for discriminating between models. Journal of the Royal Statistical Society: Series B (Methodological), 32(3), 323–345. https://doi.org/10.1111/j.2517-6161.1970.tb00845.x
Box, G. E., & Cox, D. R. (1964). An analysis of transformations. Journal of the Royal Statistical Society: Series B (Methodological), 26(2), 211–243. https://doi.org/10.1111/j.2517-6161.1964.tb00553.x
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1). https://doi.org/10.18637/jss.v076.i01
Cornfield, J., Haenszel, W., Hammond, E. C., Lilienfeld, A. M., Shimkin, M. B., & Wynder, E. L. (1959). Smoking and lung cancer: Recent evidence and a discussion of some questions. Journal of the National Cancer Institute, 22(1), 173–203. https://doi.org/10.1093/ije/dyp289
Cox, D. R. (1961). Tests of separate families of hypotheses. Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 4(1), 105–123.
Cox, D. R. (1978). Some remarks on the role in statistics of graphical methods. Journal of the Royal Statistical Society: Series C (Applied Statistics), 27(1), 4–9. https://doi.org/10.2307/2346220
Cox, D. R. (2005a). Selected statistical papers of Sir David Cox: Volume 1: Design of investigations, statistical methods and applications (D. J. Hand & A. M. Herzberg, Eds.). Cambridge University Press.
Cox, D. R. (2005b). Selected statistical papers of Sir David Cox: Volume 2: Foundations of statistical inference, theoretical statistics, time series and stochastic processes (D. J. Hand & A. M. Herzberg, Eds.). Cambridge University Press.
Cox, D. R., & Donnelly, C. (2011). Principles of applied statistics. Cambridge University Press.
Diggle, P. J., & Gratton, R. J. (1984). Monte Carlo methods of inference for implicit statistical models. Journal of the Royal Statistical Society: Series B (Methodological), 46(2), 193–212. https://doi.org/10.1111/j.2517-6161.1984.tb01290.x
Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153–161. https://doi.org/10.3386/w0172
Lee, T. Y., Zidek, J. V., & Heckman, N. (2020). Dimensional analysis in statistical modelling. ArXiv. https://doi.org/10.48550/arXiv.2002.11259
Meng, X.-L. (2014). A trio of inference problems that could win you a Nobel Prize in statistics (if you help fund it). In X. Lin, C. Genest, D. L. Banks, G. Molenberghs, D. W. Scott, & J.-L. Wang (Eds.), Past, present, and future of statistical science (pp. 537–562). CRC Press. https://doi.org/10.1201/b16720-52
Robins, J. M., Rotnitzky, A., & Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427), 846–866.
Shen, W., & Lin, D. K. (2019). Statistical theories for dimensional analysis. Statistica Sinica, 29(2), 527–550.
Yang, S., & Ding, P. (2020). Combining multiple observational data sources to estimate causal effects. Journal of the American Statistical Association, 115(531), 1540–1554.
Yu, B., & Kumbier, K. (2020). Veridical data science. Proceedings of the National Academy of Sciences, 117(8), 3920–3929.
©2023 Joseph K. Blitzstein. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.