Issue 4.4 / Fall 2022
It’s lovely to have the chance to share some thoughts about another vibrant issue of HDSR. I’m struck, especially, by the way in which this collection of articles illustrates the intellectual breadth of data science. The article “The Use of Data Science in a National Statistical Office,” by Sevgui Erman, Eric Rancourt, Yanick Beaucage, and Andre Loranger (2022), gives a detailed and timely look at how Statistics Canada, Canada’s national statistical office, is embracing new tools, including machine learning, in transforming the data and methods they use and the way they work. Beyond describing illustrative use cases, for example multi-agent reinforcement learning to simulate COVID-19 mitigation strategies and deep learning to understand agriculture from satellite imagery, it is encouraging to also see the attention to ethics through their “framework for responsible ML processes” and its four tenets of respect for people, respect for data, sound application, and sound methods.
We also feature a fascinating contribution, “Psephological Correlated Simulation Techniques With Decision Desk HQ: For the 2022 Midterms and Beyond,” by Sydney Louit, Mukul Ram, Kiel Williams, Alex Alduncin, Patrick McCaul, and Scott Tranter (2022) of Decision Desk HQ (DDHQ), a company that provides real-time election data and forecasts the outcomes of elections. The authors describe a region-based model that they have used in the U.S. to model the correlation between election results. Their main claim is that the statistical model is parsimonious enough to support fast, even real-time, inference, yet flexible and coherent enough that DDHQ has been the first media source to project the winners of each of the last two U.S. presidential elections.
Turning to our stepping stones articles, “Data Science in Public Health: Building Next Generation Capacity,” by Nicholas Mirin, Heather Mattie, Latifa Jackson, Zainab Samad, and Rumi Chunara (2022), provides a thorough review of data science education at schools of public health, developing a framework to guide schools that are undergoing curriculum review or establishing new programs. The thoughtful framework advocates for (1) integrating social and behavioral sciences into the curriculum, in support of ethical and political considerations, (2) offering a broad data science curriculum, for example offering courses on data communication and fair data use along with the more typical AI and machine learning offerings, (3) supporting lifelong learning, and (4) diversifying learning opportunities, for example through mentorship, online modules, and short workshops.
Continuing in a line of HDSR articles about data science initiatives in K–12 education (see Franklin & Bargagliotti, 2020, and Martinez & LaLonde, 2020), the article “Transforming Curriculum and Building Capacity in K–12 Data Science Education,” by Travis Weiland and Christopher Engledowl (2022), makes a compelling case for a large commitment of funding and effort to build the capacity needed to deliver on the vision of the updated framework for pre-college statistics and data science education in the U.S. (Bargagliotti et al., 2020). Beyond putting forward six recommendations for curricular transformation, Weiland and Engledowl (2022) also advance five recommendations for building capacity, and strike a note of urgency, writing “there is a dire need to prepare and support mathematics teachers in teaching concepts and practices from statistics and data science and at large scale.” The authors emphasize that the issue of preparing and supporting teachers is exacerbated by the dearth of teacher educators with the right expertise, and call on those of us in higher education to create programs that can prepare educational researchers and teacher educators with the right backgrounds.
There are also two methodological papers in this issue, each offering its own kind of synthesis within the field of machine learning. The first, “On Learnability Under General Stochastic Processes,” by Philip Dawid and Ambuj Tewari (2022), provides a synthesis of theory, showing that the statistical learnability of general stochastic processes, that is, processes that need not be independent and identically distributed, is equivalent to online learnability (surprising, given the latter’s non-probabilistic foundations). The second, “Toward a ‘Standard Model’ of Machine Learning,” by Zhiting Hu and Eric P. Xing (2022), provides a synthesis of methods through a “Standard Equation” (SE) for the objective function of a learning problem, whether this be supervised, unsupervised, or reinforcement learning. The SE comprises an “experience term” to formalize the data available to a learner, a “divergence term” to measure the fitness of a learned model, and an “uncertainty term” that acts as a regularizer for complexity. The authors instantiate the framework in a myriad of different settings and also chart a path toward a “universal solver” that can work under each of the different specifications.
Turning to our columns, in “Moving Forward With the U.S. Census Bureau’s Annual Population Estimates Post-2020,” which appears in Effective Policy Learning, Victoria Velkoff and Christine Hartley (2022) of the U.S. Census Bureau speak to the challenges, both operational and statistical, in conducting the annual Population Estimates Program in the aftermath of the COVID-19 pandemic. In the Mining the Past column, Arunabh Ghosh (2022) writes in “The Mean-ness of Statistics” about the arguments made in the People’s Republic of China at least through the 1980s, in regard to the “privileging of formal mathematics at the expense of context” and especially in regard to the use of mean statistics. The article talks about Ronald Fisher’s legacy not in the sense of his controversial and repugnant views on population and race, but rather a different legacy: that of his reputation in China. Lastly, Edwin Agnew, Lily Zhu, Sam Wiseman, and Cynthia Rudin (2022) contribute “Can a Computer Really Write Poetry?” to Recreations in Randomness. I hadn’t realized that the great Alan Turing had referenced the challenge of computer-generated poetry in his seminal 1950 article until I read this piece. While large language models such as GPT-3 (Brown et al., 2020) are leading to rapid advances in artificial poetry generation, the quality remains relatively poor when compared with other tasks such as question-answering, and Agnew et al. advocate for interpretable models where a user can interact in an easy and transparent loop with a model.
In closing, I will also highlight the article that I was fortunate enough to be able to contribute, together with my colleagues Francesca Dominici and Elizabeth Langdon-Gray, on the occasion of the fifth anniversary of Harvard’s Data Science Initiative (see Dominici et al., 2022). The article was fun to write, and it is good to see it in print. I hope you will enjoy reading about our experiences and aspirations for running such an initiative.
David Parkes is currently on sabbatical at DeepMind for the 2022–2023 academic year.
Agnew, E., Zhu, L., Wiseman, S., & Rudin, C. (2022). Can a computer really write poetry? Harvard Data Science Review, 4(4). https://doi.org/10.1162/99608f92.3c7b563e
Bargagliotti, A., Franklin, C., Arnold, P., Gould, R., Johnson, S., Perez, L., & Spangler, D. (2020). Pre-K-12 Guidelines for Assessment and Instruction in Statistics Education (GAISE) report II. American Statistical Association and National Council of Teachers of Mathematics. https://www.amstat.org/docs/default-source/amstat-documents/gaiseiiprek-12_full.pdf
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, A., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., . . . Amodei, D. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901. https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
Dawid, P., & Tewari, A. (2022). On learnability under general stochastic processes. Harvard Data Science Review, 4(4). https://doi.org/10.1162/99608f92.dec7d780
Dominici, F., Langdon-Gray, E., & Parkes, D. C. (2022). Spinning up a data science initiative at Harvard. Harvard Data Science Review, 4(4). https://doi.org/10.1162/99608f92.ad105ec8
Erman, S., Rancourt, E., Beaucage, Y., & Loranger, A. (2022). The use of data science in a national statistical office. Harvard Data Science Review, 4(4). https://doi.org/10.1162/99608f92.13e1d60e
Franklin, C., & Bargagliotti, A. (2020). Introducing GAISE II: A guideline for precollege statistics and data science education. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.246107bb
Ghosh, A. (2022). The mean-ness of statistics. Harvard Data Science Review, 4(4). https://doi.org/10.1162/99608f92.3e79086d
Hu, Z., & Xing, E. P. (2022). Toward a “standard model” of machine learning. Harvard Data Science Review, 4(4). https://doi.org/10.1162/99608f92.1d34757b
Louit, S., Ram, M., Williams, K., Alduncin, A., McCaul, P., & Tranter, S. (2022). Psephological correlated simulation techniques with Decision Desk HQ: For the 2022 midterms and beyond. Harvard Data Science Review, 4(4). https://doi.org/10.1162/99608f92.e5a9a4b0
Martinez, W., & LaLonde, D. (2020). Data science for everyone starts in kindergarten: Strategies and initiatives from the American Statistical Association. Harvard Data Science Review, 2(3). https://doi.org/10.1162/99608f92.7a9f2f4d
Mirin, N., Mattie, H., Jackson, L., Samad, Z., & Chunara, R. (2022). Data science in public health: Building next generation capacity. Harvard Data Science Review, 4(4). https://doi.org/10.1162/99608f92.18da72db
Turing, A. M. (1950). Computing machinery and intelligence. Mind, LIX(236), 433–460. https://doi.org/10.1093/mind/LIX.236.433
Velkoff, V., & Hartley, C. (2022). Moving forward with the U.S. Census Bureau’s annual population estimates post-2020. Harvard Data Science Review, 4(4). https://doi.org/10.1162/99608f92.4ba61ca4
Weiland, T., & Engledowl, C. (2022). Transforming curriculum and building capacity in K–12 data science education. Harvard Data Science Review, 4(4). https://doi.org/10.1162/99608f92.7fea779a
©2022 David C. Parkes. This editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the editorial.