Happy 2023! It is always such a pleasure to share some thoughts for the first HDSR issue of the year.
We start with an interview! I had the great pleasure to chat with Professor Sylvia Richardson about data science and leadership (Richardson & Dominici, 2023). Sylvia has so many accomplishments. To give you a few:
Sylvia Richardson is a Medical Research Council (MRC) investigator at the MRC Biostatistics Unit and has held a research professorship in the University of Cambridge since 2012. Sylvia was also director of the Unit from 2012 to 2021.
Before joining the MRC Biostatistics Unit, she held the Chair of Biostatistics in the Department of Epidemiology and Biostatistics at Imperial College London from 2000 and was formerly Directeur de Recherches at the French National Institute of Health and Medical Research (Inserm), where she held research positions for 20 years. In 2019, Sylvia was awarded Commander of the Most Excellent Order of the British Empire (CBE). Sylvia has also been awarded the Guy Medal in Silver from the Royal Statistical Society and a Royal Society Wolfson Research Merit Award. She is a fellow of the Institute of Mathematical Statistics and of the International Society for Bayesian Analysis.
I have known Sylvia for over 20 years; she has always been an incredible role model for me. Her energy, commitment to the truth, resilience, and critical and constructive thinking have helped me to be a better statistician. When I was in graduate school and during my postdoctoral fellowship, her book Markov Chain Monte Carlo in Practice was my bible! I hope you will enjoy reading our conversation.
Another interesting piece in this issue is the Panorama article by Adel Daoud and Devdatt Dubhashi (2023), which talks about the three cultures of statistical modeling: 1) prediction, 2) causal research, and 3) hybrid modeling culture. The importance of this distinction stems from the explosion of new algorithms in machine learning (ML). Often, when a new class of algorithms (like ML) develops rapidly, we (the data scientists) are eager to join the club and do not pause to think foundationally. The authors correctly point out that the distinction between prediction and causal inference has been blurred by the introduction of ML. Machine learning may have been quickly (too quickly?) repurposed for making inferences about causation without anchoring the thinking to the more foundational concepts in causal inference and statistics. Their article introduces a hybrid modeling culture that nicely distinguishes: 1) ML for causal inference, 2) ML for data acquisition, and 3) ML for theory prediction. For each of these topics, the authors clarify: a) the question, b) the goal, c) the key assumptions, and d) the quantity of interest! I cannot wait to share this article broadly and with my research team, as we struggle to formalize these concepts every day.
I found the article “A Review of Data Valuation Approaches and Building and Scoring a Data Valuation Model” by Mike Fleckenstein, Ali Obaidi, and Nektaria Tryfona (2023) about the value of data interesting and original. I have often argued in the classroom that there is no data science without data, but I had never thought about how to quantify the value of data in terms of its market value as an asset. We often hear that data is becoming the new currency across our economy. It is a clear indication that we, as a society, want a way to value data in concrete terms. But we are not there yet. This article introduces three approaches for data valuation: 1) market-based models, which calculate data’s value in terms of cost and revenue/profit; 2) economic models, which estimate data’s value in terms of economic and public benefit; and 3) dimensional models, which assign value based on data dimensions like data quality and ownership, both data-specific and contextual. Now the question is: how do we reconcile these concepts with the increased pressure for reproducibility and the importance of making data publicly available? Read!
This issue’s Cornucopia article by Michael A. Stoto, John D. Kraemer, and Rachael Piltch-Loeb (2023) introduces better metrics to guide public health policy, with lessons learned from COVID-19 for data system improvement. The authors argue that it is important to report data beyond case counts, such as deaths and hospitalizations, and that the Centers for Disease Control and Prevention must provide leadership to standardize public health, hospital, and other data systems. Taking a systems perspective, their article summarizes what needs to be monitored to guide policy, describes the main sources of data used to construct metrics, and assesses strengths and weaknesses of the proposed data systems. Stoto et al. also show how to make the best of current data systems and identify strategies for improving data systems for future public health emergencies.
In this issue we have three articles focused on data science education. In the first, “Motivating Data Science Students to Participate and Learn,” Deniz Marti and Michael D. Smith (2023) advocate for the importance of teaching concepts of privacy, ethics, and fairness. The authors offer insights into how to structure our data science classes so that they motivate students to deeply engage with material about societal context and lean toward the types of conversations that will produce long-lasting growth in critical thinking skills. They describe a novel assessment tool called participation portfolios to promote student autonomy, self-reflection, and the building of a learning community. They show that their participation portfolio materially improved students' engagement, motivating students to offer authentic opinions and create higher quality discussions.
In the second data science education article, “Data Science Transfer Pathways From Associate’s to Bachelor’s Programs,” Benjamin S. Baumer and Nicholas Jon Horton (2023) provide an excellent overview of opportunities and barriers to data science transfer pathways from associate’s to bachelor’s degree programs. The authors point out that
“While the number of 4-year colleges offering bachelor’s degrees in data science continues to increase, data science instruction at many 2-year colleges lags behind. A major impediment is the relative paucity of introductory data science courses that serve multiple student audiences and can easily transfer.”
They are so right: we should do everything in our power to foster these pathways!
The third education-based article is a Minding the Future column written by Anne Mykland, a current high school student. Mykland (2023) speaks about the role of research in data science education, starting from high school data science. The author argues that the traditional instructive method might cause students to become so caught up in details that they are unable to see the breadth of interesting things that can be achieved through code. My daughter, Enrica, is a high school junior, and I have been worried that the way they teach statistics and computer science in high school could scare her away! I concur with the author that high school students are passionate about solving important problems. Recently I invited Filippo, a high school senior and family friend, to attend one of our lab meetings. He had expressed interest in data science during a family gathering. In this meeting, Falco Bargagli Stoffi, one of my postdocs, presented a research paper in which we are using data science and causal inference to address a critically important question: whether increasing the walking distance between schools and the nearest gun dealers might lower the risk of a school gun incident. I cannot tell you (yet) what we found, as we are still working on this paper, but the point I want to make here is this: after Filippo attended this meeting, he called me and said, “Wow, I had no idea that data science research was so exciting! I thought that all you were doing was boring math and coding!” Of course, I totally agree with Mykland that we cannot discard the traditional approach (sorry, Filippo, you will still need to take traditional statistics and computing classes), but data science is also about hands-on experience. Indeed, the hands-on experience is what allows you to become an effective and independent data scientist!
Last but not least, I hope you will enjoy the article “Statistics and Data Science for Cybersecurity” by Alfred Hero, Soummya Kar, Jose Moura, Joshua Neil, H. Vincent Poor, Melissa Turcotte, and Bowei Xi (2023). This article highlights research questions at the forefront of cybersecurity that lie at the intersection of statistics, machine learning, and information theory. It is a very comprehensive piece, and there will be several additional discussion pieces rolling out in HDSR over the next month to further the conversation. Hero and colleagues correctly argue that data science has a lot to contribute to addressing emerging vulnerabilities in data security. The article discusses the following topics: 1) the emerging role of data science in cybersecurity; 2) opportunities to incorporate data science into cybersecure enterprise systems; 3) the physical layer of cybersecurity and resilient decision-making algorithms for distributed sensor, actuator, and transmission networks characteristic of the emerging Internet of Things (IoT); 4) challenges in making the data collected by decentralized IoT networks less sensitive to attacks; and 5) information-theoretic approaches to cybersecurity, specifically privacy and security, for IoT networks. The article also provides a broader discussion of cross-cutting themes and charts a course for future research directions. It is undeniable that there are many opportunities for statisticians, engineers, and computer scientists to come together to develop the next generation of cybersecurity practices.
Francesca Dominici has no financial or non-financial disclosures to share for this editorial.
Baumer, B. S., & Horton, N. J. (2023). Data science transfer pathways from associate’s to bachelor’s programs. Harvard Data Science Review, 5(1). https://doi.org/10.1162/99608f92.e2720e81
Daoud, A., & Dubhashi, D. (2023). Statistical modeling: The three cultures. Harvard Data Science Review, 5(1). https://doi.org/10.1162/99608f92.89f6fe66
Fleckenstein, M., Obaidi, A., & Tryfona, N. (2023). A review of data valuation approaches and building and scoring a data valuation model. Harvard Data Science Review, 5(1). https://doi.org/10.1162/99608f92.c18db966
Hero, A., Kar, S., Moura, J., Neil, J., Poor, H. V., Turcotte, M., & Xi, B. (2023). Statistics and data science for cybersecurity. Harvard Data Science Review, 5(1). https://doi.org/10.1162/99608f92.a42024d0
Marti, D., & Smith, M. D. (2023). Motivating data science students to participate and learn. Harvard Data Science Review, 5(1). https://doi.org/10.1162/99608f92.d3b2eadd
Mykland, A. (2023). The role of research in data science education. Harvard Data Science Review, 5(1). https://doi.org/10.1162/99608f92.06ed24c0
Richardson, S., & Dominici, F. (2023). A conversation with Sylvia Richardson. Harvard Data Science Review, 5(1). https://doi.org/10.1162/99608f92.6c87bfc0
Stoto, M. A., Kraemer, J. D., & Piltch-Loeb, R. (2023). Better metrics to guide public health policy: Lessons learned from COVID-19 for data systems improvement. Harvard Data Science Review, 5(1). https://doi.org/10.1162/99608f92.3e516c04
©2023 Francesca Dominici. This editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the editorial.