I am enjoying reading Issue 4.2 of Harvard Data Science Review early on a Sunday morning, sitting at a local coffee shop in Cambridge. I love to reflect on data science as I smell and drink espresso. This issue is a bit different from the others. When we think about data science, we often think about data engineering, complex data analytics, machine learning, AI, big data, and complex computations. In this issue we have some articles that instead attempt to tackle some of the fundamental philosophical challenges of data science, such as 1) What should data visualization reveal? 2) What is the value of science? 3) How do we maintain data privacy without a loss of information? and 4) How do we adapt to this new normal of online versus in-person teaching?
In the Panorama section of this issue, we have an article by Lauren Klein entitled “What Data Visualization Reveals: Elizabeth Palmer Peabody and the Work of Knowledge Production.” Who was Elizabeth Palmer Peabody? She was born May 16, 1804, in Billerica, Massachusetts, in the United States, and she died January 3, 1894, in Jamaica Plain (now part of Boston), Massachusetts. She was an American educator who opened the first English-language kindergarten in the United States. This essay offers Peabody's chronological charts as early examples of how data visualization can reveal a range of forms of knowledge. When reading this article, I kept thinking about the famous quote by John Tukey: “The greatest value of a picture is when it forces us to notice what we never expected to see.” I think Peabody was a precursor of the field of data visualization. Her charts challenge the goals of data visualization and encourage sustained reflection and imaginative response. In this Panorama article, the author prompts us to think about the following questions: What does data visualization reveal? Should it reveal something other than, or in addition to, what is directly captured by the data? Should it show what is missing from the data set, or what cannot be captured by data alone?
Following the Panorama, we have a special theme on the Value of Science with four guest editors: Frauke Kreuter, Julia Lane, Brian Kim, and Allison Nunez. The guest editors wrote an introduction to this special theme, so my remarks will be brief. They wrote that this special theme “reflects the results of a conference intended to showcase new data, products, and use resulting from recent data investments in science policy.” They argue (and I fully agree!) that much can be learned about evidence-based decision making when we increase the quality of the data. As I was reading this special theme, I was reminded of the words of Richard Feynman. Richard Feynman (1918–1988) was a maverick American Nobel Prize-winning theoretical physicist who was famous for his brilliant wit and effervescent personality. He is best known for his work in quantum electrodynamics and particle physics, and is particularly famed for Feynman diagrams. Feynman says:
“When a scientist doesn’t know the answer to a problem, he is ignorant. When he has a hunch as to what the result is, he is uncertain. And when he is pretty darn sure of what the result is going to be, he is in some doubt. We have found it of paramount importance that in order to progress we must recognize the ignorance and leave room for doubt. Scientific knowledge is a body of statements of varying degrees of certainty—some most unsure, some nearly sure, none absolutely certain.”
Reading this quote, I cannot resist reflecting that, to me, the Value of Science is to reduce doubt by accumulating rigorous evidence on a particular issue.
In this special theme, we start with an interview with Paul Romer (the 2018 Nobel Laureate in Economics) by Julia Lane. We then have five articles: 1) “Demonstrating the Value of Government Investment in Science: Developing a Data Framework to Improve Science Policy” by Tobin L. Smith; 2) “A Linked Data Mosaic for Policy-Relevant Research on Science and Innovation: Value, Transparency, Rigor, and Community” by Chang et al.; 3) “Building on Aotearoa New Zealand’s Integrated Data Infrastructure” by Jones et al.; 4) “Data on How Science Is Made Can Make Science Better” by Sourati et al.; and 5) “Data Inventories for the Modern Age? Using Data Science to Open Government Data” by Lane et al. Reading these articles made me realize how critically important this field is and how many opportunities we have to use data science to quantify the value of, and therefore the trust in, science.
In Cornucopia, we have an insightful article entitled “Data Quality in Electronic Health Record Research: An Approach for Validation and Quantitative Bias Analysis for Imperfectly Ascertained Health Outcomes Via Diagnostic Codes.” This paper discusses the challenges of measurement error, or misclassification, of the outcome, which in turn has the potential to bias the results. Goldstein et al. review the concepts of sensitivity, specificity, and positive and negative predictive values, and they demonstrate how imperfect classification of the outcome variable can indeed bias the conclusions. In this modern age of digital data, the consideration of these issues is critically important. Often we talk about AI and its ability to provide new knowledge when applied to electronic health record (EHR) data. Well, if we are training AI with data that are affected by error, instead of creating new intelligence and knowledge, we will produce new stupidity, which will lead to wrong conclusions.
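To make these four quantities concrete, here is a minimal Python sketch. It is my own illustration and not necessarily the procedure used by Goldstein et al. The validation counts are hypothetical, and the correction step uses the standard Rogan-Gladen formula for a misclassified binary outcome, which I am assuming is a reasonable stand-in for a simple quantitative bias analysis.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Validation metrics for a diagnostic code compared with a gold standard."""
    return {
        "sensitivity": tp / (tp + fn),  # true cases that carry the code
        "specificity": tn / (tn + fp),  # non-cases that lack the code
        "ppv": tp / (tp + fp),          # coded patients who are true cases
        "npv": tn / (tn + fn),          # uncoded patients who are non-cases
    }


def observed_prevalence(true_prev, sensitivity, specificity):
    """Prevalence we would measure if we trusted the diagnostic code."""
    return sensitivity * true_prev + (1 - specificity) * (1 - true_prev)


def corrected_prevalence(obs_prev, sensitivity, specificity):
    """Rogan-Gladen correction: recover true prevalence from the observed one."""
    return (obs_prev + specificity - 1) / (sensitivity + specificity - 1)


# Hypothetical validation sample of 1,000 patients, 100 of whom are true cases.
metrics = diagnostic_metrics(tp=80, fp=45, fn=20, tn=855)
true_prev = 100 / 1000
obs_prev = observed_prevalence(
    true_prev, metrics["sensitivity"], metrics["specificity"]
)
fixed_prev = corrected_prevalence(
    obs_prev, metrics["sensitivity"], metrics["specificity"]
)

print(metrics)
print(f"true={true_prev:.3f} observed={obs_prev:.3f} corrected={fixed_prev:.3f}")
```

In this made-up sample the code has 80% sensitivity and 95% specificity, yet it overstates a 10% prevalence as 12.5%. Even a fairly accurate diagnostic code can shift the conclusions in this way unless the error is accounted for.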
Under Stepping Stones, we review three case studies that work together to help form the main conclusions of the article entitled “World of EdCraft: Challenges and Opportunities in Synchronous Online Teaching.” The three case studies are in the context of 1) teaching healthcare finances at MIT, 2) teaching Stat 201 at the University of Tennessee, Knoxville, and 3) teaching introduction to operations management at MIT. The last two years of the pandemic have transformed the way we deliver education, and therefore these case studies can provide valuable guidance to many academic institutions.
Under Milestones and Millstones, we have two articles. The first is entitled “Private Prediction Sets.” In this article, Angelopoulos et al. present a framework for protecting individual privacy without losing information when deploying machine learning systems for reliable uncertainty quantification. The second article, “Data Flush” by Shen et al., discusses data perturbation, a technique for generating synthetic data by adding noise to the raw data.
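As a rough illustration of the general idea of data perturbation, here is a minimal Python sketch that adds independent Gaussian noise to each raw value. This is a generic example of my own and not the specific scheme proposed by Shen et al. The function name `perturb`, the noise level, and the simulated raw data are all assumptions made for illustration.

```python
import random
import statistics


def perturb(values, noise_sd, seed=None):
    """Return a noisy copy of `values` (hypothetical helper, Gaussian noise)."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_sd) for v in values]


# Simulated "raw" measurements; real raw data would take their place.
data_rng = random.Random(0)
raw = [data_rng.gauss(50.0, 10.0) for _ in range(1000)]

# Release the perturbed values instead of the raw ones.
synthetic = perturb(raw, noise_sd=2.0, seed=1)

raw_mean = statistics.mean(raw)
synthetic_mean = statistics.mean(synthetic)
print(f"raw mean={raw_mean:.2f} synthetic mean={synthetic_mean:.2f}")
```

Because the noise has mean zero, the average of the perturbed data stays close to the average of the raw data even though no individual value is released unchanged. The variance, however, is inflated by the added noise, which is one reason naive perturbation can distort downstream analyses.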
Finally, we have two important columns. One is Diving Into Data, with an article entitled “Show Us the Data,” in which American statistician Nancy Potok talks about the development of tools for data search and discoverability. In the Minding the Future column, we have an article entitled “Online Education in Times of COVID: Adapting and Deploying a Data Science Program in Mexico.” Here Arredondo-Rodriguez et al. talk about the lessons they learned after launching a new virtual data science program in place of the summer educational programs of previous years, which typically ran face to face. Considering the ‘new normal’ that we are all experiencing due to the COVID-19 pandemic, I expect that this will be very insightful reading for many academic institutions around the world.
Francesca Dominici has no financial or non-financial disclosures to share for this editorial.
©2022 Francesca Dominici. This editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the editorial.