
From COVID-19 to GPT-4o: The Groundbreaking Quinquennium for Harvard Data Science Review (and Humanity)

Issue 6.3 / Summer 2024
Published on July 31, 2024

Launched on July 2, 2019, Harvard Data Science Review (HDSR) has just celebrated its first quinquennium, a journey of overcoming obstacles and savoring successes. Filled with sweet, sweaty, and sweltering days (and nights), HDSR has published 20 regular and 5 special issues prior to this current issue, featuring over 500 articles in 10 different formats. In 2021, it received the PROSE (Professional and Scholarly Excellence) Award from the Association of American Publishers as the Best New Journal in Science, Technology, and Medicine.

The journey was sparked by a conversation on November 20, 2017, with my colleague Francesca Dominici, a faculty co-director of the Harvard Data Science Initiative (HDSI). At the time, she and her fellow faculty co-director, David Parkes (now the Dean of Harvard’s School of Engineering and Applied Sciences), were conducting consultative discussions throughout the university about building HDSI. When Francesca asked for creative ideas for HDSI, the thought of publishing HDSR emerged naturally, partly because I had previously contemplated creating a Harvard Statistics Review (HSR) when I was chairing the Department of Statistics. Of course, no good deed goes unpunished. In the spring of 2018, Francesca and David informed me that they liked the idea of establishing HDSR, but that I had to be the one to launch it.

Timing is everything, especially in our increasingly dynamic and volatile world. HSR, with statistical certainty, will forever remain a statistics chair’s self-musing, because “everything statistics and statistics for everyone” has been subsumed by “everything data science and data science for everyone,” regardless of how the statistician inside of me embraces or resists this evolution or revolution. Reflecting on the epochal global challenges and opportunities of the past 5 years, none of which were in sight for me back in 2018, I must thank providence for granting me the gumption—and countless hearts and hands—needed to launch HDSR within a single year. The HDSR editorial office was established on July 1, 2018, when I began my term as the Founding Editor-in-Chief. We worked around the clock, sometimes literally, to launch HDSR by July 1, 2019. Readers who have the patience to read through this celebratory and reflective editorial will be regaled with the story of why the launch of HDSR was delayed to July 2, 2019, another providential interference if we bless ourselves with the symbolic significance of that eventful day.

The timing could not have been more fortuitous. Just as HDSR embarked on its ambitious journey to become a global forum for everything data science and data science for everyone, a global pandemic was also in the making. With three warm-up issues under its belt, the editorial board of HDSR worked feverishly to launch the first special issue of HDSR on May 14, 2020: COVID-19: Unprecedented Challenges and Chances, which was completed a year later on May 14, 2021. As I wrote in its editorial, “All COVID-19-related challenges are virtually the same in nature: a massive stress test on a global scale” (Meng, 2020a). These included a grand test of the data science ecosystem. As I discussed in that editorial when introducing the initial articles in the special issue, the data science enterprise, like many other systems, was caught off guard. The pandemic was uncontrollable temporally and spatially, yet we did not (and still do not) have a data science emergency response mechanism, from collecting reliable data to providing trustworthy information. We were in a perfect storm.

Crises compel creativity. As a nano example, the need to communicate wide-ranging perspectives to the public expeditiously, while virtually all experts were under great stress and had little time to write, led HDSR to create panel articles. A panel article builds upon an oral discussion (e.g., a Zoom meeting) but is not confined by its transcript, permitting more careful contemplation and reflection in print with significantly less time demand than writing from scratch. The conversation-style presentation can also engage readers from very diverse backgrounds more effectively, which is especially important for HDSR. To ensure flow, coherence, and overall quality, however, the moderator of the panel does have a significant task to execute. I was therefore deeply grateful to HDSR board member David Banks for organizing two panel articles for the special issue on COVID-19, providing highly pertinent and timely reflections and perspectives on dealing with the pandemic by economists, sociologists, statisticians, and operations researchers (Banks, Albert, et al., 2020) and by biostatisticians and epidemiologists (Banks, Ellenberg, et al., 2020).

History often repeats itself. The launch of ChatGPT and, more broadly, the arrival of generative AI—a host of globally disruptive, groundbreaking technologies—have created another epoch where timely and multi-perspective reflections and scrutiny are essential. To reinforce the adage that no good deed goes unpunished, I reached out to David again. Without any hesitation, David organized two panel articles, again expeditiously and expertly, for which I am deeply grateful (as well as for his contributions to two more articles in this issue, to be introduced shortly).

The timeliness of these discussions is well reflected in the broader panel article, “Data Scientists Discuss AI Risks and Opportunities,” by David Banks, Gerard de Melo, Sam (Xinwei) Gong, Yongchan Kwon, and Cynthia Rudin (2024), especially given the acknowledgment in its abstract that the time between the initial panel discussion (December 2023) and publication could be sufficient to render some discussions out-of-date. The second panel article, “Large Language Models: Trust and Regulation,” by David Banks, Costanza Bosone, Bob Carpenter, Tarak Shah, and Claudia Shi (2024), dives into knotty issues of trusting and regulating large language models.

Amidst the numerous ongoing discussions about generative AI, the panel articles offer only a snapshot of our collective thoughts on the technology’s capabilities and limitations, as well as our reflections on its ethical considerations. This editorial, by connecting the articles in this issue with those from the inaugural issue of HDSR and by sampling its activities in the first 5 years, provides a glimpse into the evolution of data science and AI over the past 5 years.

AI Revolution: Has That Happened Yet?

Guided by the mission of publishing everything data science and data science for everyone, HDSR engages readers via four sections and many columns. The first section is Panorama, featuring overviews, visions, and debates. In addition to the aforementioned two panel articles organized by David Banks, this issue’s Panorama section is led by another panel article, “Amid Advancement, Apprehension, and Ambivalence: AI in the Human Ecosystem” (Berman et al., 2024). It is based on a panel moderated by Francine Berman, with panelists David Banks, Michael Jordan, Sabina Leonelli, and Martha Minow, during the day-one symposium for HDSR’s 5th anniversary celebration, AI and Data Science: Integrating Artificial and Human Ecosystems, which spotlighted the work and perspectives of HDSR’s board members and authors. (The day-two symposium, Vine to Mind, will be covered in future issues.) Berman et al. provide a rich set of contemplations on the power and peril of generative AI by leveraging the multiple disciplines represented on the panel, namely, computer science, law, philosophy, and statistics.

One of the panelists, Michael Jordan, a preeminent computer scientist and statistician, was also the author of the discussion article “Artificial Intelligence—The Revolution Hasn’t Happened Yet” (2019) in the inaugural issue, which has been one of the most viewed articles in HDSR. The term ‘AI revolution’ is now used routinely by the media and in our daily conversations, and few would disagree that generative AI such as GPT-n is a disruptive technology on a global scale. Some future historians of science will therefore likely compare Jordan’s writing then and now as data points for understanding the dynamics of the evolution of the AI era we live in, including how the term AI is being appreciated or understood.

Jordan’s main point in his 2019 article was that we were still far away from having “intelligent” machines, or human-imitative AI, but neither should that be humans’ aspiration. He argued for IA (intelligence augmentation), where data and computation would be used to augment human intelligence and creativity, and for the need to build II (intelligence infrastructure), “whereby a web of computation, data, and physical entities exists that makes human environments more supportive, interesting, and safe.”

Whereas I won’t quench readers’ curiosity to find out Jordan’s current views by summarizing them from the Berman et al. (2024) panel article, I want to mention that he is not the only panelist on this panel who published in the inaugural issue of HDSR. Sabina Leonelli, a leading philosopher of data science, published one of the three articles delineating the data science enterprise in the inaugural issue: Leonelli (2019) on data conceptualization, which conceptually precedes Jeannette Wing’s (2019) article on the data life cycle, which in turn is followed by Christine Borgman’s (2019) article on the “after lives” of data.

Over the last 5 years, I have given dozens of talks where I cited this trio of articles to engage audiences in a discussion on what data science is, or rather, what it is not (e.g., the keynote address at the 2020 ComputeFest [Meng, 2020b]). Leonelli’s (2019) point that “there is no such thing as ‘raw data’” has been an emphasis of my presentations, because one cannot overemphasize the fact that “data are forged and processed through instruments, formats, algorithms, and settings that embody specific theoretical perspectives on the world,” as Leonelli succinctly summarized. Leonelli’s emphasis on considering human influence and impact first is also clear from the outset in her panel discussion. To a series of questions from the moderator about responsible AI, including its meaning and impact, Leonelli’s first response was, “But the primary question, in my view, is who sets the problems that AI algorithms are meant to solve?” (Berman et al., 2024).

The article in this issue’s Bits and Bites column, “Is ChatGPT More Biased Than You?,” by Golnoosh Babaei, David Banks, Costanza Bosone, Paolo Giudici, and Yunhong Shan (2024), provides a timely empirical study that echoes Leonelli’s questioning. The very term ‘bias’ is a human choice, depending on what problems we want the AI algorithms to solve. For those who want the output of a generative AI algorithm to reflect their ideal world or ideology, anything deviating (systematically) from their desiderata is a form of bias, including reality itself. And with enough human intervention, the output of large language models (LLMs) can be “painstakingly politically correct,” as Babaei et al. (2024) reported. However, for those who want to leverage the power of LLMs to understand reality, such as for political polling (Sanders et al., 2023), anything that alters reality is a mechanism for bias that we should at least be aware of.

“Context Changes Everything”

The second article in the inaugural issue trio, Wing (2019), is intimately connected with “Data Science and AI in Context: Summary and Insights” by Alfred Spector (2024), the closing article in the Panorama section of the current issue. The summary in the title refers to the fact that the article was invited to provide an overview of the textbook that Wing co-authored, Data Science in Context: Foundations, Challenges, Opportunities (Spector et al., 2022), and the article provides insights gained from the book and since its publication, especially in view of the rapid evolution of generative AI since the launch of ChatGPT in November 2022. The first excerpt of the textbook included in Spector’s (2024) article starts with “Data science is the study of extracting value from data – value in the form of insights or conclusions.” Wing (2019) opened her article with the same description of data science (without the clause on the form of value), followed by an explanatory sentence: “‘Value’ is subject to the interpretation by the end user and ‘extracting’ represents the work done in all phases of the data life cycle.”

The phases of the data life cycle, as listed in Wing (2019), include generation, collection, processing, storage, management, analysis, visualization, and interpretation. The contextual integrity of each phase, and of the cycle as a whole, should be given at least the same emphasis as it has received in the literature on data privacy (Nissenbaum, 2004), even though the details may differ significantly. In the context of privacy, Nissenbaum (2009) scrutinizes the five ingredients of contextual integrity: contexts, informational norms, actors, attributes, and transmission principles. This editorial would become a book if it were to examine to what extent Nissenbaum’s construct is applicable and modifiable to the data life cycle; however, by making contextual dependence a centerpiece for data science and AI, Spector et al.’s (2022) textbook and Spector’s (2024) article provide both inspiration and impetus for us to engage in that thought process. For example, contexts, value norms, and actors are clearly key ingredients of any phase, and considerations of sharable attributes and transmission principles are essential for advancing among phases, if we want to ensure our data science products are reproducible, replicable, and responsible.

The issue of reproducibility is discussed in Spector et al. (2022) and Spector (2024) as a key component of what they term understandability, though I surmise that the term ‘reproducibility’ is used here for both computational reproducibility and scientific replicability—see the special theme on Reproducibility and Replicability in HDSR for discussions on the distinction. Regardless of the terminology, verifying either computational reproducibility or scientific replicability requires careful contextual considerations to do well. This point is made crystal clear by Borgman (2019), the third of the inaugural trio. Borgman’s discussion of data’s “after lives” reminds us that data curation and preservation are vital for reproducibility, and that “Determining what data to keep, why, how, and for how long, is the challenge of our day.” It is challenging precisely because sensible answers to these questions depend strongly on their context, and thereby require a great deal of idiographic contemplation.

And it is not just the data, but also the metadata, ontologies (see Pasquetto et al., 2019), and the whole data provenance, which is critical also for nomothetic purposes, that is, when we generalize knowledge learned from individual entities to collections perceived to share similarities. Incidentally, a telling example of the importance of data provenance is provided by Jordan’s (2019) reflection on how his contextual understanding of an ultrasound’s data helped his wife to avoid the risk of an unnecessary amniocentesis:

“The problem had to do not just with data analysis per se, but with what database researchers call provenance—broadly, where did data arise, what inferences were drawn from the data, and how relevant are those inferences to the present situation?”

Among all the media taglines that I have come across, the most arresting one is Bloomberg Media’s “Context Changes Everything.” Launched on September 12, 2023, the Bloomberg Media (2023, emphasis in original) tagline is meant to remind all of us that “Context changes how you see things. Context changes how you change things.” The tagline changes my flow of thought every time it reaches my ears, because it makes me reflect on whatever was on my mind. I certainly hope the ongoing emphasis on the centrality of context in data science and AI, as powerfully demonstrated and argued in Spector et al. (2022) and Spector (2024), will have a similar effect. That is, it should help all of us to engage in habitual contextual reflections on what we do as data scientists, data science educators, or simply as citizens of the digital age.

Contextual Understanding in Action

Cornucopia is the title of the second section of HDSR, where we feature content pertaining to impact, innovation, and knowledge transfer. Contextual understanding in such content is often a matter of life or death, or, to use a less dramatic academic term, of acceptance or rejection for (top) journal submissions, (competitive) grant proposals, and so forth. This point is well demonstrated by Lo et al. (2019) in the inaugural Cornucopia section in the context of using machine learning to predict drug approvals (by regulatory agencies such as the Food and Drug Administration of the United States). Lo et al. were by no means the first to use machine learning for these purposes. But their approach outperformed their predecessors’ because they understood that the substantial number of cases with incomplete data in the database should not be discarded, which was (and unfortunately still is) a common practice when a machine learner does not know how to deal with the incomplete cases.

The issue is not merely the sacrifice of information by not using all the available data. The more serious, and indeed often fatal, problem is that by learning only from the cases where data are complete, one is learning only the patterns of these cases, which are systematically different from those with incomplete data whenever the reason for incompleteness is related to the outcome we care about. For example, Lo et al. (2019) reported that “completed trials tend to have greater levels of missingness than terminated trials,” a pattern that certainly should be considered in studying the probability of successful drug approval. While dealing with missing data is never easy, their knowledge of the data-collection process helped Lo et al. form a statistical imputation model to predict what the missing values could have been, before feeding the data into a pattern-seeking algorithm from machine learning.
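To make the two-step strategy concrete, here is a minimal sketch in Python of the generic impute-then-learn pattern. It is emphatically not Lo et al.’s (2019) actual pipeline: the features, the outcome, and the missingness mechanism below are all fabricated for illustration.

```python
# A minimal sketch of the impute-then-learn pattern (not Lo et al.'s
# actual pipeline): fill in missing values with a model-based imputer
# before fitting a classifier, instead of discarding incomplete rows.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))               # hypothetical trial features
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # hypothetical approval outcome

# Make roughly 30% of one feature missing, preferentially among positive
# outcomes, to mimic the informative missingness Lo et al. describe.
mask = (rng.random(500) < 0.3) & (y == 1)
X[mask, 2] = np.nan

# Step 1: statistical imputation; Step 2: a pattern-seeking learner.
# Dropping the incomplete rows instead would train the learner only on
# cases that are systematically unlike those it must predict.
model = make_pipeline(IterativeImputer(random_state=0),
                      RandomForestClassifier(random_state=0))
model.fit(X, y)
print(f"In-sample accuracy: {model.score(X, y):.2f}")
```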

Contextual understanding is also critical for the article “A Self-Supervised Algorithm for Denoising Photoplethysmography Signals for Heart Rate Estimation From Wearables” by Pranay Jain, Cheng Ding, Cynthia Rudin, and Xiao Hu (2024) in the Cornucopia section of the current issue. Photoplethysmography (PPG) signals, like most signals from gadgets, contain both revelatory information and obfuscatory noise, such as distortions due to body movements or sweat. The proposed method takes essentially the same two-step strategy as Lo et al.’s (2019) approach, first detecting and removing the corrupted parts, and then reconstructing them from the uncorrupted parts via self-supervised learning. One can then estimate the heart rate and other useful information from the partially imputed PPG signals. Its applicability will depend on the context: if the distortion mechanism is associated with the outcome of interest (e.g., exercise may raise both heart rate and sweating level), then imputing the original signals in the corrupted segments by learning from uncorrupted regions can produce severely biased results.
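For intuition only, the sketch below caricatures the detect-then-reconstruct idea, with a crude variance test and linear interpolation standing in for Jain et al.’s (2024) learned detector and self-supervised reconstructor; the window size, threshold, and simulated signal are all illustrative assumptions.

```python
# A toy detect-then-reconstruct sketch: flag high-variance windows as
# corrupted, then rebuild them from the surrounding clean samples.
# (Crude stand-ins for Jain et al.'s self-supervised detector/reconstructor.)
import numpy as np

def denoise_ppg(signal, win=100, z_thresh=3.0):
    """Flag windows whose variance is an outlier, then interpolate
    across the flagged samples using only the clean ones."""
    n = len(signal)
    var = np.array([signal[k:k + win].var()
                    for k in range(0, n - win + 1, win)])
    z = (var - var.mean()) / var.std()
    clean = np.ones(n, dtype=bool)
    for k, zk in enumerate(z):
        if zk > z_thresh:                    # window looks corrupted
            clean[k * win:(k + 1) * win] = False
    idx = np.arange(n)
    out = signal.copy()
    out[~clean] = np.interp(idx[~clean], idx[clean], signal[clean])
    return out

t = np.linspace(0, 10, 2000)
ppg = np.sin(2 * np.pi * 1.2 * t)            # idealized ~72 bpm pulse wave
ppg[800:900] += np.random.default_rng(1).normal(0, 3, 100)  # motion artifact
cleaned = denoise_ppg(ppg)
```

Note how even this toy version makes the caveat above visible: the reconstruction borrows patterns from the clean segments, which is exactly what goes wrong when the corruption and the outcome share a cause.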

Education Is a Stepping Stone to Everything

A grand irony of university publications is that the vast majority of them do not have a single article that addresses the educational aspects of the pertaining disciplines, even though the entire university enterprise owes its existence and survival to being in the education business, and teaching and learning are the lifeblood of every academic discipline. (Just imagine how a business or government publication could survive without ever publishing anything on running a business or a government.) The tradition has been so strong that when HDSR announced its trinitarian mission, as a premier research journal, a cutting-edge educational publication, and a popular magazine, I was warned repeatedly that it could not be done. I will leave it to readers to judge how well or poorly HDSR has been fulfilling this trinitarian mission in its first quinquennium. But I am very encouraged and proud to see the wide-ranging topics that have populated the third section, Stepping Stones, designed to feature articles on learning, teaching, and communication in data science.

As a matter of fact, in the last 5 years, Stepping Stones has featured articles on data science education at every possible level, from kindergarten (e.g., Martinez & LaLonde, 2020) to 2-year colleges (Kotz, 2020) to employee training (e.g., Kreuter et al., 2019). The two Stepping Stones articles in the current issue further demonstrate this breadth in coverage and depth in content as research on education.

The article by Kristina Gligorić, Tiziano Piccardi, Jake Hofman, and Robert West (2024), “In-Class Data Analysis Replications: Teaching Students While Testing Science,” documents a much-needed study on incorporating the teaching of reproducibility as an integral part of data science education at the college and university levels. It is a timely study because there have been increasing calls to action insisting that, to effectively address the so-called ‘replication crisis’ in science, we must start with education. This call was made by Marcia McNutt, President of the National Academy of Sciences, in her 2020 article for the special theme on Reproducibility and Replicability in HDSR (McNutt, 2020). The call was also made by the National Academies of Sciences, Engineering, and Medicine (2018) in their report Data Science for Undergraduates: Opportunities and Options; an interview with the authoring committee co-chairs discussing the report was published as the inaugural Stepping Stones article (Haas et al., 2019).

Reproducing what others have done is never easy, especially without full access to what was available to them. The authors of Gligorić et al. (2024) made various choices to facilitate the students’ projects and learning, choices that one may vary depending on the learning environment and one’s educational philosophy, for example, how to balance reducing stress from learning in school against stress testing for resilience in the real world. Nevertheless, the article sets a higher standard for all of us who conduct or care about educational research, namely, to consider stakeholders beyond learners and educators (in this case, scientists) and to contemplate big questions on the ultimate impact of an education program as a stepping stone (e.g., for improving the state of science).

The second Stepping Stones article in this issue asks another big question: What do we know about data science education at the precollegiate level, known as K–12 in the U.S. system? The article “Data Science Learning in Grades K–12: Synthesizing Research Across Divides,” by Joshua Rosenberg and Ryan Seth Jones (2024), notes that the last 5 years have seen particular growth in data science education research and in program and course development at both the precollegiate and postsecondary levels. To provide a broad and informative picture of the landscape of research publications on data science education, the authors propose an agency framework to synthesize articles in three special issues, published respectively in 2020 by the Journal of the Learning Sciences (JLS), in 2022 by the British Journal of Educational Technology (BJET), and also in 2022 by the Statistics Education Research Journal (SERJ).

By agency, Rosenberg and Jones (2024) refer to who and what motivates the instructional design, and they identify three agencies that can account for much of the variation in foci, methods, and languages of the education research articles they examined. For those of us who are more statistically oriented, we can think of this as a qualitative counterpart of ANOVA (analysis of variance) with three factors. The centerpiece of the three agencies is the material agency, which Rosenberg and Jones (2024) characterize by noting that “the material world and the machines we build to control the world influence the types of learning that we value in data science education.” The personal agency and the disciplinary agency then are, respectively, the personal processes and social processes by which humans and communities attempt to capture and control the material agency.

Whereas the three special issues may or may not have been intentionally designed to capture these three agencies, the authors found that the special issues in JLS, BJET, and SERJ focus respectively on the personal, disciplinary, and material agencies. I find such framing rather refreshing, and am inspired to ask for more. For example, putting on my statistician’s hat and thinking along the lines of ANOVA, I am curious about how the within-issue variations compare to the between-issue variation as evidence for the usefulness of the agency framing, and how to make such comparisons qualitatively as a possible direction for deepening this line of scholarship (see the toy sketch below). In general, I eagerly look forward to seeing more scholarly and penetrating research about data science education in the Stepping Stones section (and elsewhere). As education is a stepping stone to everything, we owe it to future generations to investigate the effectiveness of our education approaches and programs, and to improve and innovate, with all conceivable effort.
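To be explicit about the quantitative analogue I have in mind, consider a toy sketch: if each article in the three special issues could be scored on some dimension of interest, a one-way ANOVA would contrast the between-issue variation with the within-issue variation. The scores below are entirely fabricated for illustration.

```python
# A toy one-way ANOVA contrasting between-issue and within-issue
# variation; the per-article "scores" are fabricated for illustration.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
jls = rng.normal(0.0, 1.0, 12)    # hypothetical scores, JLS special issue
bjet = rng.normal(0.5, 1.0, 12)   # hypothetical scores, BJET special issue
serj = rng.normal(1.0, 1.0, 12)   # hypothetical scores, SERJ special issue

# A large F indicates more between-issue than within-issue variation.
f_stat, p_value = f_oneway(jls, bjet, serj)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")
```

The genuinely open question, of course, is how to make such a comparison when the ‘scores’ are qualitative codings rather than numbers.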

Yes, HDSR Is Also for Those Who Love Greek Letters …

The mission to feature everything data science and data science for everyone also means diving into the deep foundational issues in data science and featuring content that can inspire the deep thinkers of the data science ecosystem. This is the focus of the fourth section of HDSR, Milestones and Millstones, devoted to foundations, theories, and methods, where we feature articles as theoretical as those in any specialized theoretical journal. It is worth emphasizing that, in the broad context of data science research, the term ‘theoretical’ is not synonymous with ‘mathematical,’ notwithstanding the tendency in more mathematically enabled fields, such as statistics and computer science, to judge how theoretical an article is by the number of mathematical symbols and formulae it contains, which typically feature Greek letters.

The breadth of the theoretical articles in the Milestones and Millstones section is demonstrated by the two articles in the first issue of HDSR and the one in the current issue. None of the articles in the inaugural Milestones and Millstones section contain any Greek letters, yet they offer insightful theoretical thinking and exploration. Monajemi et al. (2019) delve into massive computational experimentation, while Floridi and Cowls (2019) address ethical principles for AI. The rapid advent of generative AI has brought the two topics to global attention, with both being intensively researched and scrutinized. Again, I must thank the providential blessing that HDSR featured both topics in its very first issue.

In contrast, the current Milestones and Millstones article, “Differentially Private Linear Regression with Linked Data” by Shurong Lin, Elliot Paquette, and Eric D. Kolaczyk (2024), is populated with mathematical formulae and many Greek letters, because differential privacy is a mathematical framework for injecting a controlled amount of noise into data to address the thorny issue of balancing privacy protection with information preservation. The problem addressed in Lin et al. is particularly knotty because the data come with added noise due to errors in linking two data sources. However, such noise can itself have a privacy-protecting effect, in which case we may intuitively wonder whether we could inject less noise than we would for data without linkage errors. But is our intuition sound? And if so, how much less could we inject? Questions of this nature are essentially impossible to answer reliably and confidently without carrying out careful mathematical reasoning and calculation.
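For readers encountering the framework for the first time, the sketch below shows the textbook Laplace mechanism, the simplest instance of injecting controlled noise: it releases a differentially private mean. It is not Lin et al.’s (2024) linked-data regression method, whose noise calibration is far more delicate.

```python
# The basic Laplace mechanism: add noise scaled to sensitivity/epsilon
# so that a released mean is epsilon-differentially private.
# (A textbook sketch, not Lin et al.'s linked-data method.)
import numpy as np

def dp_mean(x, lower, upper, epsilon, rng):
    """Release an epsilon-DP mean of values assumed to lie in [lower, upper]."""
    x = np.clip(x, lower, upper)
    # Maximum change in the mean from altering one record:
    sensitivity = (upper - lower) / len(x)
    return x.mean() + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
data = rng.normal(50, 10, size=1000)        # hypothetical sensitive values
for eps in (0.1, 1.0, 10.0):                # smaller epsilon -> more noise
    print(f"epsilon = {eps:>4}: dp mean = {dp_mean(data, 0, 100, eps, rng):.2f}")
```

Even this toy version displays the trade-off Lin et al. must negotiate: a smaller epsilon buys stronger privacy at the price of noisier, less informative releases, and their question is whether linkage noise can pay part of that price.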

These investigations and results are not for everyone, and indeed not for many data scientists. Even someone like me, who conducts research and lectures on differential privacy, may still need time to fully digest the implications of the mathematical results provided in this article. For example, the article prompted me to ponder more deeply how to take into account the inherent noise in the data as a part of privacy protection. This question is not readily addressable within the currently popular framework of differential privacy, since it treats data as static numerical values, and hence by (intentional) design the framework avoids contextual consideration of the data. However, contextual consideration is essential for understanding, and then possibly taking advantage of, the inherent noise in the data for protecting the information that the data can reveal, which is a more nuanced task than concealing the numerical values in a data set.

But this is the very reason that HDSR publishes articles with highly mathematical content, as long as the subjects are of great importance to data science. Mathematical framing makes it possible for us to be precise and transparent about scope, assumptions, methods, metrics, and results, as well as limitations and blind spots, whether for our own contemplation or for communication with others. Having specialists work out the intricate mathematical results can provide new or deeper insights for the benefit of the broad (data) science community.

Mathematical reasoning and modeling are also crucial for avoiding the waste of human and computational resources caused by brute-force trial-and-error methods. The current race in generative AI, which relies on training with increasingly massive data sets, is environmentally unsustainable due to its insatiable demand for energy. Theoretical understanding, coupled with mathematical reduction, is vital—if not the only way—to address this issue. I have announced many times over the last 5 years that 75% of the articles in HDSR do not contain Greek letters, in an effort to challenge the stereotype that any publication with ‘data science’ in its title is only for those unafraid of mathematics. However, it is equally important to highlight that HDSR is also a platform for highly mathematical content. Therefore, I strongly encourage those who are besotted by Greek letters to consider HDSR as a venue to display their passion, as long as their loving results have broad implications for the data science ecosystem.

Chances Are There Is an Article for Your Interest (or Job) …

To further the goal of communicating everything data science and data science for everyone, HDSR also publishes several thematized columns in each (regular) issue. The inaugural issue featured two columns: Diving Into Data, for mini tutorials on concepts, methods, and tools, and Mining the Past, for brief histories of data science. Over the 5 years, HDSR has developed nine (and counting) thematized columns to serve the diverse inhabitants of the data science ecosystem, across different sectors (e.g., government, industry, education) and different interests (e.g., scientific research, historical events, leisure activities). Four of these columns appear in the current issue 6.3.

In alphabetical order, the first column featured in this issue, Effective Policy Learning, was launched in the Winter 2020 issue and is devoted to data science for policy making and makers. The article by Daniel O’Brien and Kimberly Lucas (2024), “How to Foster a Civic Research Community: Lessons From Greater Boston’s Annual Insight-to-Impact Summit,” is the first of its kind because it focuses on building data-driven collaborations and research communities at the regional level, such as that of cities. (Previous articles in this column have discussed policy and community-building issues at the international, national, and state levels.) I am grateful to the authors for sharing their rich experiences and insights with HDSR’s readership, many of whom are probably unaware of terms such as ‘urban informatics,’ or at least do not know what such terms encompass.

The second column, Meta Data Science, debuts in this Summer 2024 issue with the thought-provoking essay “To Be a Frequentist or Bayesian? Five Positions in a Spectrum,” written by column co-editor Hanti Lin (2024). Even (or especially?) those who have never heard of the terms frequentist and Bayesian may surmise that this is some kind of philosophical pondering. Indeed, it is, as the column is designed for short essays on philosophical issues around the practice and theory of data science, which is full of ethical, epistemological, and metaphysical questions, all domains of philosophy since the dawn of human intellectual pursuits.

I am therefore deeply grateful to the Meta Data Science column co-editors, Sabina Leonelli and Hanti Lin, two ardent philosophers of data science, for their enthusiasm to launch this new column. Echoing their call in their first Column Editors’ Note (found in Lin, 2024), I invite all those who are eager to demonstrate that human intelligence is irreplaceable—or are eager to provide data for training artificial philosophers—to submit their proposals to the column co-editors via the HDSR online submission system.

The third column, Minding the Future, kicked off in the Spring 2020 issue and is designed for building pipelines for data science, especially at the precollegiate level. The article “Bringing Students’ Lives Into Data Science Classrooms” by David Weintrop and Rotem Israel-Fishelson (2024) is a perfect demonstration of the column’s mission and of effective ways to engage (high school) students. Enticed learning is far more effective—and fun—than enforced learning. Among all the benefits discussed or alluded to in the article about using data on students’ lives and interests, the students’ inherent contextual understanding is the most salient one to me, for reasons discussed earlier. Being informed by their own lived experiences (physical and virtual), the students are also in an informed position to be skeptical, and to call BS “BS,” to borrow an effective term from a popular data science course at the University of Washington (Bergstrom & West, n.d.). Such practices alone are a great service to our society, as they can help better prepare future generations to spot and combat misinformation and disinformation, including identifying when these labels themselves are being used to manipulate people’s minds for ulterior gains.

Reinforcing Reproducibility and Replicability is the last column featured in the current issue, and it aims to forge a path to better (data) science. The fastest growing of the columns since its introduction exactly a year ago, it has already published a dozen articles, starting with a collection of eight articles in the introductory special theme in the Summer 2023 issue. The growing content in, and attention to, this column should come as no surprise to anyone who has been deeply concerned by the substantial number of unverifiable or verifiably false results that have appeared in (top) scientific, medical, and more broadly research journals—see the discussions in this column as well as the special theme on Reproducibility and Replicability in the Fall 2020 issue of HDSR.

In the current issue 6.3, Stuart Buck (2024) argues that “We Should Do More Direct Replications in Science” as a way to ensure more reliable scientific results. As the column editor, Lars Vilhuber, writes in his Column Editor’s Note for Buck, direct replication is “an audit, a verification, a test.” Buck argues that more research funding needs to be allocated specifically for direct replication or auditing, which would provide a direct incentive for researchers to not focus only on conducting new studies. Indeed, with the current rate of studies that would fail verification tests and without increased and effective effort to curtail them, racing for new results would only exacerbate the “replication crisis,” especially in view of the increasing reliance on generative AI technologies, whose unreliability for scientific purposes is a whole can of worms in itself.

Buck’s column article is highly thought-provoking—and I also hope at least some funding agencies will find it compellingly action-promoting. It documents arguments both against and for direct replication, and hence is particularly worthwhile reading for those who find one side of the argument obvious (e.g., why would anyone be against verifying a scientific result, given that science is about revealing and understanding the truth?). Again, this editorial is getting very long—and it still needs to get to the moon—and therefore I will not summarize the arguments here. Instead, I merely aim to express my deep appreciation of Buck’s main argument that “only direct replication can help us figure out puzzling anomalies about which contextual factors are important to a given scientific result,” because context changes everything.

The Mission of HDSR—A Moon Shot?

Reflecting on the anticipated and incidental events during HDSR’s groundbreaking first 5 years, I started to wonder if the single-day delay in its launching was more than an incidental reminder of how history can change its course due to random events. We were fully ready to launch on July 1, 2019, but the marketing person commissioned for media outreach unexpectedly took a personal day. Our publisher, MIT Press, recommended that we not launch without synchronized media announcements.

Appreciating their expertise, we followed the sage advice. Nevertheless, I was disappointed, given our intensive efforts to meet the July 1 launch. Unknown to me then, however, July 2, 2019, was the day NASA successfully conducted a critical test for its Artemis program, part of the initiative to return humans to the moon by 2024. While the return has been rescheduled, the coincidence of the two events reminded me of the moon-shot nature of HDSR’s mission: to be a global forum for everything data science and data science for everyone. For those who seek celestial signs, July 2, 2019, also witnessed a total solar eclipse, a powerful reminder of the significance of the moon.

All pseudo-selenology aside, while HDSR celebrates its remarkable achievements in its groundbreaking quinquennium and thanks—literally—an incalculable number of readers for their support, it remains many millions of readers away from its global aspiration. HDSR’s moon shot might forever remain a moon shot. But like many moon-shot projects we cheer on, from curing cancer to combating climate change, every small step counts, as such steps might just one day accumulate into one giant leap (for mankind—for those who are too young to know Neil Armstrong’s moon landing declaration).

Until then, please celebrate HDSR with us by sharing your favorite articles from HDSR with the youngest people you care about and, better still, by encouraging them to become your favorite authors for HDSR and beyond.


Disclosure Statement

Xiao-Li Meng has no financial or non-financial disclosures to share for this editorial.


References

Babaei, G., Banks, D., Bosone, C., Giudici, P., & Shan, Y. (2024). Is ChatGPT more biased than you? Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.2781452d

Banks, D., Albert, L. A., Caulkins, J. P., Frühwirth-Schnatter, S., Greig, F., Raftery, A., & Thomas, D. (2020). A conversation about COVID-19 with economists, sociologists, statisticians, and operations researchers. Harvard Data Science Review, (Special Issue 1). https://doi.org/10.1162/99608f92.7fa08812

Banks, D., Bosone, C., Carpenter, B., Shah, T., & Shi, C. (2024). Large language models: Trust and regulation. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.1be2ab6e

Banks, D., de Melo, G., Gong, S. (Xinwei), Kwon, Y., & Rudin, C. (2024). Data scientists discuss AI risks and opportunities. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.740dcedb

Banks, D., Ellenberg, S. S., Fleming, T. R., Halloran, M. E., Lawson, A. B., & Waller, L. (2020). A conversation about COVID-19 with biostatisticians and epidemiologists. Harvard Data Science Review, (Special Issue 1). https://doi.org/10.1162/99608f92.a1f00368

Bergstrom, C. T., & West, J. (n.d.). Calling bullshit: Data reasoning in a digital world. Retrieved July 31, 2024, from https://callingbullshit.org/syllabus.html

Berman, F., Banks, D., Jordan, M. I., Leonelli, S., & Minow, M. (2024). Amid advancement, apprehension, and ambivalence: AI in the human ecosystem. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.2be2c754

Bloomberg Media. (2023, September 12). Bloomberg Media launches new brand campaign “Context Changes Everything” [Press release]. https://www.bloombergmedia.com/press/bloomberg-media-launches-new-brand-campaign-context-changes-everything/

Borgman, C. L. (2019). The lives and after lives of data. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.9a36bdb6

Buck, S. (2024). We should do more direct replications in science. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.4eccc443

Floridi, L., & Cowls, J. (2019). A unified framework of five principles for AI in society. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.8cd550d1

Gligorić, K., Piccardi, T., Hofman, J. M., & West, R. (2024). In-class data analysis replications: Teaching students while testing science. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.f9720d1f

Haas, L., Hero, A., & Lue, R. A. (2019). Highlights of the National Academies report on “Undergraduate Data Science: Opportunities and Options.” Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.38f16b68

Jain, P., Ding, C., Rudin, C., & Hu, X. (2024). A self-supervised algorithm for denoising photoplethysmography signals for heart rate estimation from wearables. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.8636cb81

Jordan, M. I. (2019). Artificial intelligence—The revolution hasn’t happened yet. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.f06c6e61

Kotz, B. (2020). The opportunities of two-year college data science. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.75aed58b

Kreuter, F., Ghani, R., & Lane, J. (2019). Change through data: A data analytics training program for government employees. Harvard Data Science Review, 1(2). https://doi.org/10.1162/99608f92.ed353ae3

Leonelli, S. (2019). Data governance is key to interpretation: Reconceptualizing data in data science. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.17405bb6

Lin, H. (2024). To be a frequentist or Bayesian? Five positions in a spectrum. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.9a53b923

Lin, S., Paquette, E., & Kolaczyk, E. D. (2024). Differentially private linear regression with linked data. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.6a5d7a87

Lo, A. W., Siah, K. W., & Wong, C. H. (2019). Machine learning with statistical imputation for predicting drug approvals. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.5c5f0525

Martinez, W., & LaLonde, D. (2020). Data science for everyone starts in kindergarten: Strategies and initiatives from the American Statistical Association. Harvard Data Science Review, 2(3). https://doi.org/10.1162/99608f92.7a9f2f4d

McNutt, M. (2020). Self-correction by design. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.32432837

Meng, X.-L. (2020a). COVID-19: A massive stress test with many unexpected opportunities (for data science). Harvard Data Science Review, (Special Issue 1). https://doi.org/10.1162/99608f92.1b77b932

Meng, X.-L. (2020b, January 21–24). Data Science: What is it not? [Keynote address]. ComputeFest 2020, Cambridge, Massachusetts, United States. https://www.youtube.com/watch?v=0CpNoHwVICE

Monajemi, H., Murri, R., Jonas, E., Liang, P., Stodden, V., & Donoho, D. (2019). Ambitious data science can be painless. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.02ffc552

National Academies of Sciences, Engineering, and Medicine. (2018). Data science for undergraduates: Opportunities and options. The National Academies Press. https://doi.org/10.17226/25104

Nissenbaum, H. (2004). Privacy as contextual integrity. Washington Law Review, 79(1), 119–158. https://digitalcommons.law.uw.edu/wlr/vol79/iss1/10

Nissenbaum, H. (2009). Privacy in context: Technology, policy, and the integrity of social life. Stanford University Press.

O’Brien, D. T., & Lucas, K. D. (2024). How to foster a civic research community: Lessons from Greater Boston’s annual Insight-to-Impact summit. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.cfd61a5a

Pasquetto, I. V., Borgman, C. L., & Wofford, M. F. (2019). Uses and reuses of scientific data: The data creators’ advantage. Harvard Data Science Review, 1(2). https://doi.org/10.1162/99608f92.fc14bf2d

Rosenberg, J., & Jones, R. S. (2024). Data science learning in grades K–12: Synthesizing research across divides. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.b1233596

Sanders, N. E., Ulinich, A., & Schneier, B. (2023). Demonstrations of the potential of AI-based political issue polling. Harvard Data Science Review, 5(4). https://doi.org/10.1162/99608f92.1d3cf75d

Spector, A. (2024). Data science and AI in context: Summary and insights. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.cdebd845

Spector, A., Norvig, P., Wiggins, C., & Wing, J. M. (2022). Data science in context: Foundations, challenges, opportunities. Cambridge University Press.

Weintrop, D., & Israel-Fishelson, R. (2024). Bringing students’ lives into data science classrooms. Harvard Data Science Review, 6(3). https://doi.org/10.1162/99608f92.6d2aec03

Wing, J. M. (2019). The data life cycle. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.e26845b4


©2024 Xiao-Li Meng. This editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the editorial.
