
Data Science Experiences and Emotions:  From Lonesome to Awesome

Issue 6.2 / Spring 2024
Published on May 01, 2024

With the launch of issue 6.2, HDSR celebrates its first quintessential milestone: 20 regular and 4 special issues over 5 formative years since its inception on July 2, 2019. Encapsulating my 5-year HDSR experience with an apt adjective would require more prompt engineering than I can afford. I therefore will continuously refer to it as my “start-up” experience, despite a friendly reminder that without having to mortgage my house to launch HDSR, I have not earned the title ‘startupper’ (Meng, 2019).

However, irrespective of its label, my experience in data science, enriched through HDSR, has impacted my emotions and behaviors. My curiosity in everything data science has significantly outpaced my capacity, making me more impatient. There are so many subjects to learn, ideas to explore, and partnerships to build. I am acutely aware that my anxiety could be simply a sign of aging. That very thought, however, only exacerbates my anxiety.

The feeling of being overwhelmed also has enhanced my empathy. I used to be agitated by any reinvented wheel, especially of an inferior kind. I wanted to throw a book at the re-inventor—did you ever read? After reading myriad submissions to HDSR—and their review reports—from essentially all walks of data science (and beyond), I now see wheel reinventing as an evolutional inevitability, especially for a blossoming ecosystem. Similar challenges can emerge in many areas or disguises. When we are charged to provide prompt solutions or circumventions, it is not unlike the situation where we devour unhealthy fast food when we are starving or take an illegal U-turn when we are running late for a job interview. Even for those of us who are fully health conscious or risk averse, pressures of many kinds can push us to boundaries from which normally we would stay away.

No, I am not encouraging or excusing redundant efforts in data science or any endeavors, just as no one should advocate for unhealthy or reckless behaviors. I am merely reminding ourselves to appreciate the behavioral and emotional aspects of—and their variations with the roles we take in—the data science enterprise. As a statistical reader, I was depressed by the excitement displayed in an article on speeding up computation by replacing probabilistic derivations with algebraic manipulation, apparently without realizing that it is a case of Fisher’s fiducial sleight-of-hand. As a statistical researcher, I have analogized distribution forms with Platonic forms for capturing similarities to encourage quantitative contemplations of qualitative concepts, knowing well that whatever deep thoughts I had formed are likely (at best) a deepfake to philosophers.

In either case, my emotions and behaviors formed and shaped my data science experiences. Those who understand my emotions and behaviors, instead of ignoring them, would have an easier time engaging me and distilling more introspection in me for my self-betterment. This experiential and emotional dimension—the 'double E'—of data science is rarely featured in data science publications. Yet multiple articles in this issue remind us of its pivotality in appreciating and nurturing the data science ecosystem. Hence, I venture to thematize the articles in this issue via this EE-prism.

A Data Visualization Journey Starts With an Emotional Response . . .

Among all the ways to display data or to tell a data story, a well-designed data visualization must be the first choice of many. An aesthetically pleasing display can immediately grab people’s attention, even those who have data phobia. But most importantly, we humans apparently associate our comprehension and mind power with vision more than any other sense, as partially evident in the English language. When we understand something, we say ‘I see,’ not ‘I taste’ or ‘I smell.’ When we want to separate essences from secondaries, we tell ourselves ‘Let’s see the big picture,’ not ‘Let’s smell the strong scent.’ We say ‘It is my view…,’ ‘Here is my vision…,’ ‘In the light of…,’ ‘An eyewitness account…’—well, you get the picture. (Indeed, the connection between knowledge and light is at the core of Illuminationism; see Chapter 27 of Adamson, 2019.)

Given we are almost genetically wired to be persuaded or fooled by what we see, how to present a data visualization effectively and scientifically, or at least to avoid misleading impressions, is critical, especially for communicating evidence to the public. An effective data visualization must pick and choose what to present in order to be, well, effective. But how are such choices made in practice? What are the guiding principles? The article by Kathleen Gregory, Laura Koesten, Regina Schuster, Torsten Möller, and Sarah Davies (2024), “Data Journeys in Popular Science: Producing Climate Change and COVID-19 Data Visualizations at Scientific American,” investigates such issues with empirical evidence guided by the conceptual framework of data journeys, which refers to a myriad of human, technical, and institutional elements working together “to help data to travel across space, time, and social contexts,” and to inform and persuade via data visualization.

The EE dimension enters the data journeys from the very beginning. As Gregory et al. (2024) reported, Scientific American initially chose story ideas based on its staff members’ interests and events in their personal lives, and then “Editors use their own emotional responses as an initial gauge to estimate the reception of a story.” As a staff member responded during a video conference interview conducted by Gregory et al. (2024), “There’s the data, but there's an infinite number of stories we could tell […] If we have an emotional response, that’s kind of the first level of oh, maybe this is something that people will also respond to.”

Such an approach might be seen as most ironic for a magazine that aims to popularize science and scientific methods, because estimating public reaction by personal emotions is as anti-scientific as one can get, at least from the perspective of statistical inference. But this is precisely why we need to explicitly recognize the roles emotions and experiences play in the data science ecosystem, which does not evolve according to a scientific prescription or any other kind of idealized prescription. Rather, the data science ecosystem is driven by countless human activities, decisions, and interactions (with each other and with machines), which in turn are influenced by our experiences and emotions.

From Museum Visitors to Museum Users (and Donors)

Awesome is an awesome phrase. But I don’t recall having seen it in any scientific publication. Whereas this could entirely be due to my lack of reading (or memories), HDSR must have set a Guinness World Record in terms of the frequency of the word awesome appearing in a single publication in data science. In the interview I conducted, “A Conversation with Tim Ritchie, President of the Museum of Science, Boston,” the phrase ‘awesome’ appeared 17 times (Ritchie & Meng, 2024). No, the frequency was not due to two middle-aged men trying to impress each other (and the audience) by recalling their teenage exuberances. Rather, it is at the core of President Ritchie’s vision for a science museum: “We want people to come here and believe that science is awesome, and we want them to believe that they are awesome and that their ability to use science can make for an awesome future.”

I was simply awed by the trinity of awesome in this one sentence, and even more so by what Boston’s Museum of Science has accomplished under the leadership of President Ritchie, especially considering that he took the helm at arguably the worst time for any museum, February 2020, when a global pandemic was in the making. I visited the museum when I was a student, and then again when my children were elementary students. I always enjoyed my visits, but the feeling of awe came only recently, when President Ritchie kindly provided the HDSR team a tour as a part of our discussion about potential partnerships between the Museum of Science and HDSR. Thoughts of a partnership were inspired not least by the fact that the museum’s emphasis on “Here, science is a life-changing journey of discovery for everyone” (Museum of Science, 2020) reflects the same commitment to engaging the public as HDSR’s aspiration to feature everything data science and data science for everyone.

My sense of awe was generated by the vastness of the exhibitions in time and space, from colossal fixtures that could have crushed Homo sapiens (but they didn’t) to invisible algorithms that some fear may take over the human race (but they won’t). More importantly, programs such as Data Choreographics and Train an AI program (Museum of Science, 2022) are no longer exhibitions, but interactive experiences, transforming those like me from an awed visitor into an ardent learner.

And that is where President Ritchie’s vision is heading:

“I'd like a museum not to be a place you visit. I like it to be a place you use, like you use a library. … You could use the Museum of Science, because we did have data visualization things, for instance, that were updated all the time.” (Ritchie & Meng, 2024)

Providing and sustaining such user or learning experiences, however, is extremely expensive. Fundraising, therefore, was another frequent phrase that appeared in this interview. President Ritchie emphasized the different natures of data for different revenue-generating mechanisms and offered the most succinct insight for successful fundraising: “People give to what they value when they’re asked by somebody they trust” (Ritchie & Meng, 2024). Data play a key role in identifying people with resources, what they value, and whom they trust. As President Ritchie pointed out, museums are still among the most trusted institutions, despite the decline of public trust in other institutions such as higher education. Sustaining such trust by creating more awesome experiences for the public therefore forms a virtuous cycle, an ultimate mechanism for the viability and longevity of a public institution. Any leader who can build a virtuous cycle deserves an enthusiastic awesome!

Supporting Lonely Data Scientists to Become Community Builders

On the day prior to writing this editorial, I was given the opportunity to speak in a large ballroom at the Open Data Science Conference (ODSC) East 2024. I chose the topic “Being, Training, and Employing Data Scientists: Wisdoms and Warnings from Harvard Data Science Review” for an obvious reason—it was a great opportunity to inform the broad data science community, especially those who work outside of the academic world, about HDSR. HDSR’s mission of connecting people from all walks of the data science community apparently resonates well with many audiences. After my talk, a long line formed for my signature on the commemorative issue of HDSR and for additional Q&A.

A young data scientist sought my advice on how to deal with his loneliness as the sole data scientist in a small firm. He said he was not alone in that aspect, because he is aware of similar situations for several data scientists in other similarly sized firms. Apparently, these firms hire a data scientist because everybody else is doing the same, but they do not really have a clear—or even rough—plan or organizational surrounding for the hired.

Another data scientist had a similar question. She is in a medium-sized firm with about half a dozen data scientists. But they do not meet as a group, and all her meetings have been one-on-one with her boss, who has never been clear about what the group’s mission is and seems uninterested in building anything that can make the group more than a collection of siloed individuals.

This issue and feeling of isolation is neither new, nor unique to industry. The American Statistical Association (ASA) has an interest group, Isolated Statisticians, which has helped to build networks for statisticians in departments with sole or very few statisticians since 1991. In 2009, I was invited to present “Harvard Experiments” (on statistical education and pedagogies) to the New England Isolated Statisticians Meeting held at Pine Manor College (which sadly closed in 2020), and hence had some in-depth discussions with isolated statisticians on their struggles and wish lists.

Fortunately—or perhaps rather unfortunately—I did not need to rely solely on what I had learned 15 years ago to answer these two data scientists. I also referred them to “In the Academy, Data Science Is Lonely: Barriers to Adopting Data Science Methods for Scientific Research” by Elle O'Brien and Jordan Mick (2024) for more timely advice and suggestions. This article was based on investigating the loneliness issue in one university, and it is with respect to those who “lacked the expertise to confidently implement and interpret new methods” in data science. Nevertheless, some of the lessons learned and the corresponding recommendations have broad implications.

For example, a common suggestion for addressing the issue of professional loneliness, just as in dealing with isolation in social life, is for the isolated individuals to look for the like-minded or similarly situated to build a network of their own. However, as O'Brien and Mick (2024) reported, “Many sought to bootstrap a community on their own, with mixed results.” They therefore direct their recommendations to the institutions (e.g., departments, institutes, and universities) instead of the individuals, such as putting emphasis on shared interest or process (e.g., of learning or managing) instead of professional identities (e.g., specialties or skill levels) when organizing networking events.

The ASA’s Isolated Statisticians interest group was a grassroots effort started by about a dozen statisticians in non-statistics departments. Its success and sustainability over three decades are due to both the ongoing efforts made by generations of isolated statisticians and the institutional support by the ASA. It is a good example to echo O'Brien and Mick’s (2024) call for institutional support, with which we can not only reduce professional loneliness, but also motivate lonely data scientists to become community builders.

Expanding Our Community for Reducing (the Anxiety Over) Irreproducibility

The column article “The Role of Third-Party Verification in Research Reproducibility” by Christophe Pérignon (2024) discusses another issue that can be addressed by expanding our community: improving the ability and quality of reproducibility verification, an issue that has often frustrated me both as a researcher and as an editor. Like any researcher who cares about their professional integrity and reputation, it is in my self-interest to avoid publishing any irreproducible results. Here, ‘reproducibility’ refers to the ability to achieve consistent results between those reported in an article and those obtained from a verification run using the same data and methods as originally reported. This requirement is considerably less stringent than scientific replicability (see the special theme on reproducibility and replicability in HDSR) as it focuses solely on the internal consistency of an article, rather than its broader scientific validity.

During my PhD days, I learned the hard way that ensuring reproducibility is far more challenging than one might initially think upon hearing the phrase ‘verifying numerical results.’ A particularly striking example occurred when I needed to compute a symmetric matrix—a covariance matrix essential for uncertainty quantification—by multiplying two matrices (Meng & Rubin, 1991). It is extremely unlikely for errors in calculating two matrices to fortuitously cancel each other out and yield a symmetric product. Consequently, I was very confident that any symmetric result must be correct. However, one day, while checking something unrelated, I accidentally discovered, to my horror, that I had made rather ‘cunning’ double coding errors which coincidentally produced symmetry, but not the correct results. To this day, the thought of how close I came to publishing my first major article with erroneous results—which likely would have passed the review process without independent verification—still sends chills down my spine.
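How a symmetry check can be fooled is easy to see in a toy sketch. The Python below is not the computation from Meng and Rubin (1991); the matrices, the scaling mistake, and the helper names are all hypothetical, chosen only to show that two matched errors can yield a product that is symmetric yet wrong.

```python
def matmul(a, b):
    """Multiply two small matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def is_symmetric(m):
    """A matrix is symmetric when it equals its own transpose."""
    return m == transpose(m)

# Hypothetical target: a factor times its transpose, which is always symmetric.
factor = [[1, 2], [0, 1]]
correct = matmul(factor, transpose(factor))

# Matched double error (assumed for illustration): the second column of the
# left factor and the second row of the right factor are both doubled.
left_wrong = [[1, 4], [0, 2]]
right_wrong = [[1, 0], [4, 2]]
wrong = matmul(left_wrong, right_wrong)

print(correct, is_symmetric(correct))  # [[5, 2], [2, 1]] True
print(wrong, is_symmetric(wrong))      # [[17, 8], [8, 4]] True
print(wrong == correct)                # False
```

Both products pass the symmetry check, so the check alone cannot distinguish them; only a comparison against an independently obtained result exposes the error.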

With several such lessons in mind, I am always nervous about publishing numerical results without independent verification. It therefore frustrates me constantly and greatly that I no longer have the time or skills to independently verify the numerical results produced by my co-authors. I surely trust them far more than I trust myself for anything involving computational implementation—but trusting is not a means to ensure reproducibility. Multiple self-checks by my co-authors help to tame my anxieties, but again, they do not give me peace of mind as an independent verification would.

Relying on reviewers to ensure reproducibility is generally unrealistic. Very few reviewers have the time to do so, even if they are willing in principle. Occasionally reviewers and editors can identify issues by chance, as I recently experienced in a final editor-in-chief check on a submission. A formula displayed with unnecessary complications led to my inquiry with the authors, which encouraged the authors to exercise due diligence to double-check all numerical results, revealing substantial irreproducibility and extreme sensitivities of some of the investigated methods to data. These issues would all have been overlooked if not for the awkwardly displayed formula that happened to catch my attention.

I recount these personal experiences to stress how pleased I am to see Pérignon’s (2024) article and, more importantly, the Certification Agency for Scientific Code and Data (cascad), which he co-founded with Christophe Hurlin. The cascad agency provides independent reproducibility verification for researchers prior to submission and assists other scientific stakeholders—such as journals and funding agencies—in verifying the reproducibility of the research they publish or support. While the current checks at cascad may not detect irreproducibility caused by subtle coding errors (such as the symmetrizing double coding errors that I accidentally created), the agency is a milestone achievement, one that surely will reduce my anxiety over irreproducibility once I can use it for my presubmission verification.

From ChatGPT Users to Generative AI Researchers

The arrival of ChatGPT has brought back some ‘coding emotion’ I used to experience before it became painfully clear that having me code was a waste of everyone’s time. Communicating with ChatGPT in general juxtaposes more emotions than encountered during a bittersweet moment, since being emotionally engaged by a machine—whether we realize it or not—is a feeling of its own category. As reported in a pair of XL-Files (Meng, 2023a, 2023b), I have experimented with ChatGPT for editorial work, research writing, literature search, and more. I wasted more time than I saved. I was too emotionally engaged, and I have little insight on how to prompt effectively.

Of course, many have had similar experiences. In my department, a young colleague organized a 10-week summer reading of many aspects about ChatGPT (as listed in Meng, 2023b), so we can learn from each other. The fruitfulness of this reading group is partially demonstrated by the article “AI and Generative AI for Research Discovery and Summarization,” written by two participating members, Mark Glickman and Yi Zhang (2024). As the title suggests, their article investigates the use of ChatGPT and more broadly generative AI tools for summarizing key points in research articles, and for simulating abductive reasoning, “which provides researchers the ability to make connections among related technical topics.” Such research tools are obviously very welcome, as long as the summaries and reasonings are reasonably reliable. Human oversight and judgment are always needed, but obviously a hallucinated summary would be only of negative value.

Glickman and Zhang’s (2024) article therefore confronts the issue of hallucination before presenting any other findings. After an overview of the research on hallucination and methods to reduce it, they emphasize the importance of human oversight and of employing a combination of methods—many of which are reviewed in the article—to ensure our generative AI-aided research findings are trustworthy.

Perhaps not coincidentally, another article in this issue investigates the trustworthiness of ChatGPT. By asking “How Is ChatGPT’s Behavior Changing Over Time?” Lingjiao Chen, Matei Zaharia, and James Zou (2024) compared the performance of two versions of ChatGPT—GPT-3.5 and GPT-4, the March 2023 version and the June 2023 version of each—for seven tasks, ranging from solving math problems to visual reasoning. The findings may be surprising, even disturbingly so, to people who have not had sufficient experience with ChatGPT or the like.

First, neither the version update nor the passage of time guarantees improvement. For example, the June 2023 version of GPT-4 performed significantly worse on some mathematical tasks than its March 2023 version, whereas for GPT-3.5, it was the opposite. And “both GPT-4 and GPT-3.5 had more formatting mistakes in code generation in June than in March” (Chen et al., 2024). Second, performance changes can be rather significant (e.g., over 30 percentage points) in such a short period. If we can consider ChatGPT versions as ‘beings,’ then we can say that they are very moody and emotionally volatile. Whether or not we can eventually increase their EQ without reducing their IQ, Chen et al.’s (2024) finding highlights the critical need to continuously monitor their behaviors, which at least partially reflect the behaviors of the humans behind the algorithms (e.g., the human intervention to make ChatGPT politically more sensitive).

Appreciating Data Users’ Frustration and Data Curators’ Impediment

The U.S. Census Bureau has a daunting, if not impossible, task. The bureau is mandated by the United States Constitution to provide accurate decennial census data, and bound by Title 13 of the United States Code to protect data privacy for individuals and businesses. In principle it is possible to balance the need of providing informative aggregated data while sufficiently preserving individual privacy, with differential privacy (DP) providing a mathematical framework for such a balancing act. But implementing it in practice, especially for extremely complex projects like U.S. censuses, involves a myriad of challenges, from technical hurdles to political impediments, as detailed in HDSR’s special issue on differential privacy for the 2020 U.S. Census.

On the technical side, because DP protects privacy by adding a controlled amount of noise to the confidential census data, the resulting data, known as the Noisy Measurement File (NMF), will contain implausible measurements such as negative counts. Such measurement errors do not unduly impose challenges for specialists such as statisticians and quantitative social scientists, who want access to the NMF because it will allow them to conduct statistically valid analyses that properly account for DP-induced uncertainties. But for general consumption of the census data (e.g., for resource allocations at some local levels), anomalies like negative counts would create problems ranging from perceptual to operational. The bureau therefore implemented a complex TopDown Algorithm (Abowd et al., 2022) to further process the NMF so the resulting data will satisfy a series of requirements, including eliminating negative counts.
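A minimal simulation can make the appearance of negative counts concrete. The Python sketch below assumes Laplace noise with scale 1/epsilon, a textbook DP mechanism for counts, and uses simple clamping at zero as a crude stand-in for post-processing. Both are assumptions made for illustration: this is neither the bureau's production mechanism nor the TopDown Algorithm, and the function names and parameter values are hypothetical.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponentials with mean `scale`
    # follows a Laplace distribution centered at zero.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def simulate(true_count, epsilon, n_reps, seed=0):
    """Return noisy measurements of one count and their clamped versions."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon  # assumed: a count query has sensitivity 1
    noisy = [true_count + laplace_noise(scale, rng) for _ in range(n_reps)]
    clamped = [max(0.0, v) for v in noisy]  # naive removal of negative counts
    return noisy, clamped

# Hypothetical example: a block whose true count is zero.
noisy, clamped = simulate(true_count=0, epsilon=0.5, n_reps=20000)
share_negative = sum(v < 0 for v in noisy) / len(noisy)
mean_noisy = sum(noisy) / len(noisy)
mean_clamped = sum(clamped) / len(clamped)

print(f"share of negative noisy counts: {share_negative:.2f}")
print(f"mean of noisy counts:           {mean_noisy:.2f}")
print(f"mean of clamped counts:         {mean_clamped:.2f}")
```

Under these assumptions, roughly half of the noisy measurements of an empty block are negative. Their average stays close to the true count, whereas the clamped values are pushed upward, which hints at why removing negative counts trades one problem for another.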

This further processing, however, creates new problems. For example, for researchers who care about proper assessment of the uncertainty resulting from the census privacy protection step, the processing by the TopDown Algorithm makes the uncertainty assessment significantly more challenging than it is with the NMF. This is because while the TopDown Algorithm is still technically transparent, being a sequence of rules that do not always follow statistical principles, it lacks statistical intelligibility for inference purposes (Gong & Meng, 2020).

The demand to have access to the NMF from the research community therefore is understandable. Yet the process of getting access to the NMF has been a frustrating one to at least some of the research community, resulting in a series of efforts, from an open letter to litigation. McCartan, Simko, and Imai (2023) provided a brief account of this history and reported four key obstacles for using the released NMF. Whereas some of the obstacles are specific to how the NMF was created, the issue of lack of proper documentation is a common source of frustration that most data users have experienced at one time or another. McCartan et al. (2023) discuss strategies they adopt to address or circumvent these obstacles and make four corresponding recommendations to the U.S. Census Bureau that they believe can significantly improve user experience, enabling users “to focus on developing methods to analyze noisy census measurements, and answering substantive research questions.”

Because HDSR was created as a forum for direct exchanges among all stakeholders of data science, which include data users and data creators and curators, I contacted the U.S. Census Bureau to offer an opportunity to respond to McCartan et al. (2023). This inquiry led to an invitation to John Abowd, who was in charge of the bureau’s 2020 Census efforts and pioneered the use of DP for census data privacy protection. The fact that Abowd has completed his bureau appointment permitted him to respond without the usual constraints—or the lengthy clearance process—for governmental officials. His commentary, “Noisy Measurements Are Important; The Design of Census Products Is Much More Important” (Abowd, 2024) provides an insightful look into the obstacles the bureau faces as a data curator. Issues range from resources and task prioritization to releasing far more data than the official census products. Abowd invites the research community to help to better design the census official products, in addition to helping improve the release of the NMF. Together with McCartan et al.’s original (2023) article and their (2024) rejoinder, this public exchange demonstrates the need and benefit of direct communication between data users and data curators. Such communications also support the U.S. Foundations for Evidence-Based Policymaking Act of 2018 (2019; often referred to as the Evidence Act) mandate for the governmental agencies to directly engage the users of the data supplied by the agencies, a central topic for HDSR’s special issue on Democratizing Data.

Enhancing Learning Experience With Relatable Content

With the public awareness of the benefit of data science and AI and the corresponding employment opportunities, we mostly have passed the era when we had to work hard to persuade and recruit students to study data-rich disciplines. As an anecdote, my department now has over 250 concentrators (Harvard University’s term for majors) compared to fewer than 10 it had 20 years ago when I became the department chair. However, this does not automatically imply that we are doing a better job teaching data-rich courses. Rather, even if we had done a terrific job in teaching them before, the greatly increased student body should push us to do even better, especially regarding providing beneficial learning experiences to students from much more diverse backgrounds than ever before.

The article by Rotem Israel-Fishelson, Peter Moon, Rachel Tabak, and David Weintrop (2024), “Understanding the Data in K-12 Data Science,” serves as a timely reminder that we have a lot to do to meet such needs. As they emphasize, “the data sets used in K-12 data science curricula directly shape the way students are introduced to the discipline and shape their impressions of the importance and relevance of the field.” Unfortunately, we are doing a rather poor job in selecting the appropriate data sets to inspire and engage students, as Israel-Fishelson et al. (2024) reported after analyzing nearly 300 data sets—with respect to their topics, recency, size, and proximity—in four high school curricula for introductory data science education. This finding is unsurprising, unfortunately, because most data sets used in classrooms are chosen by the textbook authors and by those who teach, and without consultation with the students.

I gather I won’t offend too many of my fellow teachers and faculty members (anywhere) by saying that when we choose a textbook for a course, we pay the most attention to its topic coverage, difficulty levels, how well it is written, and so on. These are all critical considerations, certainly. We also pay attention to the broad nature of the data and examples if we teach a course with a target student group. When I taught an introductory statistical course to students in life sciences, most data and examples were from the biological and medical world. But we pay much less attention to the specific choices of data sets and examples with respect to the background and interest of the students we teach. Doing so, especially doing so well, would require at least some direct input from the students to understand their interest and to have their feedback on what works and why. Currently, this is not a common practice, as far as I am aware. I therefore particularly appreciate the call by Israel-Fishelson et al. (2024) “for future revisions and innovations in data science instruction that better situate instruction in the data-rich lives of today’s students.”

One way to achieve this goal is to have many more textbooks—with many different topic coverages and material levels—that come with very diverse selections of data sets and corresponding projects. A great example of such textbooks is Veridical Data Science by Bin Yu and Rebecca Barter (2024), as reviewed by Yuval Benjamini and Yoav Benjamini (2024) in this issue of HDSR. As Benjamini and Benjamini summarize, a special feature of the book is that it features “many data projects, case studies and data-based exercises, ranging from environmental, medical and social sciences to shopping online,” which “allows the reader to develop familiarity with the particular scientific questions and data collection issues of each data set.” The fact that this is an all-level book and well written makes it even more appealing because it can benefit “a wide and diverse readership,” as emphasized by Benjamini and Benjamini (2024). The broad readership is especially important to convey the central message of the book, that is, regardless of the topic or people involved, findings and conclusions from a data science endeavor need to be demonstrably trustworthy.

This goal is concomitantly a minimal and a very high bar. It is a minimal bar because who would want any scientific exercise to produce non-trustworthy results? (Even those with ulterior motives or nefarious intent still don’t want to be fooled themselves, especially if they want to fool or harm others.) It is a very high bar because whenever data science is needed to address a problem, we are almost surely in a situation where empirical evidence is essential, and empirical evidence is never foolproof, as otherwise it wouldn’t be labelled empirical. To demonstrate the trustworthiness of a data science finding therefore takes a host of human engagements (in contrast to letting the machine learn or run), including careful planning, execution, stress testing, replication, verification, triangulation, interpretation, communication, and more. The stability check routinely implemented in and encouraged by Yu and Barter (2024) is a stress test, from which one can assess not only the trustworthiness of the empirical results, but also of the employed models, methods, and algorithms.

Among all the ways to conduct empirical studies, randomized controlled trials (RCTs) are considered the gold standard because they are designed to statistically eliminate confounding factors of any kind, whether or not one understands or even realizes their existence. The Diving into Data column article in this issue, “Decoding Randomized Controlled Trials: An Information Science Perspective” by Pavlos Msaouel (2024), provides a very informative tutorial. It is very informative not only because of its rich content and lucid writing, but also because its information science perspective makes a key statistical quantity for RCTs, namely, the p value, more relatable for the broad scientific community and even the general public.

Specifically, by converting p values into s values, which aim to measure information bits, we can help investigators better understand how strong or weak the information provided by an RCT is. The concept of an information bit can be further explained in intuitive ways via the game of 20 questions, where each bit of information can be perceived as the maximum information one can gain (on average) by asking each question. Indeed, this is how Britannica Kids explains information theory (see Encyclopædia Britannica, 2020).
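For readers who like to see the arithmetic, here is a minimal sketch of the conversion, assuming the standard definition of the s value as the negative base-2 logarithm of the p value. The function names `s_value` and `questions_needed` are illustrative only and do not come from Msaouel (2024).

```python
import math


def s_value(p: float) -> float:
    """Bits of information (surprisal) against the tested hypothesis: s = -log2(p)."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    return -math.log2(p)


def questions_needed(n_items: int) -> int:
    """Fewest yes/no questions that always suffice to single out one of n_items.

    Each yes/no answer carries at most one bit, so we need ceil(log2(n_items))
    questions; integer bit_length avoids floating-point rounding.
    """
    if n_items < 1:
        raise ValueError("n_items must be at least 1")
    return (n_items - 1).bit_length()


if __name__ == "__main__":
    for p in (0.5, 0.05, 0.005):
        print(f"p = {p}: s = {s_value(p):.2f} bits")
    print("Items distinguishable with 20 questions:", 2 ** 20)
```

Under this definition, a p value of 0.05 carries only a little over 4 bits, roughly as surprising as seeing four heads in a row from a coin presumed fair, whereas 20 well-chosen yes/no questions can, in principle, distinguish among more than a million possibilities.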

A further example of the effectiveness of engaging learners via relatable content is the exquisitely produced tutorial on information science by Grant Sanderson (a.k.a. 3Blue1Brown [2022], a YouTube channel that I highly recommend to viewers of any age), as mentioned in Msaouel (2024), which uses the game Wordle to engage the public. I found it accidentally when I was seeking an engaging video to keep me on a treadmill. By the time the treadmill ended without me forcing it to do so, I had already formed an idea of how to explain the privacy loss budget in differential privacy, the so-called epsilon value, to policymakers and the public. Privacy loss is information gained by the hacker. While ‘epsilon’ is Greek to many, games such as Wordle or 20 questions are relatable even to those who have never played them (apparently I enjoyed editing and writing for HDSR so much that I had never heard of Wordle until watching Sanderson’s tutorial).

A Grand Finale: From Data Users to Data Connoisseurs

As usual, this editorial is getting too long. Reading all these great articles has a similar emotional effect on me as attending a great dinner or wine tasting event with many friends; there is always one more story to share or hear. To thank all of you who have read this far, I saved an article from this issue that is particularly dear to my heart because it bridges two of my three favorite subjects: data science and enology. (I will save the third one for a future editorial, when a proper theme arises.)

The article, “Needs and Opportunities for Data in Wine: Boots-on-the-Ground Solutions from Napa to Maryland,” by Cathy Huyghe (2024), is about more than using data for wine. Huyghe, being the co-founder and CEO of Enolytics, a company that aims to empower wineries with data, as well as an avid writer in the intoxicating wine world, is uniquely positioned to highlight the similar challenges in dealing with data and wine. These include that both are complex, that both need specialized training to engage with or even appreciate, and that both are hard to learn.

Indeed, a key reason that I appreciate wine as a statistician is that my wine experience is determined by a myriad of factors, most of which vary greatly. Wine quality (and quantity) is significantly impacted by macro factors such as climate change and micro elements such as yeast. But even with the same wine, the bottle-to-bottle variations can be the difference between toasting and tossing, all depending on how the wine was bottled, transported, and stored. Worse (or better, depending on what one values), wine experience is an emotional affair, since it also depends on the occasion and with whom one is sharing a bottle. As I wrote in another editorial, information and uncertainties are two sides of the same coin, variations (Meng, 2020). Wine, therefore, is a perfect subject for challenging statistics and, more broadly, data science.

But this clearly does not mean that one must know wine like a wine master in order to have an enjoyable wine experience. There are plenty of wine connoisseurs who have never had formal training or even informal training. Similarly, one can develop good sense or even acumen about data without a formal degree in data science. And indeed, many editorial board members for HDSR do not have a degree that is generally perceived as a data science degree (e.g., computer science, statistics, X-informatics). By demystifying wine and data side by side, Huyghe’s (2024) article provides suggestions on how “wine people” can become “data people.” This editorial is reaching 6,000 words, so I had better leave readers to dive into Huyghe’s article instead of reflecting on these suggestions. Inspired, however, I venture to make a big toast to the data science community: May more data users become data connoisseurs.


Disclosure Statement

Xiao-Li Meng has no financial or non-financial disclosures to share for this editorial.


References

3Blue1Brown. (2022, February 6). Solving Wordle using information theory [Video]. YouTube. https://www.youtube.com/watch?v=v68zYyaEmEA

Adamson, P. (2019). Medieval philosophy: A history of philosophy without any gaps (Vol. 4). Oxford University Press.

Abowd, J. (2024). Noisy measurements are important; The design of census products is much more important. Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.79d4660d

Abowd, J., Ashmead, R., Cumings-Menon, R., Garfinkel, S., Heineck, M., Heiss, C., Johns, R., Kifer, D., Leclerc, P., Machanavajjhala, A., Moran, B., Sexton, W., Spence, M., & Zhuravlev, P. (2022). The 2020 Census Disclosure Avoidance System TopDown Algorithm. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.529e3cb9

Benjamini, Y., & Benjamini, Y. (2024). A review of “Veridical Data Science” by Bin Yu and Rebecca L. Barter. Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.3515c52d

Encyclopædia Britannica. (2020, March 18). Information theory. Britannica Kids. https://kids.britannica.com/students/article/information-theory/275060

Chen, L., Zaharia, M., & Zou, J. (2024). How is ChatGPT’s behavior changing over time? Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.5317da47

Foundations for Evidence-Based Policymaking Act of 2018, Pub. L. No. 115-435, 132 Stat. 5529 (2019). https://www.congress.gov/bill/115th-congress/house-bill/4174

Glickman, M., & Zhang, Y. (2024). AI and generative AI for research discovery and summarization. Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.7f9220ff

Gong, R., & Meng, X.-L. (2020). Congenial differential privacy under mandated disclosure. In Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference (pp. 59–70). Association for Computing Machinery. https://doi.org/10.1145/3412815.3416892

Gregory, K., Koesten, L., Schuster, R., Möller, T., & Davies, S. (2024). Data journeys in popular science: Producing climate change and COVID-19 data visualizations at Scientific American. Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.141c99cf

Israel-Fishelson, R., Moon, P., Tabak, R., & Weintrop, D. (2024). Understanding the data in K-12 data science. Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.4f3ac3da

McCartan, C., Simko, T., & Imai, K. (2023). Making differential privacy work for census data users. Harvard Data Science Review, 5(4). https://doi.org/10.1162/99608f92.c3c87223

McCartan, C., Simko, T., & Imai, K. (2024). Rejoinder: We can improve the usability of the census Noisy Measurements File. Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.f9f4b9a4

Meng, X.-L. (2019). Data science: A startup. Harvard Data Science Review, Commemorative issue.

Meng, X.-L. (2020). Information and uncertainty: Two sides of the same coin. Harvard Data Science Review, 2(2). https://doi.org/10.1162/99608f92.c108a25b

Meng, X.-L. (2023a). XL-Files: ChatGPT — First contact. IMS Bulletin, 52(3), 10–11. https://imstat.org/2023/03/31/xl-files-chatgpt-first-contact/

Meng, X.-L. (2023b). XL-Files: Tenure by GPT-n — Make it or fake it. IMS Bulletin, 52(6), 12–14. https://imstat.org/2023/08/31/xl-files-tenure-by-gpt-n-make-it-or-fake-it/

Meng, X.-L., & Rubin, D. B. (1991). Using EM to obtain asymptotic variance-covariance matrices: The SEM algorithm. Journal of the American Statistical Association, 86(416), 899–909. https://doi.org/10.2307/2290503

Msaouel, P. (2024). Decoding randomized controlled trials: An information science perspective. Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.49e3c961

Museum of Science. (2020, October 30). Exhibits. https://www.mos.org/visit/exhibits

Museum of Science. (2022, September 15). Museum of Science, Boston reveals how artificial intelligence is everywhere in new exhibition, exploring AI: Making the invisible visible [Press release]. https://www.mos.org/press/press-releases/Exploring-AI-Making-the-Invisible-Visible-Exhibit-Announcement

O’Brien, E., & Mick, J. (2024). In the academy, data science is lonely: Barriers to adopting data science methods for scientific research. Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.7ca04767

Pérignon, C. (2024). The role of third-party verification in research reproducibility. Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.6d4bf9eb

Ritchie, T., & Meng, X.-L. (2024). A Conversation with Tim Ritchie, president of the Museum of Science, Boston. Harvard Data Science Review, 6(2). https://doi.org/10.1162/99608f92.dc0540dc

Yu, B., & Barter, R. L. (2024). Veridical data science. MIT Press.


©2024 Xiao-Li Meng. This editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the editorial.
