
AI (Agnostic and Independent) Data Science: Is It Possible?

Issue 5.3 / Summer 2023
Published on Jul 27, 2023

Preparing a last-minute editorial to weave all articles in an issue into a meaningful theme has been my quarterly mental gymnastics, one that I anticipate just as I would an intense physical workout. I seek inspiration, the difference being merely food for thought versus thoughts for food. However, I never expected an interstellar inspiration, until receiving “The Count is Up to 360 Spherules!” (Diary of an Interstellar Voyage, Report 41, July 21, 2023), where my colleague Avi Loeb recounted that, “As the expedition’s chief scientist, I assigned the analysis of the samples to independent state-of-the-art laboratories and agnostic world experts in geology and meteoritic studies.”

Avi needs no introduction, so I will not supply one. However, some readers might wonder about the adjectives “independent” and “agnostic” if they have not read the first 40 reports (or the 1,000 articles and essays) by Avi. Surely, we want the best of Homo sapiens to analyze potentially alien-revealing interstellar spherules. But by qualifying “state-of-the-art laboratories” with independent, and “world experts” with agnostic, Avi is sending an unsubtle message that he trusts results from state-of-the-art laboratories or judgments of world experts only when they are free from external influence or preconceived notions.

Unless one has been living in an alien world, it would have been hard to remain completely ignorant of the UFO (unidentified flying object)—now UAP (unidentified aerial phenomena)—controversy or conspiracy. The possibility of the existence of intelligent beings beyond Earth fascinates and frightens many of us concomitantly. When a subject piques the curiosity of Homo sapiens as a whole, and when no opinion can be ruled out mathematically, the ill effects of human creativity are on full display.

I, for one, have troubled myself with two antithetical thoughts, especially when reading Avi’s writing. As a statistician, I work hard to warn others to avoid confirmation bias—it is just a bit too convenient to declare one’s own existence as the only possible existence in the cosmos. Or in the words of Avi, we need to adopt the principle of “cosmic modesty,” to consider that life is likely to exist elsewhere, and hence “we should search for it in all of its possible forms” (Loeb, 2017).

As a pragmatic Bayesian statistician, I also have struggled to place a non-trivial prior probability on something whose existence has never been demonstrated scientifically. (My apologies to all dogmatic Bayesians.) Avi’s concern about a non-agnostic “world expert” is well founded here; if my prior is so low, then it would require extraordinary evidence from the data to bring my posterior probability to an appreciable level—that is, to change my opinions. Worse yet, if my prior probability is zero and I restrict myself within the Bayesian framework, then my posterior probability will remain zero. Hence, my opinions will never change regardless of the data. This of course should concern anyone, except perhaps those who desire to maintain the zero probability for whatever reasons or beliefs.
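For readers who want the trap in symbols, the arithmetic is elementary (standard Bayes’ rule, with \(H\) denoting the hypothesis of interest and \(D\) the data):

\[
P(H \mid D) \;=\; \frac{P(D \mid H)\,P(H)}{P(D \mid H)\,P(H) + P(D \mid \bar{H})\,P(\bar{H})}.
\]

If \(P(H) = 0\), the numerator vanishes and \(P(H \mid D) = 0\) no matter how strongly the data favor \(H\); and if \(P(H)\) is merely tiny, the likelihood ratio \(P(D \mid H)/P(D \mid \bar{H})\) must be enormous before the posterior becomes appreciable.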

Before a reader becomes distracted by thoughts of disqualifying the Bayesian approach as a scientific learning framework, or me as a statistician, let me state the obvious: there’s no need for interstellar travel to encounter such concerns. From virus transmission and mask effectiveness to climate change, we now find ourselves enmeshed in politically driven interpretations rather than scientifically grounded responses. Worse yet, even those who are convinced by scientific evidence in private sometimes resort to ideologically driven interpretations in public for various reasons, none of which can effectively persuade those of differing ideologies.

At the risk of being tautological: a society will become progressively divided as trusted voices of persuasion wane. Regardless of one’s stance on whether data scientists can or should become a primary source of trusted persuasion, I trust that the vast majority of us can agree on the importance of preventing data and data science from disproportionately serving those in positions of power, whether in government, the private sector, or the media.

In Data We Trust, Assuming the Data Are Trustworthy …

I said “the vast majority” not merely to be true to my profession (since virtually nothing is 100%), but because of reality. Jonathan Auerbach’s (2023) article, “Safeguarding Facts in an Era of Disinformation: The Case for Independently Monitoring the U.S. Statistical System,” reminds us of the uncomfortable truth of bad actors within political factions, special interest groups, and media. As Auerbach candidly alerts us, attempts to interfere with federal statistics, as ancient as the system itself, now find their incentives amplified to new heights.

The unsettling rise in the politicization of U.S. federal statistics should be a grave concern for anyone who values the adjective united in the United States. Such politicization propels a vicious cycle: more politicization begets less public trust, which, in turn, offers greater incentive—and eases the path—for those with nefarious intents to sow seeds of doubt. Auerbach (2023) underscores this, reminding us that creating “doubt in the accuracy of the data, doubt in the value of making them accessible, or doubt in the value of collecting them in the first place” can have a similar effect as tampering with the statistics themselves. Indeed, we have all borne witness to how political parties and media outlets exploit the label of ‘disinformation’ (or ‘misinformation’) to delegitimize information they wish to conceal from public scrutiny, irrespective of whether the act of labeling is itself a disinformation tactic.

To combat the escalating threats to the integrity of federal statistics, the American Statistical Association and George Mason University are leading a new effort to monitor the health of the U.S. statistical system independently and proactively, as outlined in Auerbach’s (2023) article. This task is undeniably Herculean, particularly when it stands to inconvenience those in power with questionable motives. However, the challenges may proliferate when these efforts intersect with the actions of so-called ‘good actors’: individuals who consider manipulating data and public opinion justifiable if it serves what they perceive as a noble cause, overlooking the dangerous precedent such an action sets. Eroding the independence and, subsequently, the trustworthiness of any statistical system carries a hefty price, particularly in the long term. An undermined system is perpetually susceptible to influence or manipulation by those with vested interests, most of whom stand ready to furnish ‘good causes’ to vindicate their actions.

Ensuring the full independence of a governmental statistical system therefore remains an evergreen aspiration, necessitating continuous and renewed efforts from each successive generation. We should thus harbor deep gratitude for those engaged in these and related endeavors. As Teresa Sullivan (2020) eloquently posits in her article, “social statistics underpin our democracy (and republic).” However, without trustworthy federal statistics, this underpinning threatens to morph into an undermining force, eroding the very foundations of our society.

Data Science Without Borders, Assuming We Can Plug in Data Anywhere …

The article “Critical Data Challenges in Measuring the Performance of Sustainable Development Goals: Solutions and the Role of Big-Data Analytics,” penned by Mehrbakhsh Nilashi, Ooi Keng Boon, Garry Tan, Binshan Lin, and Rabab Abumalloh (2023), vividly reminds us of the crucial role that reliable data play in the intricate framework of our global society, transcending the boundaries of individual nations and cultures. Halfway through the 15-year sustainable development plan initiated in 2015 by 193 United Nations member states, we are in the midst of a monumental quest to accomplish 17 Sustainable Development Goals (SDGs) and their corresponding 169 targets by 2030. These goals were crafted to be specific, time-bound, and measurable—a noble and well-intentioned endeavor. However, the colossal task of reliably and consistently tracking the progress (or lack thereof) of such a vast number of targets, measured through 230 indicators, across our global human ecosystem—a maze of starkly disparate political, social, and cultural systems—presents a level of challenge that is nothing short of unprecedented.

Crucially, the availability of reliable data is not merely beneficial for assessment but integral to the success or failure of the SDGs. After all, to make impactful and strategic decisions, countries require data they can trust. As the authors astutely warn, “Without data—especially high-quality data—sustainable development is doomed to falter” (Nilashi et al., 2023). The prospect of relying on “big data” is fraught with risks, given that much of such data is collated without the necessary design considerations or quality control measures needed for valid learning and inference.

For this reason, I particularly commend the authors’ thorough SWOT analysis, as well as their repeated emphasis on and discussion of data quality throughout the article. The grandest of challenges necessitate the boldest of approaches, and this article offers many thought-provoking ideas and inspirations for exploration and action. Moreover, it should be an enlightening read for anyone—especially students—looking to delve into the intricacies of the data science ecosystem and to gain a deeper appreciation of its grand complexity and its boundless opportunities for exploration.

My own explorations included attending, shortly before the COVID-19 pandemic, a workshop organized by the United Nations on building a global data commons. In retrospect, the timing of this gathering was both opportune and untimely. Opportune, because the pandemic glaringly exposed our worldwide lack of universal protocols and standards for data sharing; the struggle even to share data between neighboring towns in order to formulate effective joint strategies against a virus—which clearly ignores geographical and political boundaries—highlighted these deficiencies. Yet it was also untimely, as no suitable infrastructure was in place when the pandemic cast its global shadow, and to my knowledge, we still lack a universally accepted ‘data converter’ for navigating our global digital landscape.

The immense complexity of creating a global data commons was immediately apparent: the brainstorming session I joined devoted more time to determining an appropriate type of data for prototype testing than to establishing the guiding principles for constructing the prototype itself. Naturally, several kinds of critical societal data were quickly dismissed precisely because of their high significance—it is easy to predict the reluctance, or outright refusal, of many countries and organizations to participate in testing if the data involved were financial, medical, political, social, or technological. This led to a deep dive in search of a type of data that is useful to all but proprietary to none, until someone looked out of a window. Weather data—we found it!

Before returning from my sabbatical, I had recounted this anecdote numerous times, highlighting it as an instance of finding a unicorn. Readers can probably then imagine the (ChatGPT-free) hallucination I experienced when, upon returning, I was informed of a forthcoming article titled “Looking at the Weather: The Politics of Meteorological Data” by Elaine LaFay (2023). My first non-hallucinating reaction was ‘Oh, of course, climate change.’ But the article is in the Mining the Past column, and the historical tale it supplies is from long before global warming was even a term. I will offer no overview here, so as not to ruin readers’ joy of puzzle solving, other than a reading preparation: when it rains, it can pour.

We Are All Responsible for Data Quality Degradation

The degradation of our data collection processes is as enduring as our urge to accumulate data, echoing Auerbach’s (2023) reminder that interference with federal statistics is as old as the system itself. Human actions (or inactions) have primarily been the source of such degradation, although they are not always politically motivated or malevolent. This only exacerbates the issue, however, as there are significantly fewer safeguards against such influences.

For instance, each refusal to take part in a probabilistic survey adds to this erosion. As a professional statistician specializing in identifying and reducing non-response bias, I can attest to this. Yet, I am as culpable as anyone. We are all perpetually preoccupied. (Should any reader find this claim offensive, congratulations are in order, but also a reminder that it would take 14 months to read everything HDSR has published if one reads one article daily.)

Given the escalating need for data collection and, consequently, the increasing number of survey requests, Michael Bailey’s (2023) assertion in his article, “A New Paradigm for Polling,” that “random sampling is, for all practical purposes, dead” is understandable. The aspect declared ‘dead’ is not the sampling itself but its randomness, the foundation for most theories and methods of statistical analysis. My refusal to respond to a survey is never the outcome of flipping a coin or rolling dice. Regardless of how randomly I was selected to participate, my non-random refusal shatters the elegant theory accompanying the random invitation, particularly when my busyness—often the reason for my refusal—is indicative of the very data values the survey intends to collect. For instance, a survey evaluating sleep patterns would portray a population that is better rested than it actually is when my sleep-deprived peers and I are too sleepy to respond.
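To see this mechanism numerically, here is a minimal simulation sketch of my own (not from Bailey’s article), assuming a hypothetical population of nightly sleep hours and a response probability that rises with hours slept:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: nightly sleep hours, roughly normal.
n = 100_000
sleep_hours = rng.normal(loc=6.5, scale=1.5, size=n)

# Hypothetical response model: the sleep-deprived are less likely to respond.
# The probability of responding rises with hours slept (a logistic link).
response_prob = 1 / (1 + np.exp(-(sleep_hours - 7.0)))
responded = rng.random(n) < response_prob

print(f"True population mean:   {sleep_hours.mean():.2f} hours")
print(f"Mean among respondents: {sleep_hours[responded].mean():.2f} hours")
print(f"Response rate:          {responded.mean():.1%}")
```

Under these assumed numbers, every individual is invited at random, yet the respondents’ average sleep comes out noticeably higher than the true population average—the survey ‘sees’ a better-rested population simply because the sleep-deprived select themselves out.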

I extend my gratitude to Michael Bailey for shedding light on my work on quantifying such data degradation when estimating population averages, like election voting shares. But above all, I appreciate his effort in raising awareness about the severe data quality issues in sample surveys with his forthcoming book, Polling at a Crossroads: Rethinking Modern Survey Research (Bailey, 2024), which serves as the basis for his current (2023) article. I am immensely thankful to all the reviewers and discussants, ranging from ecology to physics, for their critical review and diverse comments, which also serve as a counterbalance against my own bias, given my vested interest in this research field.

Data Science Quality Control at Every Step

Quality data is the cornerstone of quality data science, but it’s just the beginning. We often espouse the mantra ‘let the data speak’ to convey an image of unbiased, unaltered truth. However, data rarely articulate themselves clearly or even coherently without interpretation. Thus, the individual or team responsible for this interpretation or translation becomes pivotal. We’re familiar with the concept of something being ‘lost (or gained) in translation,’ and countless individuals have been either beneficiaries or casualties of such a process, much like participants in forex trading.

Translating data into knowledge or decisions is a lengthy process, encompassing many steps (and typically multiple individuals or teams). Some tasks, such as data entry and coding, are often viewed as mechanical, while others, like data analysis and interpretation, are deemed more analytical. Despite requiring different skills, each step is crucial in ensuring the overall quality of a data science project. A windshield wiper might seem insignificant compared to an automobile’s engineering, yet a malfunctioning wiper can render the vehicle useless in snow or even cause a fatal accident.

For this reason, I invite readers to delve into the insightful article by Randall Pruim, Maria-Cristiana Gîrjău, and Nicholas J. Horton (2023) titled “Fostering Better Coding Practices for Data Scientists.” This is a good read for anyone passionate about quality data science because it emphasizes the importance of principled coding in quality data science practice. The authors aren’t being dramatic, but are instead shedding light on an aspect that’s generally overlooked, except by those who’ve learned its significance the hard way.

During my PhD studies, I spent an entire week debugging why my results were overly optimistic, only to discover that it was due to my Chinglish. In one instance I had mistakenly typed “sigama” instead of the variable “sigma,” which I used to pronounce (and occasionally still do) as “See-ga-ma.” My ad hoc way of setting variables then led the program to treat “sigma” as zero, which of course was precisely wrong (pun intended to test readers’ “statistionarity”). My then-lack of debugging training cost me a week, but it imparted a lesson I’ll carry forever.
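To illustrate the failure mode in modern terms—a purely hypothetical sketch, not my original program—consider a parameter lookup that silently substitutes a default when a name is misspelled:

```python
# Hypothetical illustration: a misspelled parameter name silently yields sigma = 0.
params = {"mu": 1.0, "sigama": 2.5}  # the intended key "sigma" was typed as "sigama"

sigma = params.get("sigma", 0.0)     # no error is raised; sigma quietly becomes 0.0
print(sigma)                         # prints 0.0 -- downstream results are "precisely wrong"
```

Insisting on a lookup that fails loudly (e.g., params["sigma"], which raises a KeyError when the key is absent) is the kind of small coding habit that would have surfaced the typo immediately.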

Indeed, with the correct training, such as the one delineated in Pruim et al.’s (2023) article, we can circumvent such harsh lessons or, at the very least, avoid a week of debugging due to an extra ‘a.’ More critically, principled coding is a cornerstone of ensuring reproducibility, a fundamental stepping stone towards high-quality, reliable data science. Would anyone trust a study if its results could not be reproduced when following the exact procedures outlined by the authors?

In collaboration with the National Academies of Sciences, Engineering, and Medicine (NASEM), HDSR previously published a special theme in 2020 titled “Reproducibility and Replicability.” Assuring reproducibility and replicability requires ongoing efforts. In light of this, I am indebted to Lars Vilhuber for undertaking the creation of a new column for HDSR: Reinforcing Reproducibility and Replicability. This column premieres in the current issue with a special theme comprising eight insightful essays (Ball, 2023; Butler, 2023; Guimarães, 2023; Hoynes, 2023; MacDonald, 2023; Mendez-Carbajo & Dellachiesa, 2023; Salmon, 2023; Whited, 2023). Interspersed with highlights from these broad-ranging topics, the special theme editorial by Lars Vilhuber, Ian Schmutte, Aleksandr Michuda, and Marie Connolly (2023) details the motivation and future plans for this column. It aims to provide a platform for interdisciplinary dialogue on enhancing the reliability and trustworthiness of (data) scientific research, spanning topics from data provenance documentation and replication packages to the education of researchers and students.

This focus on education resonates with a call to action by Marcia McNutt, President of NASEM, in her article “Self-Correction by Design” from the 2020 special theme. She compellingly asserted that the scientific community must “adopt an enterprise-wide approach to self-correction that is built into the design of science.” A critical first step involves educating students “to perform and document reproducible research, sharing information openly, validating work in new ways that matter, creating tools to make self-correction easy and natural, and fundamentally shifting the culture of science to honor rigor.” Realizing McNutt’s vision may span generations, given the lack of such training in our current educational systems worldwide. However, it is indispensable because, without cultivating a culture that respects research rigor from the ground up, preserving the trustworthiness of (data) science will perpetually be a strenuous endeavor.

Trust in any system originates from within, requiring its creators and stakeholders to believe in and invest in its reliability, motivated by personal conviction and perseverance. This notion resonates with the saying that integrity is doing the right thing even when no one is watching. Quality construction implies attentiveness to every step, necessitating every team member to embody the mindset of a quality builder or, at a minimum, a quality overseer. While it’s an ambitious demand, given humanity’s propensity to take shortcuts, setting high standards is vital to ensure an acceptable quality level, whether we’re building a trusted brand or reliable (data) science.

Is Agnostic and Independent Data Science Even Possible?

At first glance, the answer may seem a resounding ‘no.’ Data scientists are human, after all, dependent on pre-existing beliefs to function and intricate social interactions to thrive. However, findings drawn from agnostic and independent data science inquiries hold the greatest potential to be universally convincing, almost axiomatically. As such, this level of objectivity should always be our aspiration, especially for those committed to maintaining data science as a scientific venture, a tool for compelling reasoning, and ultimately, a catalyst for societal harmony.

Embracing this aspiration does not necessarily obligate data scientists to maintain a consistently neutral stance, a condition that is neither feasible nor beneficial. As individuals, we retain the privilege of personal thought and belief, a freedom that warrants universal respect. However, when presenting and interpreting data science findings that are shaped by our personal value systems or ideologies, we should precede our views with a disclaimer clarifying that they are expressed in a personal capacity and do not constitute our professional opinions, akin to the disclaimers given by employees (e.g., government officials) when they present or publish content that is not on behalf of their employers. Implementing such a practice is logistically trivial, but adopting it will necessitate a collective effort to foster a cultural change.

Culture building or changing requires time, determination, and education. In my conversation with Mercè Crosas (Crosas & Meng, 2023), part of HDSR’s ‘Conversation with Leaders’ series, we envisioned a more ambitious plan to nurture future leaders in data science and society. This plan involves young talents sequentially immersing themselves in academia, industry, and government, perhaps for periods of five years each. This unprecedented journey aims to amalgamate insights from these diverse sectors, thereby empowering leaders to adopt a comprehensive, empathetic if not agnostic, and adaptable problem-solving approach.

Crosas’s remarkable career, marked by leadership roles in all three sectors, has undoubtedly benefited from her diverse experience. It is hard not to be inspired by her open-mindedness and empathy after engaging in conversation with her. Would we all be better off and happier if our scientific communities and societies at large had more leaders of her caliber and empathy?

(For anyone interested in developing and investing in such a multi-sector career path program, we encourage you to reach out to us!)


Disclosure Statement

Xiao-Li Meng has no financial or non-financial disclosures to share for this editorial.


References

Auerbach, J. (2023). Safeguarding facts in an era of disinformation: The case for independently monitoring the U.S. statistical system. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.5cc8971c

Bailey, M. (2023). A new paradigm for polling. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.9898eede

Bailey, M. (2024). Polling at a crossroads: Rethinking modern survey research. Cambridge University Press.

Ball, R. (2023). “Yes We Can!”: A practical approach to teaching reproducibility to undergraduates. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.9e002f7b

Butler, C. R. (2023). Publishing replication packages: Insights from the Federal Reserve Bank of Kansas City. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.aba61304

Crosas, M., & Meng, X.-L. (2023). A conversation with Mercè Crosas. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.712d4124

Guimarães, P. (2023). Reproducibility with confidential data: The experience of BPLIM. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.54a00239

Hoynes, H. (2023). Reproducibility in economics: Status and update. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.80a1b88b

LaFay, E. (2023). Looking at the weather: The politics of meteorological data. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.dab5e00b

Loeb, A. (2017, June 28). The case for cosmic modesty. Scientific American. https://blogs.scientificamerican.com/observations/the-case-for-cosmic-modesty/

Loeb, A. (2023, July 21). The count is up to 360 spherules! [Diary of an Interstellar Voyage, Report 41]. Medium. https://avi-loeb.medium.com/the-count-is-up-to-360-spherules-314b23f9333a

MacDonald, G. (2023). Open data and code at the Urban Institute. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.a631dfc5

McNutt, M. (2020). Self-correction by design. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.32432837

Mendez-Carbajo, D., & Dellachiesa, A. (2023). Data citations and reproducibility in the undergraduate curriculum. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.c2835391

Nilashi, M., Keng Boon, O., Tan, G., Lin, B., & Abumalloh, R. (2023). Critical data challenges in measuring the performance of sustainable development goals: Solutions and the role of big-data analytics. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.545db2cf

Pruim, R., Gîrjău, M.-C., & Horton, N. J. (2023). Fostering better coding practices for data scientists. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.97c9f60f

Salmon, T. C. (2023). The case for data archives at journals. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.db2a2554

Sullivan, T. A. (2020). Coming to our census: How social statistics underpin our democracy (and republic). Harvard Data Science Review, 2(1). https://doi.org/10.1162/99608f92.c871f9e0

Vilhuber, L., Schmutte, I., Michuda, A., & Connolly, M. (2023). Reinforcing reproducibility and replicability: An introduction. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.9ba2bd43

Whited, T. (2023). Costs and benefits of reproducibility in finance and economics. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.63de8e58


©2023 Xiao-Li Meng. This editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the editorial.
