In his article titled “Data Science at the Singularity,” David Donoho (2024) discusses the emergence of two pivotal trends influenced by the evolving concept of the data science singularity. This concept embodies the substantial progress and widespread acceptance of three fundamental data science principles: data sharing, code sharing, and challenge sharing. Donoho’s analysis sheds light on the transformative implications of these trends for the field of data science.
Emergence of “frictionless research exchange” (FRX): This marks a new era in computation-driven research characterized by seamless data sharing, code sharing, and competitive challenges. It fosters iterative experimental modification and improvement, leading to remarkable advancements, particularly in fields like empirical machine learning.
“Frictionless reproducibility” (FR): The maturation of data science initiatives, particularly emphasizing the three key principles, has facilitated frictionless reproducibility in computational research. This represents a departure from previous practices of in-principle reproducibility, offering a more streamlined and trustworthy method for verifying and expanding upon others’ research findings. Donoho suggests that FR is now everywhere, signifying the acceleration of data science toward singularity.
If we analogize the evolution of multidisciplinary data science to a journey around the globe, the data science singularity envisioned by Donoho would serve as the destination of this journey. This metaphor suggests that the concept of the data science singularity may be approached from two perspectives: a relatively static attainable viewpoint and an inspiring moving target viewpoint.
Donoho’s perspective leans toward the static side, characterizing the data science singularity as the convergence of perceived efforts in digitization, research code sharing, and adapting challenge problems. He views data science as maturing, with transformative influence and approaching singularity, particularly focusing on research and separately considering the impact of AI. However, this leaves room for debate and expansion, concerning how a post–data science singularity world may look and how the evolution of AI may reshape the data science singularity as it stands now.
Given data science’s multidisciplinary nature, it is crucial to critically assess the character of a singularity, the scope of data science, the landscape we are navigating, and the impact of new technological developments, as all of these factors can affect the interpretation of the data science singularity.
The sharing of specific data, code, and ideas is typically constrained by relevance and bounded within specific communities; Donoho, for instance, focuses primarily on the research community. It is therefore reasonable to conclude that the data science singularity can vary and be achieved in different situations and fields. From this perspective, we have indeed reached a data science singularity many times before in both business and research domains, giving us empirical experience of conditions not only before but also after a data science singularity, each largely relevant to a certain community.
For instance, in the late 19th century, the rise of department stores and mass retailers led to a pioneering initiative in data, usage (code), and challenge sharing. By sharing customer information with local credit bureaus, businesses aimed to attract consumers, manage repayment, and mitigate risks collaboratively. This initiative laid the groundwork for the modern credit scoring system, which stands as one of the most impactful data science applications of our era and likely exceeded its originators’ wildest expectations. To our ancestors, the notion of data, analysis, and action being shared across businesses beyond retail worldwide via today’s highly specialized credit bureaus would represent the world after the data science singularity of their imagination and definition. In this sense, a relatively static singularity defined by our predecessors was achievable and has indeed been attained at a certain point in history. A general description of how the modern credit scoring system works can be found in Pritchard (2022).
Looking beyond the credit scoring industry, similar mission-specific data, code, and idea sharing, or data science singularities similar to the one Donoho is excited about in research, have also been achieved in various fields through business, government, and/or research initiatives. For instance, a leading global data and analytics company, Nielsen, developed a specialty in media data collection and audience measurement instead of consumer credit scores. Another American company, Gartner, built a technological research and consulting empire leveraging its strengths in broad and timely business benchmarking. As a well-known federal government agency, the Centers for Disease Control and Prevention has coordinated numerous cross-disciplinary or public domain data initiatives for research and surveillance, one of which is NHANES (National Health and Nutrition Examination Survey), which supports the study and understanding of national health and nutritional status. These nationally or internationally scoped data, analytics, and idea sharing initiatives play a role in their respective domains similar to that of Microsoft’s GitHub in software development and of the credit bureaus in the national consumer credit scoring system, and they have met some preset singularity visions. They have all facilitated and enhanced frictionless solutions to a high level relative to earlier eras.
Donoho insightfully depicts a research world underpinned by the emergence of FRX and FR as data science approaches yet another singularity in its development journey. By framing the concept of the data science singularity, Donoho invites us to contemplate the transformative shifts occurring in the field. As researchers of our generation, we are fortunate to witness and experience the unfolding of this paradigm shift, which promises to redefine how we conduct and disseminate scientific research. Donoho’s comprehensive exploration of this concept provides valuable insights into the evolution of data science and its potential implications for future research endeavors.
If a journey is not meant to end, like the evolution of data science, new destinations will emerge one after another as the road extends ahead. Under this perspective, the destination, or singularity, becomes a moving target with evolving characteristics that inspire us to continuously expand the journey to terrains that have not been recognized before.
This notion is reflected in all the aforementioned data science singularity achievements in various fields. Due to the continuous expansion of their knowledge bases and exploration scope, unforeseen challenges emerge. The focus of explorers is consistently drawn to new manifestations of singularity.
Take the modern credit scoring system as an example. While it allows businesses and individual consumers to obtain credit scores, identity checks, and other available services instantaneously, or frictionlessly, the industry has been continuously challenged by emerging issues such as broadened data and quality demands, exploding big data availability, heightened privacy regulations, and evolving ethical standards, and it keeps striving to reach another data science–related singularity to its own specifications. The implication of this example is that looking beyond the approaching destination, or singularity, becomes a best practice for explorers.
From this viewpoint, while we cheer the approach of the data science singularity in research that Donoho has brought to our attention, signs of the next singularity are already looming in some ways, waiting for someone to discover and depict them, as Donoho did.
Considering the potential characteristics of the data science singularity beyond Donoho’s vision prompts us to contemplate the evolving landscape of the field. For instance, the recent proliferation of generative AI (GAI) applications illustrates the technology’s revolutionary potential and introduces novel complexities to research and technology domains, including data science.
The integration of GAI with natural language processing and large language models (LLMs) has two notable effects:
It enables comprehensive analysis of unstructured data, particularly in the form of human language. This disruption not only challenges traditional methods of collecting and managing data in structured tables or databases but also opens a new realm of data sources captured or capturable in human languages, whether written or spoken, for analysis. GAI has made it possible to comprehensively analyze sentient information, such as tastes, smells, feelings, emotions, beliefs, cultures, and so on, which were previously in the blind spots of traditional analytics.
It challenges the statistical analysis philosophy of relying on sample data for population inference. Instead, GAI analyzes and characterizes a population with LLMs in order to make inferences on samples or prompts.
This foundational expansion not only redefines the scope of data, the types of code, and the nature of challenges but also demands novel scientific innovation to sustain and optimize its efficiency and effectiveness. Nvidia’s seemingly endless expansion of computing chip power is a costly technological solution to meet the needs of early-stage AI growth, but it is hardly sustainable financially, or optimal intellectually, for human societies.
Moreover, AI in general also requires scientific guidance in tackling unprecedented challenges such as ethical usage, societal impact, and risk prevention. If we believe that data science should play a role in addressing some, if not all, of these demands, the next data science singularity becomes conceivable as our understanding of GAI grows. As GAI is not a replacement for traditional analytics philosophy and techniques, this hypothetical new data science singularity will not replace the one envisioned by Donoho.
This notion leads us next to a brief discussion of the definition of data science.
Undoubtedly, the notion of the data science singularity hinges upon how we define the very essence of data science itself. Despite its growing prominence and undeniable influence, the precise definition and scope of data science remain the subject of ongoing debate and interpretation.
Meng (2019) offers a compelling perspective, conceptualizing data science as “An Artificial Ecosystem,” “a human construct that depends critically on computing advances.” This viewpoint underscores the interdisciplinary nature of data science and its pervasive impact across a diverse array of fields.
Various scholars have emphasized the multidisciplinary nature of data science in different ways. Wu (1997) advocates for renaming statistics as data science, emphasizing its potential to augment existing domains where statistics are applied. Cao (2017) provides a comprehensive formulation of data science as an interdisciplinary field encompassing statistics, informatics, computing, communication, sociology, and management. Striking a more modest tone, Weihs and Ickstadt (2018) frame data science as a scientific discipline influenced by various domains, highlighting the centrality of statistical methods in its foundational processes.
In my own work (Zhu, 2024), I propose a nuanced perspective on data science, portraying it as a distinct scientific discipline focused on facilitating the responsible and optimal growth of AI. This definition assigns a clear and tangible identity to data science, akin to Tesla’s positioning as a technology innovator within the automotive industry. It underscores data science’s role in addressing the emerging challenges and broad applications brought about by AI, echoing the sentiments expressed by Meng (2019). Drawing upon insights from established analytics sciences such as actuarial science, biostatistics, econometrics, and epidemiology, which have flourished by dedicating themselves to applied fields like insurance, pharmaceuticals, economics, and public health, I argue that a focused definition of data science centered on AI not only enhances its visibility to the public but is also essential for its sustainable growth and relevance.
Such a definition not only broadens our perspective on Donoho’s data science singularity but also acknowledges the interconnectedness of his envisioned data science and AI singularities. It invites us to explore the symbiotic relationship between these two transformative forces and its profound implications for the future of computational research.
In summary, Donoho has insightfully drawn our attention to the approaching data science singularity, characterized by data sharing, code sharing, and competitive challenges, and to its benefits in the form of frictionless research exchange and frictionless reproducibility. His article also gives us an opportunity to review some data science singularities that have been achieved in various fields and to look ahead to an emerging data science singularity that may be shaped by the rapid development of AI.
Zhiwei Zhu has no financial or non-financial disclosures to share for this article.
Cao, L. (2017). Data science: A comprehensive overview. ACM Computing Surveys, 50(3), Article 43. https://doi.org/10.1145/3076253
Donoho, D. (2024). Data science at the singularity. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.b91339ef
Meng, X.-L. (2019). Data science: An artificial ecosystem. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.ba20f892
Pritchard, J. (2022, March 28). How credit bureaus work and what they do for lenders. The Balance. https://www.thebalancemoney.com/how-credit-bureaus-work-315540
Weihs, C., & Ickstadt, K. (2018). Data science: The impact of statistics. International Journal of Data Science and Analytics, 6, 189–194. https://doi.org/10.1007/s41060-018-0102-5
Wu, C. F. J. (1997). Statistics = data science? [unpublished]. Georgia Institute of Technology. http://www2.isye.gatech.edu/~jeffwu/presentations/datascience.pdf
Zhu, Z. (2024, January 17). Data science: Connecting the past and pioneering the future of analytics. Analytics. https://pubsonline.informs.org/do/10.1287/LYTX.2024.01.05/full/
©2024 Zhiwei Zhu. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.