I am grateful to the discussants for their thoughtful and philosophical comments on both my remarks about data science as a discipline and my list of 10 research challenge areas. The purpose of my writing this article was to spark exactly this kind of discussion.
The most gratifying remark comes from the early-career researchers Shuang Frost, Aleksandrina Goeva, William Seaton, Sara Stoudt, and Ana Trisovic (FGSST), who write that the list of challenges enumerated in my article and in He and Lin’s “gives us enough opportunity to last our whole careers.” Case closed.
Complementing my list, Bin Yu and Victor Lo raise additional challenges. Yu points out that stability should be a metric throughout the data science life cycle, not just during analysis, and that even with increased automation, human judgment is still needed at every stage of the life cycle. Lo describes the soft skills practicing data scientists need and the benefits data science methods can bring to the economic sciences. Sach Mukherjee and Sylvia Richardson, as well as Lo, address the role of domain knowledge in data science. I especially like the discussion by Mukherjee and Richardson on “shallow” versus “deep” embedding of data science in other domains. I would imagine that Tamara Kolda, who argues for mathematics as a language for tackling data science challenges, would also agree that a “deep” embedding of data science in other domains would be transformative for those domains.
Insightful comments by all discussants reinforced, enhanced, and elaborated on many of the 10 challenges I raised. Check out Hongyu Zhao’s INSIGHTFUL (pun intended) acronym. Kolda listed many topics in mathematics that serve as a scaffold for all sciences, not just data science. I would add one to her list: mathematical logic. Logic is foundational in computer science. Granted, it has yet to show its prowess for data science, but especially for trustworthy AI, I believe there is great potential.
Indeed, a theme that struck a nerve with many discussants is the subject of trust. Although I couched it in the context of AI, trust is important more generally for data science, starting with ‘Can we trust the data?’ and ‘Why should we?’ Trust in the methods, models, and tools of data science is paramount for the field and for the industry to flourish.
In terms of nurturing talent in the field, Shivani Agarwal, FGSST, and Mukherjee and Richardson all point out the importance of a new kind of education and training for data science, which transcends existing disciplinary boundaries and thus requires learning a multitude of skills and tools. This inherently multidisciplinary nature of data science should not be an obstacle, but rather an inspiration for young people to enter the field. Who doesn’t want to be there in the beginning and get to shape a new field? More pragmatically, but no less importantly, Agarwal reminds us that a field cannot prosper without sustained long-term investments in research, e.g., through federal funding agencies like the National Science Foundation (NSF).
Edifying to me were the quotes from philosophers (Plato and Hume, in Yu’s piece; Confucius, in Kolda’s opening quote) and scientists from centuries ago (Hooke, Newton and his interaction with Flamsteed, in Robert Williamson’s piece). These philosophers and scientists put into historical perspective debates on questions like ‘What is data?’ let alone ‘What is data science?’
I point out the discussion by Williamson and FGSST on “What is the value of data?” because I almost included that question as a candidate “deep” question for data science in my article. Today, data is interpreted with respect to some purpose or context, and with respect to an end user. This perspective suggests that ‘value’ is a qualified, or even a subjective, measure determined by the user who seeks to extract some value from data. For example, to a scientist, value is new knowledge. To a business, value accrues to its bottom line. But, more fundamentally, is there a way to answer the question “What is the value of data?” without interpreting it with respect to a given dataset for a particular context of use? Can data be assigned some intrinsic value? Williamson reminds us that these are all very old questions whose answers we still ponder. Given the scientific and technological advances of the intervening centuries, especially the digitization of data in the 21st century, they are worth revisiting.
This leads me to the most provocative comment of all the discussants. After his discussion on the value of data, Williamson ends with a punchline: “Data is a process, not a thing.” To back his proposal, he cites recent papers whose authors come from the humanities, philosophy, medicine, science, and computing. (This viewpoint especially gave me pause because he cites my HDSR paper “The Data Life Cycle” (2019).) Viewing data as a process suggests that data is a function of type T1 → T2, over the more elemental types T1 and T2. What could those types T1 and T2 be? Do we not circle back to the philosophical question of what is at the bottom (fact, information, data, or something else)? I prefer a more prosaic view in which data is a thing over which we operate; the data life cycle is a process that transforms data (values of T1) into actions (values of T2). I suspect we may still be pondering these questions in another hundred years.
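To make the two readings of T1 → T2 concrete, here is a minimal sketch in Python type hints. Everything in it is an illustrative assumption rather than anything proposed by Williamson or in the original article: the alias names, the choice of temperature readings for T1, strings for T2, and the threshold of 30.0 are all hypothetical.

```python
from typing import Callable, Sequence, TypeVar

T1 = TypeVar("T1")  # the more elemental input type
T2 = TypeVar("T2")  # the more elemental output type

# Reading 1 (data as a process): the datum itself is a function T1 -> T2.
DataAsProcess = Callable[[T1], T2]

# Reading 2 (data as a thing): data are plain values of T1, and the
# data life cycle is the function T1 -> T2 that turns them into actions.
# The alias is deliberately identical to the one above; only the thing
# we choose to call "data" differs between the two readings.
LifeCycle = Callable[[T1], T2]


def life_cycle(readings: Sequence[float]) -> str:
    """Hypothetical life cycle: map temperature readings to an action."""
    mean = sum(readings) / len(readings)
    # The 30.0 threshold is an arbitrary illustrative choice.
    return "raise alert" if mean > 30.0 else "no action"


readings = [29.5, 31.0, 32.5]   # data: a value of T1 (here, floats)
action = life_cycle(readings)   # action: a value of T2 (here, a string)
```

Under the second reading, `readings` is the data and `life_cycle` is the process; under the first, the function itself would be what we call data, which leaves open what its argument and result types are.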
Just to ground us again: the more immediate scientific and societal research challenges lie right in front of our noses. We need to make progress on them, producing concrete results, to advance the state of the art in data science.
The author has nothing to disclose.
I am deeply grateful to Xiao-Li Meng for providing HDSR space for kicking off this discussion and for enlisting members of the data science community to weigh in on what they see as challenges for us to pursue as we help define the field. He worked hard to get commentaries from biostatisticians, computer scientists, mathematicians, and statisticians; academics and industry leaders; researchers and educators; and junior and senior scholars. HDSR is the perfect forum for initiating a discussion on a research agenda for data science. The breadth and depth of comments from the discussants are illuminating.
Wing, J.M. (2019). The Data Life Cycle, Harvard Data Science Review, 1(1).
This rejoinder is © 2020 by the author(s). The article is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the author identified above.