We would like to thank all the discussants for their thought-provoking discussions, which have deepened and broadened what we offered in our article. We are pleased that numerous research areas and strategies for advancing data science discussed in our article were emphasized, and that additional areas and strategies were suggested in these discussions. Our rejoinder echoes and synthesizes some of these points.
Yu emphasized the importance of solving real world problems using data science. Williamson made an excellent point that the value of data must be understood with the purpose and context in mind in order to improve the “data journeys.” Lo highlighted the importance of getting familiar with a domain to identify research opportunities and applying domain knowledge in data analytics. Zhao affirmed that data science is interdisciplinary and needs synergistic efforts from statistics, computer science, mathematics, informatics and domain science. We certainly share these sentiments and agree that data science research must embrace real-world problems and become impact-driven within the right context, as we stressed in our article.
Indeed, data science is empowered by research and development in methods, algorithms, theory, and tools that are deeply rooted in real world problems. Data science thrives by integrating domain knowledge in its research and by sharing generalizable principles learnt from diverse applications. It is imperative for data scientists to collaborate with domain scientists to ensure that the whole is greater than the sum of its parts, by leveraging complementary skills and by becoming sufficiently knowledgeable about the domain science of interest. We are glad that this point is shared by many discussants, including the early career discussants Frost, Goeva, Seaton, Stoudt, and Trisovic (FGSST). We echo Mukherjee and Richardson’s call for a “deeper embedding” of data science into other fields, including the early stages of study designs. This is arguably the most effective way for data science to play a central role in solving real-world problems.
We agree with several discussants on the importance of fair, transparent, and trustworthy data science. First, the need to make data credible and reproducible cannot be overemphasized, nor can the need to separate information from misinformation. Williamson highlighted the trustworthiness of the data themselves, and Lo stated the importance of data knowledge and deep understanding of data for proper analysis and interpretation. The research community and government agencies, such as the National Institutes of Health in the United States, have called for the FAIR data principles (Findability, Accessibility, Interoperability, and Reusability). In addition, we second Zhao’s point that addressing data-related ethical and privacy issues is imperative.
Second, we need to further our commitment to advancing our understanding of trust and mistrust in science, broadly construed, in the context of domain science and real-world problems, and to developing statistical and computational tools and methods for making science reproducible and replicable. To this end, Yu proposed the Predictability, Computability, and Stability (PCS) framework for veridical data science. FGSST pointed out the importance of building trust with fair and transparent science so that the knowledge gained can be effectively communicated and justified not only to scientists but also to the general public.
Both statistics and computer science need mathematical foundations. The evolution of statistics and computer science has greatly benefited from mathematics through their journeys of becoming vibrant disciplines in recent history. Kolda made a convincing case that mathematics is needed for deeper understanding and better development of many research problems in data science. Mathematicians are part of the data science community and can bring more ideas, tools and rigor to data science. More mathematicians are encouraged to join the march and collaborate with others in the data science community to solve big problems and make future scientific breakthroughs.
Lo provided insightful industry perspectives on data science, especially on descriptive, predictive and prescriptive analyses. The value of academic-industry collaborations cannot be overemphasized in the digital era. Indeed, with data becoming increasingly central in decision-making and the rapid growth of data science exhilarating to many, we have an unprecedented opportunity for collaboration between academia and industry to realize the full potential of data science. The last few years have witnessed increasing research activity in industry and broader and deeper partnerships between academia and industry. Data science research is where researchers working on modeling and building data science ecosystems in business and science, for instance, can talk to each other, learn from each other, and advance together.
We are pleased to see several discussants bring up the importance of communication and similar soft skills. Given the multi-disciplinary nature of data science, it is imperative to develop strong skills in communication and the presentation of scientific findings to scientists and stakeholders in different fields. Other soft skills that help identify and formulate important real-world problems in a data science framework are also essential. We are pleased that FGSST called for more opportunities to communicate with broader audiences and for increased support and recognition for outreach efforts to enhance the impact of data science on society. We second Mukherjee and Richardson’s call for greater efforts to share knowledge effectively and to establish fora and mechanisms for improving the exchange and dissemination of knowledge across fields.
In view of the transdisciplinary nature of data science, the early career discussants FGSST raised an important issue concerning the promotion criteria suitable for the fields of data science. We would like to emphasize the importance of being open-minded and forward-looking. For example, for statisticians and biostatisticians, besides research papers in traditional statistical and biostatistical journals, impactful methodological and applied publications in other journals, such as machine learning and domain science journals, should be valued. So should influential tool and software development that promotes the wider use of statistical and computational methods. In genetics and genomics, for example, method papers and open access software published in genetics and genomics journals are more likely to be used by genetics and genomics researchers and to make an impact on real-world practices and discoveries. We echo FGSST’s emphasis on creating an inclusive environment where early career researchers with diverse backgrounds and skill sets are appreciated and supported to reach their full potential.
We are glad that Agarwal focused her discussion on data science education. Training the next generation of data scientists is a top priority for all. Agarwal emphasized the need to train junior data scientists both in a core of transdisciplinary data science “principles” across mathematics, computer science and statistics, and in a broad set of tools for the “practice” of data science in diverse domains. We certainly agree with these points. Following the earlier discussions, we would also like to reiterate the importance of training in problem formulation, problem solving, communication, and presentation skills in an interdisciplinary environment, as well as a positive attitude and a high aptitude for learning relevant domain science.
We would like to conclude by highlighting Zhao’s discussion of INSIGHTFUL data science; the ten letters of the word INSIGHTFUL represent ten important aspects of data science. As statisticians, we see a bright future for statistics and data science as we recognize the critical importance of embracing new data challenges, deeply committing ourselves to relevant fields, and working closely with other researchers and specialists in academia, industry, or government. Research and education in statistics and data science have tremendous value. We call on statisticians, computer scientists, and more generally data scientists to engage more broadly and deeply, to join forces, and to contribute more than ever to science and our global society.
Xuming He and Xihong Lin have no financial or non-financial disclosures to share for this article.
©2020 Xuming He and Xihong Lin. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.