We want to thank the Harvard Data Science Review for organizing this stimulating and timely discussion. In reading the contributions from the various discussants, we perceive a great deal of excitement, which matches and further amplifies our own. It is clear that something new is afoot and that these conversations are only beginning.
We wish to note at the outset that we have little patience for a ‘one size fits all’ approach to the design of data science education. In particular, the Berkeley model, as broad as we have endeavored to make it, reflects our own local constraints, including the particular demographics of our campus, the likely career paths of our students, and the available resources. Different environments will require different configurations. See the discussion by Alan Fekete for more on this issue, along with useful pointers to additional commentary. As this commentary makes clear, in data science we have a collective opportunity for an experientially driven education on education.
Hopefully, the result of these experiments will be the crystallization of a few core concepts, so that the students emerging from various data science curricula will possess a lingua franca that allows them to interact and communicate broadly throughout their professional and personal lives. As befits the grand scope of data science, those core concepts will not necessarily be those of a single discipline, either an existing discipline or a new one. Indeed, we are not convinced that the phenomenon of data science reflects the emergence of a new academic discipline, as suggested by Michael Franklin and Dan Nicolae. Rather, we think that the right scope is akin to the emergence of the liberal arts in centuries past, with its core components of logic, grammar, and rhetoric, or the emergence of engineering, with its foundations in physical laws and principles of analysis and design.
In making this assertion, we are avoiding any attempt to define ‘data science,’ and we are not even hewing to the phrase ‘data science’ as the best phrase to capture the underlying historical trends. Indeed, as we suggested in the introductory paragraphs of our article, we see these trends as arising from a century’s worth of development of the computational, inferential, informational, and social sciences, writ large, and from the ongoing experience of humans who are living with computers in their midst. The current era does add something qualitatively new to the mix: the omnipresent availability of data and, most notably, the granular nature of that data. Whereas previous generations were in possession of data about general phenomena, we are now in possession of data about specific phenomena. For example, in genomics we have data about each individual gene, in astronomy we have data about each region of the sky, in medicine we have data about each tumor, and in social science we have data about individual humans. Our era is about ‘data’ and about ‘specific context.’ In that sense ‘data science’ is an appropriate and useful terminology for capturing current trends.
Grand intellectual traditions such as the liberal arts and engineering were not focused solely on academic organization and specific conceptual frameworks; they were also concerned with preparing students to engage with the outside world. Indeed, that was their primary concern. Data science meets this criterion for a grand intellectual tradition. As emphasized by Alison Gibbs and Nathan Taback in their discussion of experiential learning and problem-solving, and by Michael Franklin and Dan Nicolae with their focus on communication and ethics, data science education is mainly about preparing students to tackle emerging challenges in the outside world. Those challenges are vast, and they will take on new forms during the lifetimes of our students.
Several of the discussants made thoughtful comments regarding material that they felt is not sufficiently present in the Berkeley curriculum. Examples include Alan Fekete’s suggestion that more focus is needed on data storage and management, Dustin Tingley’s related suggestion regarding data wrangling, and Brian Kim’s request for deeper engagement with social science. Our first instinct in the face of such comments was to adopt a defensive posture—noting in particular that Data 100 has significant coverage of distributed systems and database concepts and noting that Data 102 presents basic social science concepts—but our second thought was a better one, which is to note that the Berkeley curriculum is very much a work in progress. In particular, colleagues Joe Hellerstein and Aditya Parameswaran are currently designing a data engineering class to augment our data science curriculum. Colleague Rediet Abebe is planning a computational social science class. We're excited to see data science providing a platform on which ideas from full-fledged intellectual disciplines such as these can be conveyed in new ways for new kinds of students.
As for Alfred Spector’s suggestion that optimization, and the perspective of operations research, is not sufficiently present in our curriculum, here we do feel compelled to go on the defensive. First, Data 100 uses optimization throughout, for example in the guise of empirical risk minimization as a unifying inferential framework, and in the use of gradient methods to derive the corresponding algorithms. Data 102 brings in multi-armed bandits, the minimization of false-discovery rates, and the optimization-over-time concepts of reinforcement learning. Moreover, our curriculum is informed by control theory, which we view as a union of statistical modeling and optimization. For example, our presentation of reinforcement learning emphasizes its roots in optimal control theory.
Second, we want to ensure that students are aware that an over-reliance on optimization tools, and on classical operations research perspectives, can be problematic. As we discuss in our classes, the naive optimization of click-through rate in modern information technology has arguably played a significant role in the rise of misinformation in automatic news feeds over the past decade. In designing systems that optimize a criterion that has social consequences of this kind, we want students to be aware of those consequences, and able to design around them.
If we had to name one discipline that is not sufficiently present in our data science curriculum, it would not be optimization; rather, it would be economics, including game theory, mechanism design, and behavioral economics. (Hint, hint, to our colleagues in the Department of Economics...) See Dustin Tingley’s discussion for more on this point, and see also the third author's article, “Dr. AI or: How I Learned to Stop Worrying and Love Economics,” published in the Harvard Data Science Review in 2019.
We were excited to learn more about the courses that have been designed at Olin/Harvard, as discussed by Allen Downey, and the iNZight project discussed by Chris Wild, Tom Elliott, and Andrew Sporle. There seem to be many commonalities among their perspectives and ours. The iNZight project has gone further than we have in the direction of visualization, and we admire their focus on "general-empowerment strategies." That said, they set up a pair of dichotomies that do not entirely resonate with us. First, they state "Our focus has been on conceptual development via unifying dynamic-visualizations driven by a graphical user interface (GUI) […] rather than on randomization as a computational device." We question whether this should be viewed as either/or. GUIs can help humans develop not only a clearer understanding of data but also an understanding of randomness, that is, of data that could have arisen but didn't. Randomness is too often reduced to a mere number, for example a P-value. We think that students would be better served by the creative use of visualization to help them appreciate the nature of randomness, so that they can learn how data lies in a bed of randomness.
Second, while we agree in principle with their assertion that "the data world needs many more skilled tool users than it does tool builders," we again wonder if this is the right dichotomy. There needs to be a third kind of person—a tool interpreter. It's not enough to build interfaces with buttons that people can easily push, unless they can fully and deeply understand what they see when those buttons are pushed. There is a big gap between GUI plots of data and the outputs of analyses of variance, survival analyses, fits of generalized linear models, etc. We need to train people who feel comfortable living in that gap.
Finally, in the spirit of the culinary presentation by Aleksandrina Goeva, Peyton Jones, Sara Stoudt, and Ana Trisovich, it’s time to turn to dessert. As those authors note (in agreement with Yolanda Gil), our connector model is a cherry on the top of our data science cake. The secret sauce in setting up this model—which involved the seemingly impossible task of setting up an academic environment in which several dozen faculty from around campus receive appropriate credit for participating in a new major that is administered by a separate campus unit—was to ensure that young faculty were included. Indeed, we often found willing partners among the assistant professors across our campus, several of whom were already skilled in data science and eager to bring it to their own departments. Their enthusiasm was such that many of them were willing to simply design and teach their connector course on a volunteer basis. We helped to incentivize such socially generous contributions by providing a community where they could share ideas, resources, and collegiality. In some cases, department chairs were infected by the enthusiasm, and, desirous of having their department not be left behind, they supplied teaching credit to these young faculty. Over time, this model built up credibility and strength across campus.
This credibility was one of the key factors that helped to support, and to shape, the evolution of Berkeley's Division of Computing, Data Science, and Society (CDSS), as discussed in Jennifer Chayes's article in the current issue of the Harvard Data Science Review. The evolution has become a co-evolution, with CDSS supporting the education program while engendering new research initiatives that include many of the young faculty who have contributed to the education program. While the overall structure is a familiar one on the academic landscape, bridging education and research, the connections are new ones, and the relevance to emerging real-world problems is striking.
We again thank the discussants, and the journal, and we look forward to continued dialog on data science education—one of the most exciting developments in modern academia.
This discussion is © 2021 by the author(s). The editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the authors identified above.