As data-powered systems are present in every aspect of our lives, universities have a responsibility to prepare students to partake in the data workforce, to be informed about the privacy and ethics of data, and to be inspired to reinvent the future of data. The University of California at Berkeley has been a pioneer in establishing data science offerings for all its undergraduates. In addition, this effort has initiated important transformations to the campus culture, both in getting established faculty in all disciplines more involved with data science and in giving students opportunities to gain sophisticated competencies in data science. The article “Interleaving Computational and Inferential Thinking in an Undergraduate Data Science Curriculum” by Adhikari et al. (2021, this issue) describes the approach to data science education taken at Berkeley, and the growth of its successful undergraduate data science program in the five years since inception. In this article, I will discuss the pillars of the approach that shore up these exemplary programs. I will also argue for possible improvements to the approach that could broaden access for more students to careers in data science.
Data is very familiar territory for academics. From physical sciences to social sciences and everything in between, academics are well versed in the scientific method and constantly seek evidence in any decision-making situation. One would think that training students for the data age comes naturally to faculty who surround themselves with data on a daily basis.
Alas, this data age is not a continuation of these traditions. It is enabled by new ways of thinking about data, presenting new opportunities and challenges. Of course, we can analyze numerical data of ever larger sizes, but data can exist in so many other forms. Biologists can analyze images of tissues and organs, as well as videos of behaviors across different species. Sociologists can study in detail complex human interactions from data left behind by people in social sites. Engineers can run simulations that generate data about the response of a system under many different conditions. Collecting, finding, understanding, analyzing, and utilizing responsibly all these very diverse kinds of data at scale gives rise to methodologies, technologies, and ideologies that our students need to first master and then reinvent. They are amalgamated under the rubric of data science, and they are as new to students as they are new to most data-minded academics. Hence the profound importance of universities that are leading the way in transforming how we teach data skills and how we prepare students for productive careers in data.
So as data science programs have emerged in most universities large and small, many see their role as meeting the demand for a data-minded workforce. But others, as is the case with Berkeley, want to capture the essence of data science in an overarching pedagogical approach to graduate truly data-skilled students. Berkeley sees data science as emerging from the marriage between computing and statistics, and has therefore endeavored to teach students both computational thinking and inferential thinking.
In Berkeley’s core data science course, students learn computational thinking concepts such as abstraction and modularity. They also learn statistical concepts such as population sampling and uncertainty. Combining computation and inference provides a structured framework to introduce many important concepts in data science.
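To make the interplay of computation and inference concrete, here is a minimal sketch, not taken from Berkeley's course materials, that estimates the uncertainty of a sample mean with a percentile bootstrap. The synthetic population, the function name `bootstrap_mean_interval`, and all parameter values are assumptions chosen purely for illustration.

```python
import random
import statistics


def bootstrap_mean_interval(sample, n_resamples=2000, level=0.95, seed=0):
    """Return a percentile bootstrap interval for the population mean."""
    rng = random.Random(seed)
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Resample the observed data with replacement, keeping the same size.
        resample = rng.choices(sample, k=n)
        means.append(statistics.fmean(resample))
    means.sort()
    tail = (1 - level) / 2
    lower = means[round(tail * n_resamples)]
    upper = means[round((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical data: a synthetic population and one random sample drawn from it.
rng = random.Random(42)
population = [rng.gauss(170, 10) for _ in range(100_000)]
sample = rng.sample(population, 200)

low, high = bootstrap_mean_interval(sample)
print(f"sample mean: {statistics.fmean(sample):.1f}")
print(f"95% bootstrap interval: ({low:.1f}, {high:.1f})")
```

The resampling loop is the computational idea (a small, reusable abstraction), while the interval it produces is the inferential one (a statement about uncertainty in an estimate drawn from a sample).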
Naturally, in compressing both giant areas of thought into a single introductory course, many important aspects of each cannot be covered. But perhaps it is worth reconsidering the nature and relative proportion of topics that should be included in a fundamental data science course.
Consider first whether learning very basic programming skills prepares students for the data science world. While assembling small programs is always a useful skill to have, data problems quickly outgrow such basic programming skills. It is easy to create simple but very inefficient programs, and without a basic understanding of algorithmic complexity and parallelism one can easily get stuck.
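As an illustration of how a simple program can be needlessly inefficient, the following sketch compares two ways of checking a list for duplicates. The task and the function names are hypothetical examples, not drawn from any particular course; the point is only that the first version performs on the order of n² comparisons while the second makes a single pass.

```python
import time


def has_duplicates_quadratic(values):
    """Compare every pair of items: roughly n*(n-1)/2 comparisons."""
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                return True
    return False


def has_duplicates_linear(values):
    """Remember items already seen in a set: one pass over the data."""
    seen = set()
    for value in values:
        if value in seen:
            return True
        seen.add(value)
    return False


def timed(func, values):
    """Run func(values) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(values)
    return result, time.perf_counter() - start


# A list with no duplicates forces both functions to do all their work.
values = list(range(3000))
slow_result, slow_seconds = timed(has_duplicates_quadratic, values)
fast_result, fast_seconds = timed(has_duplicates_linear, values)
print(f"quadratic: {slow_result} in {slow_seconds:.4f} s")
print(f"linear:    {fast_result} in {fast_seconds:.4f} s")
```

Both versions give the same answer, but doubling the input size roughly quadruples the work of the first and only doubles the work of the second, a gap that becomes decisive at data science scales.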
In addition, data science curricula should extend the original computational thinking concepts (Wing, 2006) into the data realm. Data sets will more often be so large that they cannot be downloaded and processed in memory, requiring sophisticated data systems and distributed access in order to be manageable. These and other basic computing concepts are central to the modern data conversation and therefore belong in a modernized version of computational thinking.
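One small step toward this mindset is processing data as a stream rather than loading it all at once. The sketch below is a simplified illustration under stated assumptions: the in-memory text buffer stands in for a file too large to load, and the column name `temp_c` and the function `streaming_mean` are hypothetical. It shows only the single-machine streaming idea, not a distributed data system.

```python
import csv
import io


def streaming_mean(rows, column):
    """Average one column while holding only a single row in memory at a time."""
    count = 0
    total = 0.0
    for row in rows:
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")


# Stand-in for a very large CSV file; a real one would be opened with open(path).
fake_file = io.StringIO("station,temp_c\nA,10.0\nB,14.0\nA,12.0\n")

# csv.DictReader yields rows lazily, so the full data set is never materialized.
reader = csv.DictReader(fake_file)
mean_temp = streaming_mean(reader, "temp_c")
print(mean_temp)  # 12.0
```

Because the function keeps only a running count and total, its memory use stays constant no matter how many rows the source contains.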
Another important point is that a traditional statistics focus on tabular measurements comes at the expense of the rich variety and sophistication of data sources that are often available. Networks, text, images, videos, time series, multimedia, geospatial grids, simulations, and other rich forms of data cannot be examined with statistical techniques alone; they very quickly require new concepts and new types of analysis. It is important for students to be aware of the rich smorgasbord of algorithms for data analysis.
This would argue for teaching a broader set of computing fundamentals rather than investing time in teaching introductory programming. Forgoing introductory programming is actually possible now that Berkeley has given us open-source web-accessible electronic notebooks as highly r(e)usable vessels for running code (Pérez & Granger, 2007).
Jupyter Notebooks have democratized access to running code for scores of students and data enthusiasts, particularly in science. Back in 2015 there were already an estimated one million users of these notebooks, and by 2018 more than 2.5 million open-source Jupyter Notebooks were available on GitHub (Perkel, 2018).
Many of us use Jupyter Notebooks in introductory data science classes, much like they are used in Berkeley’s introductory data science course. I have watched our humanities students easily run sophisticated data science software through our carefully designed notebooks without ever having to look at the actual code or go through complicated installations of software libraries. Before notebooks, we used intelligent workflow systems to provide students with a simple graphical user interface to complex multistep analyses that non-computer science students could easily grasp (Gil, 2014, 2016). But while workflow sharing has not had a significant uptake, notebook sharing quickly became very popular.
The notebook culture is fundamentally welcoming, and in that it is notably different from the programming culture. I constantly hear from students and beginner data scientists who post programming questions in online forums and are berated for not having found previously posted answers, which often require a nontrivial understanding of the question in the first place. In contrast, notebook users can get the simplest questions answered in a friendly way. This gives students a can-do attitude that is so necessary for successful careers in data science: There is always something that one does not know and needs to learn by figuring it out on one’s own.
A major innovation in the data science program at Berkeley is the connector courses. These allow students to go deeper into the kinds of questions and problems in a particular discipline of study that could be tackled with data science approaches. The design of these courses follows a proven pattern and is interesting in itself, but I highlight here an even more interesting effect that these connector courses have had on the campus.
Faculty in different disciplines teach these data science connector courses. That means that faculty have to learn the new data science techniques themselves and apply them to problems in their discipline so they can be presented in class. The value of these courses to the students is undeniable. But one cannot overstate the profound transformative effect that this has on the faculty themselves. By learning to use and teach data science techniques, the faculty are in a great position to adopt these innovations in their own research. These courses are serving as an engine of change across the entire campus as much as they are serving the education mission of the university.
Another great transformative effect of the data science undergraduate curriculum has been on the student body. Scores of undergraduates serve every semester as teaching assistants and run lab sections. Through their involvement in teaching, these students gain not only a deeper grasp of the materials but a greater ability to explain data science concepts to beginners. This is another key skill in a data science career: communicating data science work to managers and decision makers who often lack technical expertise.
A hallmark of Berkeley’s approach has been to treat data science as a core component of undergraduate education. Thinking about data should come naturally to students in any major, and be as helpful to them as communication or critical thinking skills are, no matter their pursuits. This was a bold move at the time that has proven to be forward-thinking and has had an impact on many other universities. Thousands of students take the basic data science course every year, and thousands more are reached through online platforms for course delivery. This is an impressive change for an academic institution by any measure.
But is the Berkeley approach really democratizing access to a data science education for all students? On a campus with 30,000 undergraduates, an introductory data science course that is taken by 3,000 students each year is actually leaving a significant portion of students behind. One reason might be the strong requirements in math and statistics across the data science course offerings. As important as statistics is to data science, it is worth considering the alternatives. In any given data science project, for each hour spent on inferential thinking there are probably dozens of hours spent on computational thinking and many more on general creative workarounds to problems of data quality and lack of data in the first place. Should we require all students to have extensive knowledge of statistics before they can get basic training in data science? Doing so may end up erecting severe barriers to democratizing access to data science. My belief is that a little statistics goes a long way in data science, and with minimal mathematics requirements we will be more successful at attracting the broader base of talent that we know is needed to tackle the future of data.
I would like to thank the students in the USC Data Science program who inspire me every day to improve how we approach teaching and learning in this unique discipline. I would also like to thank my colleagues and collaborators over the years, particularly those who helped me see computing and data science through the lens of another discipline. I am very grateful to Kevin Knight and Victoria Knight for their comments on an earlier version of this article.
Adhikari, A., DeNero, J., & Jordan, M. I. (2021). Interleaving computational and inferential thinking in an undergraduate data science curriculum. Harvard Data Science Review, 3(2). https://doi.org/10.1162/99608f92.cb0fa8d2
Gil, Y. (2014). Teaching parallelism without programming: A data science curriculum for non-CS students. Proceedings of the Workshop on Education for High Performance Computing (EduHPC) (pp. 42–48). IEEE. https://doi.org/10.1109/EduHPC.2014.12
Gil, Y. (2016). Teaching big data analytics skills with intelligent workflow systems. Proceedings of the Sixth Symposium on Educational Advances in Artificial Intelligence (EAAI), 30(1). https://ojs.aaai.org/index.php/AAAI/article/view/9860
Pérez, F., & Granger, B. E. (2007). IPython: A system for interactive scientific computing. Computing in Science & Engineering, 9(3), 21–29. https://doi.org/10.1109/MCSE.2007.53
Perkel, J. M. (2018, October 30). Why Jupyter is data scientists’ computational notebook of choice. Nature, 563, 145–146. https://doi.org/10.1038/d41586-018-07196-1
Wing, J. (2006). Computational thinking. Communications of the ACM, 49(3), 33–35. https://doi.org/10.1145/1118178.1118215
This discussion is © 2021 by the author(s). The editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the authors identified above.