Working together over the past 5 years to figure out how we might best place ‘practice’ at the center of the data science education we offer, through Boston University’s M.S. in Statistical Practice (MSSP) program, has been a pleasure and a passion for the three of us. In reading the nine pieces contributed as discussion of our article, it was both instructive and inspiring to be reminded how much that pleasure and passion are shared by so many colleagues across such a wide range of disciplines, experiences, and walks of life. We are grateful to them for sharing these stories and insights resulting from their efforts. Moreover, it is clear that there is much cross-fertilization possible in this area. Below we comment briefly on some of the many themes raised by the discussants, particularly as we see them touching back upon those of our own in the original article. We close musing on the prospect for practice-centric learning to serve as a driver toward democratization of data science education.
The wealth and diversity of experiences communicated in the creative Q&A format of the piece by the Early Career Board members is wonderful and should be required reading for anyone developing or evolving a data science program. There were a few experiences that especially resonated. The notion of approaching consulting and collaborations as both a student (learning from your client) and a teacher (in explaining what you propose or have done) is perfect and very much captures the spirit we try to instill in our students. Similarly, the importance of balancing theory and practice in producing effective practitioners is a perspective central to our approach, as we mention toward the end of our article. And the ‘right’ balance will vary by individuals and their situations. Ultimately, however, our feeling is that regardless of which component(s) is placed at the center of a comprehensive data science education, all major components should be represented. Being conscious of balance and integration is critical. Finally, the point that value must be generated by the practicum projects is an important one. Indeed, as noted by these authors, industry partner projects necessarily are more speculative than not, rather than mission-critical. The trick is to find projects that clearly could be important to the company, leaving the demonstration of that potential value to the students, and hence inspiring both students and partners to come together in the work.
Fayyad and Hamutcu rightly call out the data science equivalent of the proverbial 500-pound gorilla in the room: just what is data science? And they provide a summary of their invaluable experience examining this very question, as it relates to data science education, at the Initiative for Analytics and Data Science Standards (IADSS) and the workshop organized by IADSS at the KDD 2020 conference. Like these authors, our preference too is to view data science as an ‘umbrella term.’ Their discussion on the challenge (futility?) of training ‘data science unicorns’ is important. Our approach has been to position our MSSP program as a statistically oriented data science degree, which helps significantly in avoiding trying to cover everything that might fall under a data science education. At the same time, we aim to provide an educational framework that is sufficient to allow all students a broad common experience while also allowing each student ample opportunity to specialize in ways true to their prior training and experiences and their goals. So, for example, as these authors conjecture, we indeed allow students within (sub)teams to specialize in their roles to a reasonable extent, thus leveraging and further developing their strengths in a way similar to what they can expect in medium-to-large teams in an industry setting. On the other hand, for the many students who may instead find themselves on small teams or alone (particularly in their first job after the degree), the general education they receive can be a powerful enabler for both them and their employers.
Hardoon similarly raises concerns about the breadth of topics that fall under the data science umbrella, and its implications on data science education, in asking whether data science programs are at risk of simply producing a ‘jack of all trades.’ That is, a world of generalists, without specialists. Again, the reality of data science teams with mixed expertise in professional settings is important, as Hardoon points out, and something that can be mimicked in a practice-centric education. At the same time, the knowledge that prospective employers will be hiring in this way—seeking not just those with ‘data science’ formally in their degree name, but also those specializing more and emerging instead with degrees saying ‘computer science,’ ‘engineering,’ ‘mathematics,’ ‘statistics,’ and related—is a good reminder, in our collective rush toward developing data science education, of the critical role played and the value brought by degrees in each of these areas toward doing effective data science in practice.
Mira and Wit touch on this theme again to some extent, but focus instead on the outcomes that are desirable and that could be used to develop key performance indicators. And they are correct that problem clarity, a sense of being an experimental detective, and adversarial reasoning (most frequently emerging in Q&A during our consulting project presentations!) are all important elements present in our Statistics Practicum. We would, however, disagree with the analogy of the practicum to a capstone—both in the literal sense of the architectural image invoked by the authors and also, in particular, in the sense that the word has been used traditionally in education. As we note early in our article, capstones typically follow after students have progressed through the vast majority of their so-called foundational courses. And, just as Mira and Wit describe, such capstone courses are viewed as sitting at the apex and pulling things together as the last step. Our practicum, in contrast, acts very much as a central pillar around which the program is built, present from the ground up, rather than only sitting at the top.
Utts, in her fascinating and personal retrospective, highlights another key alternative (i.e., in addition to capstones) to our statistics practicum: the statistical consulting course. This is a time-honored and, in our opinion, highly effective mechanism for injecting a practice-centric element into a (usually) graduate-level education in statistics. In fact, courses at Stanford and UCLA that two of us took, and at the University of Chicago that one of us taught, were formative experiences that helped mold the perspectives we have brought to bear in developing our statistics practicum. And, importantly, as Utts points out, it generally takes a much smaller investment to set up and run a statistical consulting course than a practicum like ours. Actually, in explaining our MSSP program to both university administrators and prospective students, we often find it useful to simplify the discussion by saying that we effectively developed (i) a new master’s program, and (ii) a statistical consulting program, with a key innovation being that they are intimately entwined through the statistics practicum at the center of everything. Additionally, as an aside—we love the title “Statistical Practice Is Not a Spectator Sport” and plan to shamelessly borrow it (with proper attribution, of course) going forward!
Yu and Li remind us, in recounting their own experiences teaching practice-centric learning to master’s students, that there are many variations that one can take on the theme. Citing a less rich set of possible industry partnership options than what we have at Boston University, they describe how they developed a two-semester data science project as a core course, where projects come from university partners. In the first year of our program, we in fact took a similar approach for similar reasons: partner projects in our statistics practicum for our first class of students (all six of them!) were defined leveraging existing collaborative research opportunities with colleagues in various domain areas on campus. Importantly, successful completion of these projects, in turn, provided us with a portfolio of sorts that we then brought to prospective partners in industry and government. Similarly creative was the solution Yuan and Li put forth for challenging their data mining students to present effectively to nonexperts by enlisting high school students to play the role of a ‘boss.’ Finally, we are highly sympathetic to and appreciative of the fact that these authors raise the challenges to practice-centric learning in the COVID-19 era and beyond. In our own program, the issues we have had to address range from minor ones, such as how to facilitate dynamic group discussions over video conferencing and how to schedule classes that work for students joining in across different time zones, to more serious ones, such as those regarding overseas access of protected data (from both legal and data security standpoints). The latter are particularly challenging for a practice-centric program like ours, in that we needed to forgo some potential partnership opportunities with companies whose concerns about data access from abroad could not be adequately addressed.
Horton and colleagues raise the excellent point that instantiations of practice-centric data science education need to extend beyond postgraduate degree programs and, ideally, would be introduced at the undergraduate level. The National Science Foundation (NSF)-sponsored Data Science Corps: Wrangle, Analyze, Visualize (DSC-WAV) project is making important inroads toward achieving this goal. And, as those authors point out, there are many commonalities between their experience and ours, not only in approach but also in terms of highs and lows. For example, adapting an agile development framework within which to conduct the work, and reaping the corresponding benefits in terms of coordination of team efforts and overall project progress, versus the friction that arises when it comes to assessment within what is still ultimately an academic environment. Sustainability beyond the initial NSF funding for this initiative is also, as the authors say, an important challenge to be resolved. They mention prospects for folding the essence of DSC-WAV into the curriculum as a type of consulting course. A similar approach has been pioneered at our own institution by the BU Spark! Program, now as part of the new Faculty for Computing and Data Sciences unit. Spark! is the result of five-plus years of consistent investment and experimentation by the university in experiential learning in computing and data science for undergraduates. There are on/off-ramps for students of all years and levels of training to join the Spark! journey. Central to this is the “X-Lab” experience, where external projects can be incorporated into curricula in a variety of ways, including through a number of practicum courses with different emphases. This model, wherein a central entity (Spark!) takes the lead on developing and maintaining partner relationships, puts in place easily adoptable and replicable constructs (i.e., X-labs), and facilitates interfacing with more traditional academic courses, is particularly useful for addressing challenges around scaling and quality control.
Uminsky similarly raises the importance of being able to replicate and scale practice-centric innovation in data science education of the type being discussed here. His call to action is worth repeating: “there is an opportunity to develop sustainable, modular, and effective educational models that allow for easier adoption by data science educational programs of all varieties, in all locations (e.g., urban, ex-urban, and rural), to support centering practice.” And we appreciate his ask that we “highlight the value of the MSSP practice-centric model as an engine for positive impact, as a whole, on underrepresentation in data science.” The opportunity inherent in data science education to effect positive change on underrepresentation has been noted by many at this stage, including at a dedicated session of the National Academies of Sciences Data Science Education Round Table co-led by one of us in 2018. But with the challenges to successfully leveraging this opportunity being many (ranging from overworn excuses about ‘the pipeline’ to serious barriers around financial reach, particularly acute at the master’s level, where most institutions’ financial models do not support generous aid), it is time to start focusing on solutions rather than potential. We would particularly echo the importance of Uminsky’s emphasis on calling for trans-institutional collaboration near the end of his discussion (of which DSC-WAV is an example!). Individually, every academic institution looks inward and sees resources more limited than necessary to address underrepresentation through scaling of practice-centric data science learning. Collectively, however, our resources are vast and our awareness of the commonality of both our challenges and our possible solutions is quickly growing.
This suggests that what is needed at this stage is an increased emphasis on mechanisms, processes, and logistics for capturing and disseminating replicable infrastructure to support successful experiences that can be extended and scaled. This itself is a truly challenging space. The University of California at Berkeley’s experiences in developing, scaling, and making replicable their innovative Data 8 course is an important example worth studying in this regard.
A critical and necessary aspect of data science education moving in this direction will be standardization. As Rodolfa and Ghani reflect in their insightful discussion, there is an important parallel to be drawn between medicine and data science, where both have practice at their center while drawing in turn on multiple ‘core’ disciplines. And there are important lessons that data science education can learn from medicine, one of which is the positive impact had on medical training when standardization was introduced. How best to standardize data science education, and to what extent, is an open question of interest to many in the community currently. Beyond what specific topics to cover, there are questions of what length of time programs should be, how to balance breadth versus depth (i.e., ‘unicorns’ versus ‘experts’), and how to codify what it means to have a successful practical experience. This latter point, if anything, in our opinion just further highlights the need for developing an appropriate and effective set of assessment tools for practice-centric data science educational environments. Coordinated efforts at standardization of data science education are already underway, with groups like the Accreditation Board for Engineering and Technology (ABET) and the Computing Sciences Accreditation Board (CSAB) having recently formed a joint task force with experts from other professional societies.
The practice-centric perspective in data science education, while far from new, arguably is reaching a tipping point in the modern era. While there is still plenty of work to be done developing and assessing best practice around pedagogy within individual classrooms and programs, increasingly it will be an investment in the infrastructure for delivering replicable high-quality practical experiences at scale that will truly move practice to the center of the educational experience in data science widely. And, in doing so, the practice-centric component will manifest as a central driver to the democratization of data science.
Eric D. Kolaczyk, Haviland Wright, and Masanao Yajima have no financial or non-financial disclosures to share for this article.
©2021 Eric D. Kolaczyk, Haviland Wright, and Masanao Yajima. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.