Whether students begin their data science education through Data 8 or an alternative program, we are especially energized by the concept of connector courses as a way to get experience answering questions in a data-driven domain. We see the potential for these types of courses to provide flexibility in bringing data science to other institutions and increased engagement for students and faculty alike.
For institutions where interest in data science is growing but no formal structure yet exists, aiming for a handful of connector courses might identify the faculty working on data-intensive research across the curriculum, who may be allies in a larger data science initiative. When first run as pilots, these connector courses could not assume the shared background that a class like Data 8 provides, but assessing the heterogeneity of background knowledge among students interested in these classes might help define what needs to go in a foundation course moving forward.
Just as the connector course idea broadens the pool of faculty members who may teach a data-related course, these interdisciplinary courses will likely engage a broader pool of students. Showing students the many ways data can help us ask and answer questions of interest will solidify data-driven thinking as a critical practice within the liberal arts.
Not every program has to offer a major in data science, but the goal may be to make sure every student knows how to approach a problem using data in the context that most interests them. The ideas in “Interleaving Computational and Inferential Thinking: Data Science for Undergraduates at Berkeley” (Adhikari et al., this issue) certainly provide a path forward toward this ideal, and we started to think of the article as providing a recipe for undergraduate data science at UC Berkeley (perhaps cooking is particularly on the brain in these pandemic times). What is data science if not a new fusion cuisine of statistics, computer science, and domain knowledge? We decided to contribute new flavors, cooking up connector courses that we would be excited to take ourselves. Enjoy our tasting menu focused on data science methods and analysis considerations.
How do we combat the garbage in, garbage out phenomenon? Just as chefs may ‘pick’ their own ‘ingredients’ and check that they are ethically sourced, data scientists can benefit from getting involved in the data collection process. A connector course could ensure data analysts don’t just treat data as a given; think of a typical experimental design and survey sampling class with a twist.
This class would bring together the principles of data collection and the ethical consequences of that data collection. What questions can we not answer with an experiment? What can go wrong in a survey? What problems can happen downstream as a result of data collection challenges? Students get the chance to evaluate the costs, both literal and social, of their data work alongside traditional tips for sample size and power calculations.
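As one illustration of the sample size and power calculations mentioned above, here is a minimal sketch for comparing two group means. It assumes a two-sided test, equal group sizes, and the normal approximation; the function name `sample_size_per_group` and the default values are our own choices for illustration, not part of any course material.

```python
import math
from statistics import NormalDist


def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample
    comparison of means, using the normal approximation.

    effect_size is the standardized difference (difference in means
    divided by the common standard deviation).
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)


# Smaller effects require many more participants, which raises the
# literal (and sometimes social) cost of the study.
for d in (0.8, 0.5, 0.2):
    print(d, sample_size_per_group(d))
```

A calculation like this gives students a concrete number to weigh against the costs of collecting each additional observation.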
As our ability to store and process large quantities of information has improved in recent decades, fields across all disciplines are leveraging the power of data to solve problems and make discoveries. However, complex data sets can be overwhelming. It can be hard to know where to start, much less discern relevant patterns.
This connector course would aim to help students apply and expand on the foundational data analysis skills learned from Data 8 by making these data sets more approachable through an emphasis on dimensionality reduction and visualization of patterns.
There are many considerations and approaches when reducing the data dimensions. What approaches learned in Data 8 still serve us well? As we start to simplify, how can we make sure we’re not over- or under-simplifying? How applicable are particular methods across a variety of domains? Students will learn to wield dimensionality reduction techniques to identify the patterns, connect them to the context, and leave the noise on the chopping block. Visualizations will help reveal the many facets that the data has to offer. Different contexts for motivating examples abound, from linguistics to genomics.
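To make the idea of dimensionality reduction concrete, the sketch below finds the first principal component of a small simulated data set using power iteration on the covariance matrix. Principal component analysis is only one of many possible techniques, and we chose it here purely as an example; the simulated data, the function name `first_principal_component`, and the iteration count are all assumptions made for illustration.

```python
import random


def first_principal_component(rows, iterations=200):
    """Return (direction, share of variance explained) for the first
    principal component, found by power iteration on the covariance matrix."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Power iteration: repeatedly multiply by the covariance matrix and
    # renormalize. This assumes the starting vector is not orthogonal to
    # the leading direction.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]

    variance_along_v = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    return v, variance_along_v / total_variance


# Simulated data: three measured variables driven by one hidden factor t.
rng = random.Random(0)
rows = []
for _ in range(200):
    t = rng.gauss(0, 1)
    rows.append([t + rng.gauss(0, 0.1),
                 2 * t + rng.gauss(0, 0.1),
                 -t + rng.gauss(0, 0.1)])

direction, share = first_principal_component(rows)
print([round(x, 2) for x in direction], round(share, 3))
```

When a single direction captures nearly all of the variance, as in this simulated case, three columns can reasonably be summarized by one; real data sets rarely simplify so cleanly, which is where the questions about over- and under-simplifying come in.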
There is no one question to ask of a data set and no one way to answer a question using data. Classic templates exist as a starting point, including those learned in Data 8, but there is room to customize for a given situation or personalize for an individual preference. A connector course centered on open data in general, or even narrowed to a particular open data set like the U.S. Census, American Community Survey, or cities’ open data portals, could give students the opportunity to define their own questions of interest and take steps toward answering those questions.
This class would help bridge the gap for students with newfound knowledge and skills from Data 8 who still have logistical questions about how to get started with their own data investigation. Where do I find data? How do I access it once I find it? What questions that I’m interested in can even be answered with this data? Given a ‘test kitchen’ environment, students can explore the many flavors of data and develop their own analysis tastes.
Or, as the original saying goes: ‘the proof of the pudding is in the eating.’ Just as you have to try out food in order to know if it is good, you have to deploy a prediction method on real-world, unseen-in-training data to know how useful it is. In Data 8, students focus on reasoning with data and are introduced to computation as a tool for inquiry, and to prediction methods as a means of making evidence-based decisions. Since these decisions are often impactful, we need a way of knowing how accurate our predictions would be when deployed in the real world.
Classical statistical wisdom tells us to split the available data for a prediction task into three parts, each representative of the whole data: training, validation, and test sets. But how large, relative to each other, should these three data sets be? Is 80%/10%/10% right? Might 60%/15%/25% be better? What metric should we use to compare the method’s predictions to the ground truth? And how can we check if each subsampled piece is representing the whole data set well? Could it be advantageous to purposefully deplete a subset of the data in the training set, to test how the method performs under ‘stress’ or on unseen scenarios? What about applying our trained, validated, and tested prediction method to new data, in the wild, where we don’t have ground truth to compare the method’s predictions to? What should we do when there is a substantial data set shift, that is, the data the method is trained on is notably different from the data we are hoping to apply it to?
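A three-way split, together with a simple check of whether each piece resembles the whole, might be sketched as follows. The simulated labels, the function names `three_way_split` and `positive_rate`, and the choice of comparing the share of positive labels are assumptions for illustration; a real course would likely compare several features and metrics.

```python
import random


def three_way_split(items, fractions=(0.8, 0.1, 0.1), seed=0):
    """Shuffle items and split them into training, validation, and test sets."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n = len(shuffled)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains
    return train, validation, test


def positive_rate(rows):
    """Share of rows whose label is 1."""
    return sum(label for _, label in rows) / len(rows)


# Simulated data: (id, label) pairs with roughly 30% positive labels.
rng = random.Random(1)
data = [(i, 1 if rng.random() < 0.3 else 0) for i in range(1000)]

train, validation, test = three_way_split(data, fractions=(0.6, 0.15, 0.25))
for name, part in [("all", data), ("train", train),
                   ("validation", validation), ("test", test)]:
    print(name, len(part), round(positive_rate(part), 3))
```

Comparing the printed positive rates across the pieces is one rough way to see whether a random split happens to leave any piece unrepresentative, especially when the pieces are small.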
This connector course will go through the dangers of double-dipping (i.e., reporting prediction accuracy on the same data used to train the model), the importance of quantifying data set shift between training/validation/test data and the data that the method will be deployed on in the wild, and strategies for collecting additional ground truth data when the shift is substantial.
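The danger of double-dipping can be shown with a small simulation. In this sketch the labels are pure noise, so no method can truly predict them, yet a one-nearest-neighbor classifier looks perfect when scored on its own training data. The classifier choice, the simulated data, and all names here are our own assumptions for illustration.

```python
import random

rng = random.Random(0)


def make_noise_data(n):
    """Random 2-D points with labels that are unrelated to the points."""
    return [((rng.random(), rng.random()), rng.choice([0, 1])) for _ in range(n)]


def predict_1nn(train, x):
    """Predict the label of the training point closest to x."""
    nearest = min(
        train,
        key=lambda item: (item[0][0] - x[0]) ** 2 + (item[0][1] - x[1]) ** 2,
    )
    return nearest[1]


def accuracy(train, data):
    """Fraction of points in data whose label is predicted correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)


train = make_noise_data(200)
held_out = make_noise_data(200)

# Double-dipping: each training point is its own nearest neighbor.
train_accuracy = accuracy(train, train)
# Honest estimate: points the method has never seen.
held_out_accuracy = accuracy(train, held_out)

print("accuracy on training data:", train_accuracy)
print("accuracy on held-out data:", held_out_accuracy)
```

The held-out accuracy should hover around one half, which is all that can be expected for coin-flip labels, while the training accuracy suggests a perfect method.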
Data science often happens under constraints. When tackling a problem we assess our resources including time, money, and data availability. We might need to adjust our goals with these limitations in mind. A case studies–based connector course could give students a taste of doing high-stakes data analysis under a time crunch to prepare them for fast-paced data science environments.
This class would be structured around case studies from a variety of data-related fields. The common theme would be some kind of constraint that had to be accommodated in the analysis (think the modern equivalent of the O-ring debacle, where extrapolation occurred under some political pressure). What choices lead to ‘good enough’ results? What aspects of an analysis should not be sacrificed at any cost? What would our Data 8 professors do in a pinch? Students would learn apprentice style as they worked through examples of others’ decision-making processes.
Many research results are taken with an extra grain of salt if they could not be independently reproduced. Some of the reasons for the observed lack of reproducibility are unavailable data, obsolete code, or an overall lack of transparency about what happened behind the scenes. A connector course that would help students understand the benefits of computational reproducibility and help them achieve it in practice would produce well-equipped data scientists conducting reliable and reusable research.
This connector course would address the following three questions: (1) What are the ethics behind data and code sharing? It is important to understand that data sharing is not always possible, but also that in some cases, sensitive data can be shared using novel tools. (2) How do we achieve analysis automation and reproducibility? A growing range of infrastructure, such as data and software repositories, tools such as containers and workflow engines, and features built into programming languages, can help researchers develop portable and reusable code. (3) How do we comprehensively document resources and enable their reuse? Adequate repository organization and documentation are critical for reuse, and they should be emphasized in the classroom. Lastly, open research resources need to be accompanied by permissive licenses, and students should learn how to choose and use them. As a result, students will become proficient curators of their work, competently sharing it for easy review or reuse.
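As a small taste of what reproducibility can look like within a programming language itself, the sketch below fixes a random seed and records a checksum of the input data alongside the software version. It is a toy example under our own assumptions; the function names and the fields of the record are hypothetical, and real projects would typically add containers, workflow tools, and repositories on top.

```python
import hashlib
import json
import platform
import random


def run_analysis(values, seed):
    """A stand-in analysis: the mean of a random subsample of the data."""
    rng = random.Random(seed)  # local generator, so the seed fully determines the result
    sample = rng.sample(values, k=5)
    return sum(sample) / len(sample)


def provenance_record(data_bytes, seed):
    """Describe what is needed to rerun the analysis and check the inputs."""
    return {
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "seed": seed,
        "python_version": platform.python_version(),
    }


values = list(range(100))
data_bytes = json.dumps(values).encode("utf-8")

result = run_analysis(values, seed=42)
record = provenance_record(data_bytes, seed=42)

# Sharing the record with the result lets a reviewer confirm they have the
# same data and can obtain the same number.
print(result)
print(json.dumps(record, indent=2))
```

Rerunning `run_analysis` with the recorded seed on data matching the recorded checksum should give the same result, which is the kind of check a reviewer could perform without contacting the original analyst.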
Just as photos and menu descriptions play a large role in convincing diners to order a dish, figures and reports help convince others that what we did is defensible, useful in context, and contributes to the field of data science. As a complement to the current Writing Data Stories connector course, another communication-style course could focus on argument-making over narrative-building. Another possible distinction is to have a connector course focused on communication to broader audiences and one focused on more formal communication to our peers.
How do we distill a data-driven argument from a body of work and hone that argument in both written and visual form? Students can learn from others’ approaches by reading data-driven arguments and evaluating data visualizations, determining what is and is not convincing. They can build their own set of best practices for writing and data visualization, compare them to classical theories of rhetoric and perception, and implement them to strengthen their own data-driven arguments.
Aleksandrina Goeva, Peyton Jones, Sara Stoudt, and Ana Trisovic have no financial or non-financial disclosures to share for this article.
©2021 Aleksandrina Goeva, Peyton Jones, Sara Stoudt, and Ana Trisovic. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.