Data science is emerging from an interdisciplinary integration of mathematics, statistics, computer science, and many other application domains such as business, transportation, biology, and education. Parallel to the emergence of data science, a new field is developing as well—data science education.
To the best of our knowledge, which is based on our extensive research in data science education, the main research topics of this emerging field have not yet been mapped, and its scope, as reflected by the papers published so far on data science education, has not yet been defined. Several systematic reviews on topics in data science education have been written in recent years, but they tend to focus on specific topics or specific populations, such as K-12 learners.
To partially close this gap, we conducted a systematic machine learning–based literature review. We collected 1,048 papers in the field of data science education using both keyword searches in well-established scientific databases as well as semantic searches in semantic scientific databases. Using SPECTER to generate document-level embedding, we clustered the papers by applying the k-means algorithm. The titles and abstracts of the papers were examined manually to identify main cluster topics. The result is a framework of 26 clusters, organized into five superclusters: (a) curriculum, (b) pedagogy, (c) STEM skills, (d) domain adaptation, and (e) social aspects.
The large body of literature on data science education that exists and the framework of its research topics that we revealed indicate the birth of a new scientific discipline—data science education.
Keywords: data science education, systematic literature review, SPECTER embeddings
Data science is emerging from an interdisciplinary integration of mathematics, statistics, computer science, and many other application domains such as business, health, education, and transportation. In parallel, the new field of data science education is developing as well. As of today, data science education is growing mainly in the context of other educational communities, such as computer science education, statistics education, engineering education, business analytics education, social science education, and many other educational and research communities (Hazzan & Mike, 2021). Only a handful of journals are dedicated to data science education (for example, the Journal of Statistics and Data Science Education, the special issue of the Statistics Education Research Journal on Data Science Education, and HDSR’s Data Science Education collection).
Since the research and knowledge regarding data science education is spread over many communities and journals, it does not seem trivial to ask the following question: What are the main research topics of data science education? Several reviews were written in recent years on specific topics within data science education. Such topics include, for example, data science curricula (Wu, 2017), data science skills (Gurcan & Cagiltay, 2019), data science for non-STEM majors (Barboza & Teixeira, 2020), and machine learning for K-12 pupils (Foster & Tasnim, 2020; Lupusoru et al., 2021; Martins & Gresse Von Wangenheim, 2022). These reviews, however, focused on specific aspects of data science education, but did not reveal a comprehensive picture of what data science education is. To partially close this gap, this article presents a systematic review of the literature on data science education.
While a systematic literature review is a well-established method for academic research, the case of data science education poses several challenges. First, no designated journals on data science education exist, nor has even a single research community been established that can help focus the search. Second, any systematic literature review is limited by the capacity of the human researcher and, therefore, it may be appropriate for a narrow topic of interest with several dozen to several hundred papers. Data science education, however, as presented later in this article, is a wide field of research with thousands of published research papers, far beyond the surveying capacity of a human researcher. We therefore recruited data science itself for this task, and specifically the emerging technology of producing machine learning–based systematic literature reviews.
The result is a framework of 26 clusters, organized into five superclusters: (a) curriculum, (b) pedagogy, (c) STEM skills, (d) domain adaptation, and (e) social aspects. The first supercluster addresses the components of data science programs, that is, what to teach. The second focuses on how to teach data science, which is a complex task, as will be described later. The third and fourth superclusters refer, respectively, to the teaching of specific components of data science: the teaching of the STEM components of data science (mathematics, statistics, and computer science) and the teaching of data science within an application domain. The fifth supercluster, social aspects, focuses on the learners as data science users and practitioners.
The rest of this article is organized as follows: Section 2 presents the background for this research, including a review of the short history of data science education and a theoretical background regarding manual and algorithmic systematic literature reviews. Section 3 describes the research method and Section 4 presents the results. We discuss the results in Section 5 and conclude in Section 6.
In recent years, several committees were formed to discuss data science education, and specifically to formalize a data science curriculum. In 2015, the National Science Foundation (NSF), together with the ACM (Association for Computing Machinery) Education Board and Council, organized a workshop on data science education (Cassel & Topi, 2016). The product of this workshop was a report titled Strengthening Data Science Education Through Collaboration. One of the main motivations for organizing this workshop was the shortage of data science professionals, which in turn prevented organizations and societies from enjoying the potential benefits of data science. Although by 2015 several educational programs had already been launched to address this shortage, many workshop participants felt that the quality and future direction of these programs, as well as their structures and practices, did not provide a sufficiently broad perspective on data science and should be defined better. The goal of this workshop was, therefore, “to start a conversation to address these concerns and develop a deeper integrated understanding of the best ways to offer data science education, ultimately leading to a better prepared workforce” (Cassel & Topi, 2016, p. 3).
In the summer of 2016, the NSF and the Institute for Advanced Study at Princeton University funded another workshop that focused on formulating curriculum guidelines for an undergraduate data science degree. This workshop was held at the Park City Math Institute at Princeton University in New Jersey. The workshop product was a report titled Curriculum Guidelines for Undergraduate Programs in Data Science (De Veaux et al., 2017) that, among other things, emphasized that the guidelines were not prescriptive, but rather, designed to inform and enumerate the core skills that a data science major should have.
In December 2016, the US National Academies of Sciences, Engineering, and Medicine established the Committee on Envisioning the Data Science Discipline: The Undergraduate Perspective, charging it with the goal of setting forth a vision for the emerging discipline of data science at the undergraduate level (National Academies of Sciences, Engineering, and Medicine, 2018). The committee recognized two challenges in designing data science programs: the diversity of the domains in which data science programs are developed and the complex nature of data science itself. The committee’s final report stated that:
Current data science courses, programs, and degrees are highly variable in part because emerging educational approaches start from different institutional contexts, aim to reach students in different communities, address different challenges, and achieve different goals. This variation makes it challenging to lay out a single vision for data science education in the future that would apply to all institutions of higher learning, but it also allows data science to be customized and to reach broader populations than other similar fields have done in the past. (National Academies of Sciences, Engineering, and Medicine, 2018, p. 16)
In parallel to the efforts made in the United States, the European Commission funded the EDISON project in 2017 to define the data science profession and promote the education and training of data scientists (EDISON, n.d.). The EDISON project produced the EDISON Data Science Framework (EDSF), which is composed of four documents that define the data science profession and its required competences and knowledge. In addition, the EDSF also presents a model for a data science curriculum (EDISON, 2017).
The last committee we review here is the ACM Data Science Task Force formed in 2017 by the ACM Educational Council. The task force’s mission was to “explore a process to add to the broad, interdisciplinary conversation on data science, with an articulation of the role of computing discipline-specific contributions to this emerging field” (Danyluk & Leidig, 2021, p. 6).
Following the development of data science courses and data science curricula for undergraduates, an effort was made to adjust and develop data science courses and data science curricula for school pupils as well. In 2017, a symposium on data science curricula for schools was held in Paderborn, Germany (Biehler & Schulte, 2018), and soon afterwards, a draft of a data science curriculum for school pupils was published (Heinemann et al., 2018). Following this publication, the International Data Science in Schools Project (IDSSP) published a curriculum framework for introductory high school data science (Fisher et al., 2019). In addition, the Mobilize Introduction to Data Science program was developed aiming to enhance computational and statistical thinking skills so that pupils will be able to access and analyze data from a variety of traditional and nontraditional sources (Gould et al., 2018). In Israel, a data science course for high school computer science pupils has since been integrated into the current, official Israeli high school computer science curriculum (Mike et al., 2020). Rosenberg and Jones (2022) recently completed a comprehensive systematic review of data science programs for K-12.
Beyond the academic attempts to define data science curricula and programs, the industry also launched several initiatives to define the data science body of knowledge and skills, from a commercial perspective. One example is the Initiative for Analytics and Data Science Standards (IADSS), developed by Usama Fayyad and Hamit Hamutcu (2020). This initiative was motivated by the need of the data science ecosystem to address the needs of data science education and to promote cooperation between academic institutions and nonacademic initiatives that also train data scientists.
The aim of this research is to characterize the field of data science education by identifying the main research topics explored in the field using a systematic literature review. According to Kitchenham (2004), “Systematic review is a means of evaluating and interpreting all available research relevant to a particular research question, topic, area, or phenomenon of interest” (p. 1). To fairly evaluate a research topic, a trustworthy, rigorous, and auditable methodology is required, and indeed, over the years several methods have been proposed for systematic literature reviews, for example, by Kitchenham (2004), Nightingale (2009), and Torres-Carrión et al. (2018).
The fast growth in the number of scientific publications and the concurrent advances in natural language processing (NLP) have led to the development of new systematic literature review techniques (Tandjung & Fudholi, 2022), which largely fall under the umbrella of topic modeling. In general, topic modeling refers to algorithms that can extract latent variables from large data sets and can, for example, support the structuring of databases of academic papers into groups based on similar focus areas (Vayansky & Kumar, 2020). In recent years, NLP-based systematic literature reviews have been published in several domains, such as machine learning research (Sharma et al., 2019), mobile learning (Hamzah et al., 2020), and artificial intelligence in education (Paek & Kim, 2021).
Running an NLP-based systematic review requires a numerical representation of documents. Although several methods for representing documents as numbers exist, for example, TF-IDF (term frequency-inverse document frequency) (Joachims, 1996), the current state-of-the-art method for representing words and documents is deep learning. SPECTER is a method based on training transformer language models to generate document-level embeddings of scientific documents (Cohan et al., 2020). The training is based on pretraining a transformer language model (BERT) on the citation graph, which is a powerful signal of document-level relatedness. Recent research has demonstrated the use of SPECTER embeddings for academic research. For example, Yamazaki et al. (2022) experimented with SPECTER representation for paper mining and Mozgai et al. (2022) used the technique for scoping reviews.
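To make the idea of a numerical document representation concrete, the following minimal sketch computes TF-IDF vectors for three short, made-up titles and compares them with cosine similarity. It illustrates only the simpler TF-IDF baseline mentioned above, not SPECTER, which is a learned model; the weighting variant (tf = count / document length, idf = log(N / df)), the whitespace tokenization, and the example titles are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical titles, invented for this illustration only.
docs = [
    "teaching statistics in data science courses",
    "statistics curriculum for data science majors",
    "cloud computing tools for big data",
]
vecs = tfidf_vectors(docs)
# The first two titles share informative terms; the third shares only "data",
# which appears in every document and therefore receives zero weight.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))
```

Because TF-IDF relies on exact word overlap, two papers on the same topic that use different vocabulary would look unrelated, which is one motivation for embedding-based representations.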
To cluster the documents, we used the k-means algorithm (Hartigan, 1975), and to label the clusters, we applied a qualitative manual process of coding and categorization for revealing a grounded theory (Glaser & Strauss, 2017).
The research comprised the following steps (see Figure 1):
First, we planned the research according to the systematic literature review methodology.
The second stage, data collection, is common to systematic literature review and to grounded theory. The data science technique used in this stage was a semantic search, which is a subfield of NLP (see Section 3.2).
The third stage, data synthesis, was performed by applying data science methods and the grounded theory practice of coding and categorization. The data science methods used in this step were document embedding using SPECTER and the k-means algorithm (see Section 3.3). The coding and categorization practice of grounded theory is an iterative process that aims to enhance category coherence, and in which the researchers identify categories and refine them as the data is rereviewed. The iterative process of coding and categorization is reflected in the co-refinements of the k-means hyperparameters and the semantic understanding of the clusters.
Finally, we report our findings, which reflect a first attempt at defining and mapping the emerging field of data science education (see Section 4). Using grounded theory terminology, this mapping is the emerged theory, or in less formal terminology, it is the theoretical framework.
The complete code for this research can be found here: https://github.com/Data-Science-Education-Review/litreture-review.
Figure 1. Research methods: Systematic literature review, grounded theory, and the data science technique of machine learning.
As mentioned, a systematic literature review is a common academic research method and several guidelines have been published for its implementation, for example, Kitchenham (2004), Nightingale (2009), and Torres-Carrión et al. (2018). These guidelines, however, were written with manual systematic reviews in mind, and do not consider algorithmic systematic reviews. We therefore followed the Procedures for Performing Systematic Reviews (Kitchenham, 2004) with the necessary adaptations for the case of an algorithmic systematic review. According to these guidelines, after defining the research question(s), the following steps are required for data collection: defining the search strategy, defining the selection criteria and procedures, and defining a quality assessment measure for each study.
Gusenbauer and Haddaway (2020) investigated which academic search systems are suitable for systematic reviews or meta-analyses. Their analysis included 28 resources, including Google Scholar and Web of Science, which they classified either as being suitable to serve as a principal source or as a supplementary source. In our case, we collected data from two principal sources that are relevant for data science education research: the ACM Digital Library (ACM-DL) and Web of Science (WoS). As supplementary sources we selected Google Scholar, because it contains references to a body of research that is not covered by Web of Science (Martín-Martín et al., 2018); the Semantic Scholar database, because it allows a semantic search of academic papers (Ammar et al., 2018); and the IEEE Xplore database, as a major source of engineering education research.
To answer our research question ‘What are the main research topics of data science education?,’ the research aims to collect all papers published in academic journals and conferences that discuss data science education. Kitchenham (2004) proposed searching for various terms that represent various facets of the research question. To limit the scope of the review, we searched only for the term ‘data science,’ omitting related terms such as ‘data analytics’ and ‘data engineering.’ At the same time, to include the core components of education, we did include the terms ‘teaching,’ ‘curriculum,’ and ‘pedagogy’ (Shulman, 1986). Selection criteria included the terms ‘data science’ and one of the terms ‘education,’ ‘curriculum,’ ‘pedagogy,’ and ‘teaching’ in the paper title or the paper abstract (see Appendix). We did not search for the search terms in the full text because a full-text search retrieved many non-relevant papers (for example, papers about data science applications in the domain of education).
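The selection criteria above amount to a simple Boolean rule, which can be sketched as follows. This is an illustrative sketch rather than the code used in the study: the function name is hypothetical, and it assumes case-insensitive substring matching over the title and abstract taken together (so, for instance, 'educational' also matches 'education').

```python
import re

# Terms taken from the selection criteria described in the text.
REQUIRED_TERM = "data science"
EDUCATION_TERMS = ("education", "curriculum", "pedagogy", "teaching")


def meets_selection_criteria(title: str, abstract: str) -> bool:
    """Return True if the title/abstract contain 'data science' and at
    least one education term (assumed: case-insensitive substring match)."""
    text = f"{title} {abstract}".lower()
    # Collapse whitespace so line breaks inside a phrase do not hide a match.
    text = re.sub(r"\s+", " ", text)
    if REQUIRED_TERM not in text:
        return False
    return any(term in text for term in EDUCATION_TERMS)


# Hypothetical records, invented for illustration.
print(meets_selection_criteria("Teaching data science to undergraduates", ""))
print(meets_selection_criteria("Data science for predicting dropout",
                               "We apply models to student records."))
```

For databases without field-level Boolean search, a filter of this kind would be applied after retrieval, to the title and abstract of each returned record.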
For this research, which aimed to identify trends of a large body of research, we were interested only in the research topic and not the research quality. Therefore, we did not look at parameters of research quality such as research method or number of participants. We did, however, verify text validity, and all paper titles and abstracts were manually reviewed by all authors of this article. Papers with nonvalid titles or abstracts were excluded.
All search queries were executed in September 2022. ACM-DL, WoS, and IEEE Xplore support Boolean search in specific fields such as paper title and paper abstract. Google Scholar and Semantic Scholar do not support Boolean search, and so for those databases, we ran full-text database queries, retrieved all returned documents, and excluded papers that did not meet the selection criteria after they were retrieved (see Section 3.2.5).
Table 1 presents the total number of papers retrieved, excluded, and included in the research. Papers were excluded if the search terms were not found in the paper title or paper abstract (see Section 3.2.4). Papers were included only if we could retrieve their SPECTER embeddings from the Semantic Scholar database. Even though about 20% of the papers found in the other sources were missing from Semantic Scholar, Semantic Scholar does cover a wide variety of academic sources (see https://www.semanticscholar.org/about/publishers) and can, therefore, be considered a reasonable sampling frame for this research. To maintain cluster sizes of at least 20 papers, we did not choose a larger number of clusters.
Database | Retrieved by Query | Meet Exclusion Criteria | Do Not Meet Inclusion Criteria | Included in Research |
---|---|---|---|---|
ACM-DL | 335 | 46 | 77 | 212 |
Google Scholar | 2,446 | 1,228 | 602 | 616 |
IEEE Xplore | 179 | 15 | 21 | 143 |
Semantic Scholar | 29,771 | 29,030 | 53 | 688 |
Web of Science | 675 | 222 | 31 | 422 |
Total | | | | 2,081 |
Duplicates | | | | 1,033 |
Total unique | | | | 1,048 |
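As a consistency check on Table 1, the short sketch below recomputes the included counts from the retrieved and excluded counts and then removes the duplicates. The numbers are transcribed from the table; the variable names are ours and purely illustrative.

```python
# Counts transcribed from Table 1:
# (retrieved by query, meet exclusion criteria, do not meet inclusion criteria)
table1 = {
    "ACM-DL": (335, 46, 77),
    "Google Scholar": (2446, 1228, 602),
    "IEEE Xplore": (179, 15, 21),
    "Semantic Scholar": (29771, 29030, 53),
    "Web of Science": (675, 222, 31),
}

# Included = retrieved - excluded - not meeting the inclusion criteria.
included = {db: r - e - n for db, (r, e, n) in table1.items()}
total_included = sum(included.values())

duplicates = 1033  # papers found in more than one database
total_unique = total_included - duplicates

print(included)
print(total_included)  # 2081
print(total_unique)    # 1048
```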
Data were analyzed using the iterative coding and categorization process of grounded theory, with the support of machine-learning techniques (k-means) to cluster the data. As the k-means algorithm is a hard-clustering algorithm that assigns each sample to a single cluster, it does not treat multiple-topic documents well (Mozgai et al., 2022). For example, a document that discusses teaching methods for machine learning for K-12 pupils should belong to the teaching methods cluster, the machine learning cluster, and the K-12 cluster. To mitigate this shortcoming of the k-means algorithm, we used a large number of clusters, based on the elbow method (Syakur et al., 2018). As the number of clusters increased, the resulting clusters were smaller and more homogeneous. Also, a larger number of clusters represents more topics and minimizes the chance of overlooking a topic. Ultimately, we selected 26 clusters, each of which ended up containing 21 to 73 papers. The clusters were manually labeled by all authors of this article using the grounded theory methodology (Glaser & Strauss, 2017). This procedure enabled us to find connections between the clusters and build categories as well as superclusters. Disagreements between the authors regarding cluster labels were discussed until agreement was achieved. About two-thirds of the clusters were focused and were easy to label and to agree upon. About one-third of the clusters were broader and more difficult to label, and longer discussions were needed to reach agreed-upon labels.
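The clustering step and the elbow heuristic can be illustrated with a minimal, dependency-free sketch. It is not the code used in the study: it runs a plain Lloyd-style k-means on hypothetical two-dimensional toy points that stand in for the document embeddings, and prints the within-cluster sum of squares (inertia) for several values of k. The function names, the toy data, and the seeds are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain Lloyd-style k-means. Returns (labels, centroids, inertia)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [None] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins exactly one cluster (hard clustering).
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    inertia = sum(squared_distance(p, centroids[lab])
                  for p, lab in zip(points, labels))
    return labels, centroids, inertia


def make_toy_embeddings(seed=0):
    """Three hypothetical 2-D blobs of 20 points each."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (5.0, 5.0), (0.0, 5.0)]
    return [
        (cx + rng.gauss(0, 0.3), cy + rng.gauss(0, 0.3))
        for cx, cy in centers
        for _ in range(20)
    ]


points = make_toy_embeddings()
inertias = {k: kmeans(points, k)[2] for k in range(1, 7)}
for k, value in inertias.items():
    print(k, round(value, 2))
```

In the elbow method, one looks for the value of k beyond which the inertia stops dropping sharply. On real embeddings the bend is usually far less pronounced than on toy data, so the choice of k remains a judgment call that also weighs considerations such as minimum cluster size.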
We divide the limitations into two sections: SPECTER and clustering.
SPECTER
Even though SPECTER embeddings achieve superior results on various NLP tasks performed on academic papers, compared with other document representation methods, SPECTER representation has not been tested for the task of literature review and only a few researchers have used it for similar tasks such as scoping review (Mozgai et al., 2022).
SPECTER embeddings have been tested mainly on interdomain tasks. In our research, the papers were all collected from one domain (data science education) and so it was harder to differentiate between the papers.
We could not include all of the retrieved papers in the research since we could not retrieve the SPECTER embeddings of all papers.
Clustering
Cluster generation was sensitive to the random initialization of the k-means algorithm, indicating that other options of clustering were possible.
Some of the clusters were difficult to label and long discussions between the authors were needed to reach agreed-upon labels. While the labels do fit most of the papers in each cluster, some of the papers do not seem to fit the agreed-upon label, and we treat them as outliers.
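One way to quantify the sensitivity to random initialization noted above is to compare the partitions produced by two runs. The sketch below uses the Rand index, the fraction of paper pairs on which two partitions agree, as one possible measure; this is our illustrative choice and was not part of the reported analysis, and the label lists are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of item pairs on which two partitions agree, that is,
    both place the pair together or both place it apart."""
    if len(labels_a) != len(labels_b):
        raise ValueError("partitions must cover the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        return 1.0  # fewer than two items: nothing to disagree on
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Hypothetical cluster labels for six papers from three k-means runs.
run_1 = [0, 0, 1, 1, 2, 2]
run_2 = [1, 1, 0, 0, 2, 2]  # same grouping, different label names
run_3 = [0, 0, 0, 1, 2, 2]  # one paper moved to another cluster

print(rand_index(run_1, run_2))  # 1.0
print(rand_index(run_1, run_3))  # 0.8
```

Because the measure depends only on which papers are grouped together, it is unaffected by the arbitrary numbering of clusters across runs.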
The 1,048 papers collected were grouped into 26 clusters, which were then coded (labeled) by the authors to extract their topics (see Table 2; full details of the papers appear in the spreadsheet). After the clusters were labeled, they were organized in five superclusters: (a) curriculum, (b) pedagogy, (c) STEM skills, (d) domain adaptation, and (e) social aspects.
Cluster Number | Supercluster | Papers in Supercluster | Cluster | Papers in Cluster |
---|---|---|---|---|
1.1 | Curriculum | 297 | Principles of data science curriculum design | 73 |
1.2 | | | Approaches to data science education | 72 |
1.3 | | | Introduction to data science course | 63 |
1.4 | | | Curriculum design for K-12 pupils | 48 |
1.5 | | | Curriculum design for data science majors | 41 |
2.1 | Pedagogy | 162 | Teaching AI and machine learning | 45 |
2.2 | | | Teaching methods | 45 |
2.3 | | | Online teaching | 37 |
2.4 | | | Tools and methods | 35 |
3.1 | STEM skills | 190 | Statistics education | 51 |
3.2 | | | Computer science for data science | 42 |
3.3 | | | Cloud computing in data science education | 36 |
3.4 | | | Data engineering | 32 |
3.5 | | | Statistics for data science | 29 |
4.1 | Domain adaptation | 206 | Big data and business analytics | 49 |
4.2 | | | Data science in health | 31 |
4.3 | | | Data science in digital technologies | 29 |
4.4 | | | Data science in biomedical | 27 |
4.5 | | | Data science in application domain (varied) | 25 |
4.6 | | | Data science in education | 23 |
4.7 | | | Data science in nursing | 22 |
5.1 | Social aspects | 193 | Ethics | 62 |
5.2 | | | Data science as a skill | 54 |
5.3 | | | Engagement | 30 |
5.4 | | | Diversity of learners | 26 |
5.5 | | | Enhancing diversity | 21 |
‘Data science curriculum’ is the largest supercluster defined, containing 297 papers out of 1,048. Papers in this supercluster discuss various aspects of data science curricula, including principles of data science curriculum design (Cluster 1.1) and approaches to data science education (Cluster 1.2). While papers in Cluster 1.1 address programs for various formats and diverse populations, special attention is given to three cases: the Introduction to Data Science course (Cluster 1.3), data science for K-12 pupils (Cluster 1.4), and data science programs for data science majors (Cluster 1.5).
The principles of curriculum design (Clusters 1.1 and 1.4) include various formats of data science curriculum frameworks, ranging from one-semester introductory courses to full degrees for diverse populations, for example, school pupils, undergraduates, teachers, graduates, and non-STEM students. The topics discussed in the papers include the required knowledge, competence, and skills, based on the requirements of academia and industry; defining data science literacy, that is, the data science skills required of everyone; balancing computer science and statistics knowledge with application domain knowledge; generating a collaborative and interdisciplinary curriculum, involving different faculties in academia, industry, and community; organizing teaching according to the data science workflow; integrating theory and practice using real-life data; developing lab environments to support working with real-life data; using data from students’ domains; and assessment methods for data science.
Data science education faces many challenges that result from the integration of a variety of application domains and core STEM subjects: mathematics, statistics, and computer science. Consequently, on the one hand, non-STEM students are required to learn STEM subjects on a higher level than customary prior to the data science era, and on the other hand, STEM students are required to pay more attention to the (usually) non-STEM application domains. New teaching methods have, therefore, been developed for teaching data science. Four topics revealed within this supercluster are: teaching AI and machine learning (Cluster 2.1), teaching methods for data science (Cluster 2.2), online teaching (Cluster 2.3), and tools and methods for data science education (Cluster 2.4).
Data science is an integration of computer science, mathematics, statistics, and an application domain. While the papers in Supercluster 4 (domain adaptation) focus on data science education from the application domain’s viewpoint, the papers in this supercluster address data science education from the perspective of the STEM components of data science, that is, computer science and statistics. Clusters 3.1 and 3.5 are connected to statistical education and Clusters 3.2, 3.3, and 3.4 are connected to computer science education. We did not identify a cluster that focuses on mathematical education for data science.
The literature on statistics education discusses the following main topics: integrating data science into statistics curriculum and teaching data science for statisticians, teaching computer science as part of the statistics curriculum, teaching statistics as part of the data science curriculum, and methods of teaching statistics. One topic that recurs in many of these clusters addresses the interrelationship between statistics and data science, including similarities and differences between the two disciplines (see, e.g., MacGillivray, 2021).
Papers in Cluster 3.2 (computer science for data science) discuss the relationship between computer science education and data science education, similar to the focus of Clusters 3.1 and 3.5 on statistics. Papers in this cluster address computer science curriculum for data science, data science curriculum for computer science, and the role of software development in data science. Clusters 3.3 (cloud computing in data science education) and 3.4 (data engineering) focus on more technical aspects of statistics and computer science tools and skills required for data science, such as notebooks, databases, Hadoop, cloud computing, computing environments for data science education, and big data infrastructure. We note that although ‘data engineering’ was not explicitly included in the search terms for data collection (see Section 3.2.2), the semantic nature of the search did retrieve papers on this topic.
Data science has become an essential tool for research and practice in a variety of application domains. Accordingly, this supercluster includes papers regarding the need to educate professionals in these domains as well as methods to embed data science into the curricula of these application domains. Papers in this supercluster address data science education in the following domains: business analytics (Cluster 4.1), health (Cluster 4.2), digital technologies (Cluster 4.3), biomedicine (Cluster 4.4), education (Cluster 4.6), and nursing (Cluster 4.7). Additional domains are aggregated in Cluster 4.5 and include geography, astronomy, hydrology, environmental science, biodiversity, and physics.
This supercluster aggregates different topics that relate to human and social aspects of data science education: ethics (Cluster 5.1), data science as a skill (Cluster 5.2), engagement (Cluster 5.3), and diversity (Clusters 5.4 and 5.5).
Ethics (Cluster 5.1) is a major issue in data science education, and 62 papers were grouped under this topic. The main themes discussed in this cluster are the importance of understanding the human aspects of data and the importance of data for understanding humankind, as well as privacy, fairness, accountability, transparency, ethics in the curriculum, teaching methods for ethics, data science for social good, critical thinking, politics of data, feminist views of data, data and equality, and social and cognitive biases. Several papers use the COVID-19 pandemic as a case study for discussing many of the above-listed issues.
Papers in Cluster 5.2 (data science as a skill) discuss the importance of bringing data science education to everyone, since it is nowadays considered to be an important skill for any profession. Similar to this cluster, papers in Clusters 5.4 and 5.5 discuss the large variety of professionals who should learn data science. Unlike clusters in Supercluster 4 (domain adaptation, that is, integrating data science education into application domain programs), which focus mostly on curriculum aspects, papers in these clusters (5.4 and 5.5) focus on the motivation for teaching data science to different populations. Thus, Cluster 5.4 addresses the diversity of students and Cluster 5.5 focuses on enhancing diversity in data science and its relevance for all.
Papers in Cluster 5.3 discuss several engagement issues in data science education, for example, making programs more interesting for students by incorporating real-life projects and real-life data, and overcoming barriers for data science education, such as the required high level of mathematics, which not all learners are comfortable with.
Data science is an interdisciplinary domain that not only captures the interdisciplinary integration of computer science and statistics, but also encapsulates an integration of technical domains (computer science, statistics) with a real-world application domain (Adhikari & Jordan, 2021). This interdisciplinarity creates special challenges in data science education that are represented by the categories found in the literature review, as described below.
In the curriculum category, many papers discuss the challenge of generating a real interdisciplinary course that is feasible within the time constraints, on the one hand, and includes the broad required knowledge base, on the other. This topic was raised at the very beginning of the discussions regarding data science education. In an interview Robert A. Lue held with Laura Haas and Alfred Hero, co-chairs of the Committee on Envisioning the Data Science Discipline, Haas stated that the new element in data science is data acumen, the skill of selecting the proper data and tools for the problem at hand, and that:
Getting meaningful, correct, and useful answers from data requires skills that are typically not fully developed in traditional mathematics, statistics, and computer science (CS) courses. (Haas et al., 2019)
In the pedagogy category, many papers discuss methods for interdisciplinary teaching. The ACM Data Science Task Force Final Report summarizes this point nicely:
Each component of the data science environment: the domain that provides the data; statistics and mathematics for analysis, modeling, and inference; and computer science for data access, management, protection, as well as effective processing in modern computer architectures, is essential. However, a random collection of the three elements does not constitute a meaningful data science program. Data science is interdisciplinary and requires the effective integration of the three components to produce meaningful results. (Danyluk & Leidig, 2021, p. 10)
The large body of research that discusses domain adaptation and diversity further demonstrates the unique challenge of data science education, that is, to accommodate multiple diverse audiences. This diversity of audiences is reflected both in the level of the learners (from kindergarten to postdoc) and in their backgrounds (from the humanities to the exact sciences and engineering). Alfred Hero also referred to this challenge in the above-mentioned interview:
Other findings [of the report], like the need to focus on data science ethics and on broadening participation in the field (and classroom) may be less familiar. (Haas et al., 2019)
The last category, social aspects, demonstrates unique challenges and opportunities in data science education. For example, since data science is also a research paradigm, it is relevant for many different academic researchers from diverse domains; as a result, it increases inclusion in STEM education and enables diversity in STEM subjects. Thus, according to Laura Haas,
The many types of problems to which data science is applicable will be appealing to diverse students, and this range of educational opportunities will be critical in engaging them, and beneficial to the industry as a whole. (Haas et al., 2019)
Another aspect of this category is data science ethics, which involves two different topics, data ethics and application domain ethics, that are integrated in the context of data science. The remaining category, STEM skills, represents challenges that are common to data science education and to the domains forming data science: mathematics, statistics, computer science, and engineering.
Data science education, therefore, includes unique challenges and opportunities: teaching computer science and statistics within the context of application domains, teaching the importance of domain knowledge to learners who specialize in computer science and statistics, teaching advanced computer science and statistics concepts to domain specialists who lack sufficient mathematical background, teaching research skills to data science learners and practitioners, teaching ethical considerations that derive from both data ethics and application domain ethics, and, finally, leveraging new opportunities in data science education to increase inclusion in STEM education. While most of the research on data science education is published in the context of other domains, such as computer science education and statistics education, many of the challenges and opportunities of data science education are unique to data science and stem from the interdisciplinary nature of data science and the diversity of its learners. More research on data science education should, therefore, be performed in the unique context of data science education.
Data science is emerging from the interdisciplinary integration of mathematics, statistics, computer science, and many other application domains such as business, transportation, biology, and education. In parallel to the emergence of data science, a new field is emerging as well—data science education. This article mapped the growing body of research on data science education and found five major themes: (a) curriculum, (b) pedagogy, (c) STEM skills, (d) domain adaptation, and (e) social aspects.
The curricular aspects of data science education draw most of the research attention, while pedagogical issues receive less (Mike, 2020). Nevertheless, the large body of literature on data science education and the framework of research topics revealed here indicate the birth of a new scientific discipline: data science education.
This research was supported by a VATAT grant to the Technion’s Artificial Intelligence Hub (Tech.AI).
Adhikari, A., & Jordan, M. I. (2021). Interleaving computational and inferential thinking in an undergraduate data science curriculum. Harvard Data Science Review, 3(2). https://doi.org/10.1162/99608f92.cb0fa8d2
Ammar, W., Groeneveld, D., Bhagavatula, C., Beltagy, I., Crawford, M., Downey, D., Dunkelberger, J., Elgohary, A., Feldman, S., Ha, V., Kinney, R., Kohlmeier, S., Lo, K., Murray, T., Ooi, H.-H., Peters, M., Power, J., Skjonsberg, S., Wang, L., Wilhelm, C., Yuan, Z., van Zuylen, M., & Etzioni, O. (2018). Construction of the literature graph in Semantic Scholar. arXiv. https://arxiv.org/abs/1805.02262
Barboza, L., & Teixeira, E. S. (2020). Effect of data science teaching for non-STEM students: A systematic literature review. In L. Lavazza, R. Oberhauser, M. Herwig, & K. Kavi (Eds.), ICSEA 2020: The Fifteenth International Conference on Software Engineering Advances (pp. 118–122). IARIA. https://www.researchgate.net/profile/Luigi-Lavazza/publication/346965175_ICSEA_2020_The_Fifteenth_International_Conference_on_Software_Engineering_Advances/links/5fd4d34045851553a0af3f64/ICSEA-2020-The-Fifteenth-International-Conference-on-Software-Engineering-Advances.pdf#page=129
Biehler, R., & Schulte, C. (2018). Perspectives for an interdisciplinary data science curriculum at German secondary schools. In R. Biehler, L. Budde, D. Frischemeier, B. Heinemann, S. Podworny, C. Schulte, & T. Wassong (Eds.), Paderborn Symposium on Data Science Education at School Level 2017: The collected extended abstracts (pp. 2–14). Universitätsbibliothek Paderborn. https://www.telekom-stiftung.de/sites/default/files/files/PaderbornSymposiumDataScience2017_0.pdf
Cassel, B., & Topi, H. (2016, July 27). Strengthening data science education through collaboration [Workshop report]. National Science Foundation. https://digital.library.villanova.edu/Item/vudl:622682
Cohan, A., Feldman, S., Beltagy, I., Downey, D., & Weld, D. S. (2020). SPECTER: Document-level representation learning using citation-informed transformers. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (pp. 2270–2282). Association for Computational Linguistics. https://arxiv.org/pdf/2004.07180
Danyluk, A., & Leidig, P. (2021). Computing competencies for undergraduate data science curricula. ACM. https://www.acm.org/binaries/content/assets/education/curricula-recommendations/dstf_ccdsc2021.pdf
De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., Bryant, L., Cheng, L. Z., Francis, A., Gould, R., Kim, A. Y., Kretchmar, M., Lu, Q., Moskol, A., Nolan, D., Pelayo, R., Raleigh, S., Sethi, R. J., Sondjaja, M., … Ye, P. (2017). Curriculum guidelines for undergraduate programs in data science. Annual Review of Statistics and Its Application, 4(1), 15–30. https://doi.org/10.1146/annurev-statistics-060116-053930
EDISON. (n.d.). Building the data science profession. Retrieved March 2, 2022, from https://edison-project.eu/
EDISON. (2017, July 3). EDISON Data Science Framework (EDSF) (Release 2). Retrieved March 2, 2022, from https://edison-project.eu/edison/edison-data-science-framework-edsf/
Fayyad, U., & Hamutcu, H. (2020). Toward foundations for data science and analytics: A knowledge framework for professional standards. Harvard Data Science Review, 2(2). https://doi.org/10.1162/99608f92.1a99e67a
Fisher, N., Anand, A., Gould, R., Bailer, J., Hesterberg, T., Bailey, J., Ng, R., Burr, W., Rosenberger, J., Fekete, A., Sheldon, N., Gibbs, A., & Wild, C. (2019, September). Curriculum frameworks for introductory data science. IDSSP. http://www.idssp.org/files/IDSSP_Data_Science_Curriculum_Frameworks_for_Schools_Edition_1.0.pdf
Foster, M., & Tasnim, Z. (2020). Data science and graduate nursing education: A critical literature review. Clinical Nurse Specialist, 34(3), 124–131. https://doi.org/10.1097/nur.0000000000000516
Glaser, B. G., & Strauss, A. L. (2017). Discovery of grounded theory: Strategies for qualitative research. Routledge. https://doi.org/10.4324/9780203793206
Gould, R., Suyen, M.-M., James, M., Terri, J., & LeeAnn, T. (2018). Mobilize: A data science curriculum for 16-year-old students. IASE. https://iase-web.org/icots/10/proceedings/pdfs/ICOTS10_9B1.pdf?1531364299
Gurcan, F., & Cagiltay, N. E. (2019). Big data software engineering: Analysis of knowledge domains and skill sets using LDA-based topic modeling. IEEE Access, 7, 82541–82552. https://doi.org/10.1109/ACCESS.2019.2924075
Gusenbauer, M., & Haddaway, N. R. (2020). Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Research Synthesis Methods, 11(2), 181–217. https://doi.org/10.1002/jrsm.1378
Haas, L., Hero, A., & Lue, R. A. (2019). Highlights of the National Academies Report on “Undergraduate Data Science: Opportunities and Options.” Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.38f16b68
Hamzah, A., Hidayatullah, A., & Persada, A. (2020). Discovering trends of mobile learning research using topic modelling approach. LearnTechLib. https://www.learntechlib.org/p/217838/
Hartigan, J. A. (1975). Clustering algorithms. John Wiley & Sons.
Hazzan, O., & Mike, K. (2021). A journal for interdisciplinary data science education. Communications of the ACM, 64(8), 10–11. https://doi.org/10.1145/3469281
Heinemann, B., Opel, S., Budde, L., Schulte, C., Frischemeier, D., Biehler, R., Podworny, S., & Wassong, T. (2018). Drafting a data science curriculum for secondary schools. In M. Joy & P. Ihantola (Eds.), Proceedings of Koli Calling ’18: 18th Koli Calling International Conference on Computing Education Research (Article 17). ACM. https://doi.org/10.1145/3279720.3279737
Joachims, T. (1996). A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. School of Computer Science, Carnegie Mellon University. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=c52eb66e23b201cb44f567cbb270feadca532c9a
Kitchenham, B. (2004). Procedures for performing systematic reviews. Keele, UK, Keele University, 33(2004), 1–26. https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=29890a936639862f45cb9a987dd599dce9759bf5
Lupusoru, R., Hategan, R. N., & Lungeanua, D. (2021). Approaches and tools for teaching biomedical data science during the COVID-19 pandemic: A systematic literature review. Applied Medical Informatics, 43(Suppl. S1), 36. https://www.proquest.com/openview/bbe1c17b58ed77475bec6fdef07b70ad/1?pq-origsite=gscholar&cbl=54733
MacGillivray, H. (2021). Statistics and data science must speak together. Teaching Statistics, 43(S1), S5–S10. https://doi.org/10.1111/test.12281
Martín-Martín, A., Orduna-Malea, E., Thelwall, M., & López-Cózar, E. D. (2018). Google Scholar, Web of Science, and Scopus: A systematic comparison of citations in 252 subject categories. Journal of Informetrics, 12(4), 1160–1177. https://doi.org/10.1016/j.joi.2018.09.002
Martins, R. M., & Gresse Von Wangenheim, C. (2022). Findings on teaching machine learning in high school: A ten-year systematic literature review. Informatics in Education. https://infedu.vu.lt/journal/INFEDU/article/742/file/pdf
Mike, K. (2020). Data science education: Curriculum and pedagogy. In A. Robins, A. Moskal, & A. J. Ko (Eds.), ICER '20: Proceedings of the 2020 ACM Conference on International Computing Education Research (pp. 324–325). ACM. https://doi.org/10.1145/3372782.3407110
Mike, K., Hazan, T., & Hazzan, O. (2020). Equalizing data science curriculum for computer science pupils. In N. Falkner & O. Seppala (Eds.), Koli Calling ’20: Proceedings of the 20th Koli Calling International Conference on Computing Education Research (Article 20). ACM. https://doi.org/10.1145/3428029.3428045
Mozgai, S., Kaurloto, C., Winn, J., Leeds, A., Heylen, D., Hartholt, A., & Scherer, S. (2022). Machine learning for semi-automated scoping reviews. SSRN. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4218678
National Academies of Sciences, Engineering, and Medicine. (2018). Envisioning the data science discipline: The undergraduate perspective. Interim report. National Academies Press. https://www.nationalacademies.org/our-work/envisioning-the-data-science-discipline-the-undergraduate-perspective
Nightingale, A. (2009). A guide to systematic literature reviews. Surgery (Oxford), 27(9), 381–384. https://doi.org/10.1016/j.mpsur.2009.07.005
Paek, S., & Kim, N. (2021). Analysis of worldwide research trends on the impact of artificial intelligence in education. Sustainability, 13(14), Article 7941. https://www.mdpi.com/2071-1050/13/14/7941
Rosenberg, J. M., & Jones, R. S. (2022). A secret agent? K-12 data science learning through the lens of agency. EdArXiv. https://edarxiv.org/eyzkv/download?format=pdf
Sharma, D., Kumar, B., & Chand, S. (2019). A trend analysis of machine learning research with topic models and Mann-Kendall test. International Journal of Intelligent Systems and Applications, 11(2), 70–82. https://www.mecs-press.org/ijisa/ijisa-v11-n2/IJISA-V11-N2-8.pdf
Shulman, L. S. (1986). Those who understand: Knowledge growth in teaching. Educational Researcher, 15(2), 4–14. https://doi.org/10.2307/1175860
Syakur, M., Khotimah, B., Rochman, E., & Satoto, B. D. (2018). Integration k-means clustering method and elbow method for identification of the best customer profile cluster. IOP Conference Series: Materials Science and Engineering, 336(1), Article 012017. https://doi.org/10.1088/1757-899X/336/1/012017
Tandjung, T. D., & Fudholi, D. H. (2022). Topic modeling with latent-Dirichlet allocation for the discovery of state-of-the-art in research: A literature review. Journal of Harbin Institute of Technology, 54(8), 335–341. http://hebgydxxb.periodicales.com/index.php/JHIT/article/view/1266
Torres-Carrión, P. V., González-González, C. S., Aciar, S., & Rodríguez-Morales, G. (2018). Methodology for systematic literature review applied to engineering and education. In 2018 IEEE Global Engineering Education Conference (EDUCON) (pp. 1364–1373). IEEE. https://ieeexplore.ieee.org/document/8363388
Vayansky, I., & Kumar, S. A. (2020). A review of topic modeling methods. Information Systems, 94, Article 101582. https://doi.org/10.1016/j.is.2020.101582
Wu, H. (2017). Systematic study of data science and analytics programs [Presentation]. 2017 ASEE Annual Conference and Exposition Proceedings. IUPUI ScholarWorks. https://scholarworks.iupui.edu/items/792ef41c-0d22-4694-8d55-500cfd898226
Yamazaki, Y., Suzuki, T., Kumar, A., Siswoyo, A., Reserva, R., Imai, M., Miyashiro, D., & Umemura, K. (2022). An efficient ‘paper mining’ system to search academic papers using SPECTER model. SSRN. https://dx.doi.org/10.2139/ssrn.4191461
Repository | Query |
---|---|
ACM-DL | (Title: "data science" OR Abstract: "data science") AND (Title: education OR Title: curriculum OR Title: pedagogy OR Title: teach OR Abstract: education OR Abstract: curriculum OR Abstract: pedagogy OR Abstract: teach) |
Google Scholar | “data science” education; “data science” curriculum; “data science” pedagogy; “data science” teach |
IEEE Xplore | (("Document Title": "data science") OR ("Abstract": "data science")) AND (("Document Title": education) OR ("Document Title": curriculum) OR ("Document Title": pedagogy) OR ("Document Title": teach) OR ("Abstract": education) OR ("Abstract": curriculum) OR ("Abstract": pedagogy) OR ("Abstract": teach)) |
Semantic Scholar | data science education; data science curriculum; data science pedagogy; data science teaching |
Web of Science | (TI="data science" AND (TI=education OR TI=curriculum OR TI=pedagogy OR TI=teach)) OR (AB="data science" AND (AB=education OR AB=curriculum OR AB=pedagogy OR AB=teach)) |
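
The keyword queries in the table share one pattern: the phrase "data science" combined with any of four education-related terms, applied once to the title field and once to the abstract field. As a minimal sketch of that pattern, the snippet below assembles the Web of Science query string from its parts. The helper names `field_query` and `build_query` are hypothetical and introduced here for illustration only; the field tags (`TI`, `AB`) and the terms are copied from the table, and nothing here is claimed to be the tooling actually used for the review.

```python
from typing import Sequence


def field_query(field: str, phrase: str, terms: Sequence[str]) -> str:
    """Build one field-restricted clause: the quoted phrase AND any one of the terms."""
    any_term = " OR ".join(f"{field}={term}" for term in terms)
    return f'({field}="{phrase}" AND ({any_term}))'


def build_query(fields: Sequence[str], phrase: str, terms: Sequence[str]) -> str:
    """OR together one clause per field tag (here TI for title, AB for abstract)."""
    return " OR ".join(field_query(field, phrase, terms) for field in fields)


# Terms and field tags as they appear in the Web of Science row of the table.
TERMS = ["education", "curriculum", "pedagogy", "teach"]
wos_query = build_query(["TI", "AB"], "data science", TERMS)
print(wos_query)
```

Keeping the term list in one place makes it easier to hold the queries consistent across repositories, although each repository's field syntax (for example, the quoted field names in IEEE Xplore) would still need its own formatting function.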
©2023 Koby Mike, Benny Kimelfeld, and Orit Hazzan. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.