This article explores how to deploy data science and data-driven AI, focusing on the broad collection of considerations beyond those of statistics and machine learning. Building on an analysis rubric introduced in a 2022 textbook, Data Science in Context: Foundations, Challenges, and Opportunities (Spector et al., 2022), this article summarizes some of the book’s key points and adds reflections on AI’s extraordinary growth and societal effects. The article also discusses how to balance inevitable trade-offs and provides further thoughts on societal implications.
Keywords: data science, artificial intelligence, context, data science education, societal concerns, regulation
Norvig, Wiggins, Wing, and I wrote our textbook, Data Science in Context: Foundations, Challenges, and Opportunities (DSiC), to illustrate the breadth of considerations needed to apply data science effectively (Spector et al., 2022). Each of us recognized that although the technical details of statistics and machine learning are central to the field, the many surrounding issues, such as choosing the right objectives and achieving dependability, are often more complex.
DSiC’s centerpiece is a rubric intended to help data science researchers and practitioners evaluate their work comprehensively and make trade-offs among competing goals. To aid in the latter, the book includes a practical ethical framework. It also explains how societal concerns arise from the field's fundamental challenges and makes several policy recommendations. Overall, we aimed at a holistic discussion of data science.
DSiC’s motivations are similar to HDSR’s—that is, “to be at the nexus of computer science, statistics, and related disciplines” (Siliezar, 2021). In this article, we explore this nexus:
Sections 2 to 4 summarize the book's central definition and the Analysis Rubric (including the importance of context), and explore several motivating examples. These sections also contain several updates and new explanations.
Section 5, containing primarily new material, discusses the relationship between data science and artificial intelligence, seeks to explain the significant overlap between the fields, and describes how to apply the Analysis Rubric to AI.
Section 6, containing both summary and new material, describes the challenges in the fields and provides a way of framing the inherent trade-offs arising from the application of data science and AI.
Section 7 updates DSiC's discussion of societal concerns and actions in light of the dramatic growth of generative AI. For other points of view, see the HDSR special issue “Future Shock: Grappling With the Generative AI Revolution.”
Section 8 summarizes and contains closing thoughts particularly relevant to a data science readership.
DSiC defined data science (Excerpt 1) with the explicit intent to capture the field’s two key objectives: to (1) provide insights and (2) draw conclusions—sometimes called ‘making decisions.’ Automated conclusions are part of data science but are the primary focus of the ever-growing data-driven AI community.
Data science is the study of extracting value from data – value in the form of insights or conclusions.
This definition of data science is consistent with others proposed in this journal and elsewhere (Chang & Grady, 2019; Wing, 2019). For three reasons, it also specifies various forms of conclusions:
First, the term ‘conclusions,’ if not defined by example, would be too general to convey meaning to people who do not already understand data science.
Second, the types of conclusions are important and independently studied foci of scholarship and utility.
Third, each item connotes specific challenges, which show the breadth of needed techniques and considerations.
The list of conclusions will almost certainly grow as data science’s capabilities and impacts increase. For example, it may make sense to add Planning—a sequence of actions to move from one state to another—as generative models and reinforcement learning prove useful in this domain. While one might feel a ‘plan’ has prediction, transformation, and optimization characteristics, planning is a field with the same academic credentials and impact as the enumerated others.
From a practical perspective, data science of all forms requires a focus not just on analysis, insight, and conclusions but on all aspects of its application, from gaining data to promulgating results in valuable ways. For example, DSiC states: “Making it easy and natural for users to create data is part of quality data science. This is not just user-interface design – it is more of an application design problem to ensure users want to participate and create quality data” (2022, Section 8.1). Duan et al. (2022) discuss similar issues that arise even with small data.
Checklists are a well-established mechanism for ensuring that individuals and systems address all the technical criteria needed before undertaking an action, whether making a medical intervention, flying a plane, or releasing a product (Gawande, 2009). A rubric is closely related to a checklist, but the term connotes elements that are more general and less amenable to simple ‘yes’/‘no’ answers. Instructors often use rubrics to describe the goals of individual courses or broader fields of study.
DSiC proposes a rubric to help data science practitioners address the breadth of issues needed to achieve good results. As summarized in Excerpt 2, our Analysis Rubric has seven elements that arise from the challenges in real-world applications. It is applicable to both the ‘insight’ and ‘conclusion’ personalities of data science and is also relevant to data-driven AI. Collectively, the rubric elements highlight the complex trade-offs required to achieve practical, valuable, and beneficial results:
Three of the elements relate to technique and/or implementation.
Three relate to requirements—specifying what is needed for a result to be useful.
One is broader and concerns ethical and societal topics.
Analysis Rubric

Implementation-Oriented Elements: (1) Tractable Data; (2) Technical Approach; (3) Dependability

Requirements-Oriented Elements: (4) Understandability; (5) Clear Objectives; (6) Toleration of Failures

Ethical, Legal, and Societal Implications (ELSI) Element: (7) ELSI
While the rubric specifies elements that require attention, it imposes no order in which a data science or AI project must consider them. Individuals and teams using the rubric would typically visit and revisit the various elements over time, mirroring the fact that most engineering design is an iterative combination of top-down and bottom-up processes.
While this article cannot explore each element of the Analysis Rubric in depth, it can exemplify their subtleties by discussing one of them: the topic of privacy. Beyond the immediate recognition that ‘Of course, privacy is a crucial concern,’ the rubric sub-element forces us to consider what the term really means and all the issues underlying it.
Most obviously, privacy discussions refer to policies and related techniques for safeguarding the confidentiality of personal data. In addition to addressing the benefits of confidentiality, the topic also requires us to consider the direct and indirect costs of protecting data; such protections may, for example, slow progress in medical research.
However, topics in privacy go beyond confidentiality and must address the use of personal data in confidentiality-preserving ways. For example, can or should data be used to enable a system to make personalized recommendations to an individual, some of which may seem beneficial and others manipulative? Should an individual’s data enable a system to learn aggregate lessons (with the proviso that individual confidentiality is guaranteed)—perhaps ones that are beneficial to science or public policy or alternatively ones that might further an agenda the individual does not like? This rubric argues for explicit consideration of these issues, as well as the reality that individuals may desire the release of their private information in ways under their control.
In aggregate, the privacy element catalyzes the analysis and proper conclusions that balance individual and societal preferences, the value that data can provide, and the trade-offs with other rubric concerns, such as security. DSiC organizes this discussion of privacy centering on five actions: (1) collection of data, (2) storage of data, (3) maintenance of confidentiality, (4) uses of an individual’s data solely for that individual (personalization or targeting) without any dissemination, and (5) permitted use of anonymized information to gain aggregate knowledge (DSiC, 2022, Section 10.1). I acknowledge that some might include other topics under privacy (e.g., how profit on data might accrue) that DSiC includes under other rubric items (e.g., ELSI).
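To make the fifth action concrete, here is a minimal sketch (my illustration, not code from DSiC) of one standard confidentiality-preserving way to gain aggregate knowledge: an ε-differentially private count computed with the Laplace mechanism. The records, predicate, and choice of ε are hypothetical.

```python
import numpy as np

def dp_count(records, predicate, epsilon, rng=None):
    """Return a differentially private count of records satisfying `predicate`.

    Adding or removing one individual's record changes the true count by at
    most 1 (sensitivity = 1), so Laplace noise with scale 1/epsilon yields
    epsilon-differential privacy for this single query.
    """
    rng = rng or np.random.default_rng()
    true_count = sum(1 for r in records if predicate(r))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical records: how many patients are over 65?
patients = [{"age": 71}, {"age": 34}, {"age": 68}, {"age": 59}]
print(dp_count(patients, lambda r: r["age"] > 65, epsilon=0.5))  # near 2, randomized
```

The parameter ε makes the trade-off explicit: smaller values protect individuals more strongly but add more noise to the aggregate answer, which is one reason such techniques fit some domains better than others.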
The previous discussion, while summarizing just one small part of the rubric, provides a flavor of the topics that it encourages researchers and practitioners to consider. It also illustrates that context is essential in applying data science and AI, just as it is in most fields: for example, from political philosophy (e.g., in jurisprudence) to engineering (e.g., in specifying system reliability requirements) to statistics (e.g., in Bayesian statistics, information portrayal, and, importantly, education).
More specifically, one can evaluate the rubric elements only in reference to enveloping technical, humanistic, ethical, political, and economic considerations. For example, concerning privacy, any evaluation must contextually consider the individual, society, time period, specific data, proposed use, and more. Context is so vital that DSiC includes the word in its title to emphasize the importance of four contextual dimensions affecting the practice of the field:
Data science, and indeed AI, must consider the context in which the data and technical models operate. Examples: What requirements are there for dependability? What is the need for interpretability or demonstration of causality? What are the precise objectives to be met?
The use of context emphasizes that requirements are often specific to a particular application, domain, or use. In some cases, there is a societal necessity to minimize certain failures and regulate operations, but there may be no concerns for benign uses. The application context also limits what approaches will work. For example, the confidentiality-enhancing techniques of differential privacy can be terrific in some domains but problematic in others (Gong et al., 2022). Tufte (2001, p. 77) highlights the importance of context in his sixth principle.
The use of context refers to the societal views and norms about the desired results or objectives. In particular, what constitutes the ‘right’ result will be subject to great debate now that practitioners are applying data science and AI to some of the world’s most challenging problems. What are deemed the best solutions will vary in different circumstances, societies, and so on.
Finally, context is temporally dependent and fluid. Requirements change for many reasons: technology’s breadth of impact may grow, and new knowledge may amplify our understanding of its effects. Public and political views often change with little rationale, and practitioners must accommodate these shifts.
A rubric is more comprehensible when explained by example. DSiC has dozens of examples from the domains of Transport & Mapping, Web & Entertainment, Medicine & Health, Science, Financial Services and Society, and Politics and Government, and it could have included even more. This summary reprises three from DSiC (2022, Chapter 6): one where data science is easy to apply and two where it is hard. (See Table 1.)
The first rubric example addresses the indication of traffic speed on maps—a highly used and well-functioning application of aggregated, personal location data. However, even in this natural application, there are subtleties: Is the objective to show traffic at present or when a driver is likely to arrive at the region? Is the application sufficiently resilient to show road closures rather than roads without slow vehicles? What about the impact of a systemwide outage on a population that has become dependent?
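As a toy illustration of the aggregation at this example’s core (a sketch under assumptions of my own, not a description of any production mapping system), the following groups fresh speed reports by road segment and publishes a median only when enough distinct vehicles report:

```python
from collections import defaultdict
from statistics import median

def segment_speeds(reports, now, window_s=300, min_vehicles=5):
    """Aggregate (segment_id, vehicle_id, timestamp, speed_kph) reports into
    per-segment median speeds, keeping only fresh reports and suppressing
    segments with too few distinct vehicles (for robustness and privacy)."""
    by_segment = defaultdict(dict)  # segment -> {vehicle: reported speed}
    for seg, vehicle, ts, kph in reports:
        if now - ts <= window_s:
            by_segment[seg][vehicle] = kph  # later report wins if time-ordered
    return {
        seg: median(speeds.values())
        for seg, speeds in by_segment.items()
        if len(speeds) >= min_vehicles
    }
```

Even this toy exposes the subtleties above: a closed road generates no reports at all, which the code cannot distinguish from a road with merely too few vehicles.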
The second example, applying data to understand the cause of disease, is far more difficult. Among the challenges is the combinatorial complexity due to the many possible causes that may operate individually or jointly. Notably, the word ‘cause’ is in the description, and that is frequently a tell-tale sign that data science may not be the complete answer, as it is much easier to merely show correlations. However, advances in causal inference are making it easier to demonstrate causality from observational data (Rosenbaum, 2023), and statistical techniques are indispensable in properly designing experimental methods. Beyond causality, this example has additional challenges in each rubric element.
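A small simulation shows why ‘cause’ is such a tell-tale sign. In this invented example, age confounds an exposure–disease relationship: a naive comparison suggests a large effect, while stratifying on the measured confounder (a simple backdoor adjustment) reveals none. Real problems are far harder, not least because confounders are rarely all known and measured.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
age_old = rng.random(n) < 0.5                              # confounder
exposure = rng.random(n) < np.where(age_old, 0.7, 0.2)     # age drives exposure
disease = rng.random(n) < np.where(age_old, 0.30, 0.05)    # age alone drives disease

def rate(mask):
    return disease[mask].mean()

print("naive difference:", rate(exposure) - rate(~exposure))  # looks like an effect

# Backdoor adjustment: compare within age strata, then weight by stratum size.
adjusted = sum(
    (rate(exposure & stratum) - rate(~exposure & stratum)) * stratum.mean()
    for stratum in (age_old, ~age_old)
)
print("age-adjusted difference:", adjusted)  # near zero: no causal effect here
```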
The third example concerns autonomous driving in automobiles, a topic of increased attention now that autonomous taxis are both appearing and being withdrawn. Against the rubric, we can ask whether current techniques can generalize from many training examples to react with sufficient resilience to rarely occurring situations (Chen et al., 2023). There are many additional complex issues, such as whether even the proper driving speed is sufficiently clear, given the conflict between prevailing traffic flow, speed limits, and passenger desire. Certainly, we can question how dependable is dependable enough.
Inevitably, these are but sketches of the rubric's detailed application, and the reader is encouraged to consider additional details relevant to each example.
Data science and AI have much in common, particularly as AI has increasingly adopted data-driven methods and moved from being a research endeavor to having a vast practical and societal impact. There is a growing overlap in both fields’ communities of practice and a need for each to learn from the other. Given AI’s extraordinary growth, understanding the relationship between the fields is increasingly important.
In some ways, data science is a broader term than AI, because it aims to provide both insight and (automated) conclusions. At the same time, AI is less concerned with generating statistics, showing visualizations, or testing hypotheses. On the other hand, AI is broader because its trajectory is influenced by the 1956 Dartmouth Conference’s agenda, which addressed how computers could perform tasks characteristic of human intelligence (McCarthy et al., 1955). Moreover, while data-driven techniques are dominant today, AI uses other techniques (e.g., control theory in robotics) and could possibly have a resurgence of non-data-driven methods.
While there are other Venn diagrams illustrating AI and data science (Coursera, 2023; Fayyad & Hamutcu, 2020), Figure 1 graphically shows the essential relationships between AI, machine learning, and data science in keeping with this section’s topic.
Machine learning (top center) has become central to both fields as a core method and a research subject, though both fields also use other techniques. The small slivers on the edges of the machine learning circle emphasize that AI and data science may sometimes focus on different aspects of machine learning.
The AI (left circle) unshared area includes topics such as symbolic logic, knowledge representation and reasoning, theorem proving, specific natural language processing techniques, alignment, robotic sensing, manipulation, and related algorithms.
The data science (right circle) unshared area includes topics in statistics, operations research, visualization, and more.
The overlapping area between data science and AI without machine learning (bottom overlap) is considerable and has grown as both fields solve more complex problems with increasing scale and impact. For example, as we apply both fields to further policy, economics, governance, health care, education, and more, they must address many of Section 3’s Analysis Rubric topics, for example, privacy, security, complexity in setting objectives, and so on.
The significant overlap between AI and data science implies that many data science materials and practices are relevant to AI. That applies to this journal as well as to DSiC’s Analysis Rubric and related technical material. Of course, it also goes the other way: machine learning and AI provide ever more important tools in statistics and data science. More generally, this raises a difficult question: How should academic programs, industrial departments, and journals currently labeled as either AI or data science align themselves? The word ‘align’ is abstract, hiding within it many possible approaches.
As an indication of significant commonality, Section 3’s Analysis Rubric would seem applicable to the growing uses of AI as well as data science. Table 2 adds three examples to those of Section 4 to illustrate this:
In the first example, the many large language models and chatbots, such as ChatGPT, Claude, or Gemini, have considerable training data, and transformer-based models provide a remarkably flexible and fluid approach to conversational interaction. However, the rubric highlights technical challenges such as hallucinations, difficulties in resilience (e.g., jailbreaking), problems in understandability (e.g., proper citations), and the anticipation of societal impacts.
The second example concerns how well data science can provide insight from the frequency of extreme weather events over time, given typical weather variability and limited or incomparable data. As examples, we would like to know if there is more variance in regional temperatures or if the number (or severity) of regional cyclones, droughts, and/or extreme rainfall events is increasing. Applying data science techniques to show statistically significant evidence of change is far more constructive than sensational news headlines. This is a difficult problem despite many statistical frameworks and considerable data (Knutson, 2022; Mudelsee, 2020; Seneviratne et al., 2012, Section 3.2.1). Of course, it is even more difficult to show causality (the relative causal impact of greenhouse gasses, deforestation, urbanization, or reduced SO2).
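To sketch what ‘statistically significant evidence of change’ might mean here, the following permutation test (a deliberately assumption-light illustration; the annual counts are invented) asks whether event counts trend upward more strongly than chance would allow:

```python
import numpy as np

def trend_pvalue(counts, n_perm=20_000, seed=0):
    """Two-sided permutation test for a trend in annual event counts.

    Statistic: Pearson correlation between year index and count. Under the
    null hypothesis of no trend (exchangeable years), shuffling the counts
    across years gives the statistic's null distribution."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    years = np.arange(len(counts), dtype=float)
    observed = np.corrcoef(years, counts)[0, 1]
    null = np.array([
        np.corrcoef(years, rng.permutation(counts))[0, 1]
        for _ in range(n_perm)
    ])
    return (np.abs(null) >= abs(observed)).mean()

# Invented 20-year record of annual extreme-event counts:
counts = [3, 2, 4, 3, 5, 4, 4, 6, 5, 7, 5, 6, 8, 7, 7, 9, 8, 10, 9, 11]
print(f"p = {trend_pvalue(counts):.4f}")
```

A real analysis must go further, for example by handling autocorrelation (which breaks the exchangeability this test assumes), changing observation quality over time, and pooling across regions.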
The final example relates to generative art, in which systems trained on a vast quantity of existing visuals and prose use generative models to create new results from text, images, and more (Gozalo-Brizuela & Garrido-Merchan, 2023). Is it possible to specify objectives with sufficient precision? Is the dependability high enough, particularly given fears over harmful results? Do current intellectual property regimes encourage artistic creativity while adequately accounting for the intellectual property value of the data (e.g., an artist’s style) on which systems are trained (Appel et al., 2023)?
In summary, the DSiC Analysis Rubric’s elements of (1) Tractable Data, (2) Technical Approach, (3) Dependability, (4) Understandability, (5) Clear Objectives, (6) Toleration of Failures, and (7) ELSI issues remain omnipresent in almost all discussions of applications of data science and AI.
The challenges suggested by the rubric elements range from obvious to subtle and from easily met to extremely hard. In the latter category, some exhibit great technical difficulty, while others have very difficult-to-agree-on, even contentious, objectives.
Here is a small sampling of the challenges:
The modeling and statistical complexity of inventing or choosing good models. Topics include handling nonstationarity, adjusting for data biases and anomalies, inductive bias, and many more (DSiC, 2022, Chapter 9).
The computing challenges associated with storing data (DSiC, 2022, Chapter 8), training and executing models at scale (DSiC, 2022, Section 9.3), and meeting various dependability requirements (DSiC, 2022, Chapter 10) are sometimes enormous. Among many others, these include classic problems relating to the proper operation of mission-critical systems and the difficulty of creating resilient systems from the composition and interoperation of models that are only stochastically valid.
The challenges of understandability (why/how, causality, or reproducibility) and how best to meet them. Understandability is an issue not only for practitioners but for the public, which is deluged with both raw data and aggregate analyses, many of which are misleading (DSiC, 2022, Chapter 10).
Determining clear and beneficial objectives. Objectives are often in conflict (e.g., due to the needs of different constituencies) and may vary widely in different geographical and temporal contexts. Consequences may not occur for a while and are therefore hard to predict and measure; often, there is no consensus on balancing near-term and long-term results. Research and discussion of topics such as fairness (addressed by both the tractable data and clear objectives rubric elements) or human manipulation are examples of challenges (DSiC, 2022, Chapter 12).
The relationship of particular data science applications to societal laws, norms, and long-term objectives (DSiC, 2022, Chapter 14).
For a comprehensive discussion of many of these topics, see also Perlmutter et al. (2024).
The rapid technical progress and opportunities provided by transformers and generative models intensify the challenges posed by each of the items above. These models have great utility, yet we still lack clarity on why they work and how they fail. Increased emphasis on them is certainly warranted, as they have become perhaps the most well-advertised use of data-driven methods at extreme scale. Specifically, discussions of technical approaches to modeling should now address the benefits and pitfalls of prompt engineering, few-shot learning, and fine-tuning.
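The mechanics of few-shot prompting are simple, even if the reasons it works are not. Here is a minimal, API-agnostic sketch (the task, labels, and reviews are invented) that assembles an instruction, labeled exemplars, and a new input for a model to complete:

```python
def few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt: an instruction, labeled input/output
    exemplars, and a new input left for the model to complete."""
    parts = [instruction, ""]
    for inp, out in examples:
        parts += [f"Input: {inp}", f"Output: {out}", ""]
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)

prompt = few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("The plot dragged and the acting was wooden.", "negative"),
     ("A delightful surprise from start to finish.", "positive")],
    "Great visuals, but I left at intermission.",
)
print(prompt)  # send to whatever model endpoint is available
```

Fine-tuning, by contrast, updates model weights on such exemplars rather than placing them in the prompt, and prompt engineering iterates on the instruction itself.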
Analyzing the fields’ challenges makes it clear that achieving good results with data science and AI usually requires trade-offs. For example, privacy and security can be at odds: decisions to protect an individual or an institution’s privacy can make it more challenging to create secure applications and hinder law enforcement. Reflecting on the experiences with COVID-19, Leslie (2020) emphasizes the trade-off challenges, for example, writing that “the use of data-driven technologies may advance the public interest only at the cost of safeguarding certain dimensions of privacy and autonomy.”
As discussed above, objectives may also be challenging to agree on for other reasons: For example, it is difficult to distinguish between a helpful recommendation and a manipulative one. Similarly challenging is the balance between portraying historically representative images vs. ones perpetuating stereotypes. Fairness may mean different things to different people, and ostensibly reasonable objectives may provably conflict (Kleinberg et al., 2017).
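A toy calculation illustrates the Kleinberg et al. (2017) result. Below, a single, perfectly calibrated score is applied to two hypothetical groups whose score distributions (and hence base rates) differ; calibration then forces the groups’ false positive rates apart, so two ostensibly reasonable fairness criteria cannot hold simultaneously. All numbers are invented for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

def fpr(score_probs, n=300_000, threshold=0.5):
    """Simulate calibrated (score, outcome) pairs, outcome ~ Bernoulli(score),
    and return the false positive rate of the rule 'flag if score > threshold'."""
    values = np.array(list(score_probs))
    scores = rng.choice(values, size=n, p=list(score_probs.values()))
    outcomes = rng.random(n) < scores          # calibrated by construction
    flagged = scores > threshold
    negatives = ~outcomes
    return (flagged & negatives).sum() / negatives.sum()

# The same calibrated score, but different score distributions per group:
group_a = {0.2: 0.8, 0.8: 0.2}   # base rate 0.32
group_b = {0.2: 0.4, 0.8: 0.6}   # base rate 0.56
print("FPR, group A:", round(fpr(group_a), 3))   # ~0.06
print("FPR, group B:", round(fpr(group_b), 3))   # ~0.27
```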
DSiC emphasizes two principles to help achieve the best possible balance:
Integrity. “The clearest obligation of data scientists and their organizations is adherence to professional codes of ethics covering truthfulness, integrity, and similar issues” (DSiC, 2022, Section 14.3). In a more recent paper, I have written that “any discussion of trade-offs must start with integrity, which is the foundation (in computer science terminology, the ‘secure kernel’) for the proper conduct of all science and engineering endeavors. In data science and artificial intelligence, this has become ever more apparent as we confront the visible risks arising from the misuse and misrepresentation of data and extending to computing’s broader impacts. As professionals, we must disclose the limits of our art, practice lawful behavior, always tell the truth, and not misrepresent our conclusions or capabilities” (Spector, 2024, p. 36).
The Belmont Principles. These principles, established by the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research (1978), were initially developed to help govern human subject research. They have even more general value as a framework for balancing objectives. Belmont’s foundational framework—Respect for Persons, Beneficence, and Justice—“(1) reminds data scientists to think about difficult challenges, (2) acts as a check on significant errors, and (3) motivates practical improvements” (DSiC, 2022, Section 7.5).
Experience since the book’s publication has led me to generalize from the Analysis Rubric and the above two points to propose a prescriptive Three-Part Framework for making good decisions (Spector, 2024). Specifically, the framework argues that all should:
Attend to field-specific needs inherent in data-driven approaches, as in the Analysis Rubric. Just as engineers and scientists must consider corrosion (and other topics) in structural engineering, metal fatigue in aeronautical engineering, or drug interactions and abuse in pharmacology, experts in data science and AI must pay due attention to each element of the topics in the Analysis Rubric.
Be scrupulous about integrity, as discussed above. There are many statistical and computational pitfalls to avoid, and all must be aware of the many temptations that may cause us to take shortcuts or oversell results.
Recognize that there are still complex trade-offs, particularly since we are dealing with many so-called wicked problems (Churchman, 1967), which have no exact or agreed-upon solution and which may differentially impact segments of society. Studying the Belmont Principles and applied ethics is a good starting point, but resolving our challenges requires a much broader knowledge base. More today than ever before, we need a background in the liberal arts (economics, political science, philosophy, history, literature, and more) to understand the implications of the powerful tools provided by data science and AI. In particular, ethics alone is not enough.
Arguments in favor of Part 3 arise from the pragmatic need to make good decisions. This requires leveraging history and fiction (to increase our understanding of possibilities and near- and long-term impacts), economic analyses, the study of human societies and relations, varied philosophical points of view, and more. However, Part 3 should also enable more creative decisions: Brodhead wrote, “the dynamism of a culture of innovation has come in large part from our distinctive liberal arts tradition, in which students are exposed to many different forms of knowledge and analysis, laying down a mental reservoir that can be drawn on in ever-changing ways to deal with the unforeseeable new challenges” (2023, p. 25). Plenty of such unforeseeable challenges are brewing as data science and AI progress.
From a practical perspective, data science and AI experts are directly responsible for the practice and education of Parts 1 and 2. However, Part 3 requires us to collaborate with professionals across the liberal arts to educate students and achieve the best possible results. We must recognize that we cannot expect to be expert in everything.
This discussion relates to the deep questions posed by recent HDSR articles: “Data Scientists Should Be Value-Driven, Not Neutral” (Clement-Jones, 2021), “Why the Data Revolution Needs Qualitative Thinking” (Tanweer et al., 2021), and “AI (Agnostic and Independent) Data Science: Is It Possible?” (Meng, 2023). The Three-Part Framework attempts a practical division between the agnostic and the value-driven.
As governed by this framework’s first and second elements, the proper practice of data science is agnostic or neutral, at least if we can assume a universal agreement on the need for integrity. As educators, researchers, or professionals, we can teach, study, and implement the constituent components of data science and artificial intelligence. For example, we can teach and implement many data-driven approaches to recommendation systems or chat responses, and we can further characterize each by the results they achieve and their implications on privacy or resilience. Similarly, we can sometimes use statistical techniques to show the uncertainty underlying our conclusions. Whatever specific trade-offs are then made, they should at least be on the Pareto frontier (“Pareto Front,” 2023), but the specific choices will often be value-driven—some as simply stated as profit vs. risk, but often far more complex.
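As a concrete (and purely illustrative) rendering of the Pareto-frontier point, the sketch below keeps only the undominated candidates among hypothetical designs scored on two competing objectives; choosing among the survivors is then the value-driven step:

```python
def pareto_frontier(options):
    """Return the options not dominated by any other option. Each option maps
    a name to a tuple of scores where larger is better, e.g., (privacy, accuracy).
    Option a dominates b if a is at least as good on every score and strictly
    better on at least one."""
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return {
        name: scores
        for name, scores in options.items()
        if not any(dominates(other, scores) for other in options.values())
    }

# Hypothetical designs scored on (privacy, accuracy):
designs = {"A": (0.9, 0.60), "B": (0.5, 0.85), "C": (0.4, 0.80), "D": (0.7, 0.75)}
print(pareto_frontier(designs))  # C is dominated by B; A, B, and D remain
```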
This gets us to the framework’s third element, which is decidedly not neutral. Trade-offs will depend on the contexts listed earlier in this article: the specific application, societal norms, and the times. Decisions will vary regarding how much privacy, optimization, or statistical risk is appropriate. The framework does not take a stand on what balance is right but instead argues that decision-makers will not be solely data scientists or experts in AI but rather a coalition inevitably reflecting differing values.
Thus, this framework admits the existence of fundamental differences while minimizing the likelihood that data science and artificial intelligence will fragment into separate fields based on differing societal norms.
The Analysis Rubric illustrates many root causes of societies’ concerns with AI and data science. However, societal concerns often arise from mixing rubric elements and manifest themselves in other terminology (see Table 3).
Here are explanations of some of the elements of the table:
There is growing concern over the trustworthiness of data science applications, particularly generative AI. Their ubiquity and increasing importance argue that they should clearly advertise their functions and objectives and be able to explain how they reach their conclusions. On the other hand, as written above, objectives are often difficult to agree upon, specify, and implement, and explainability is a challenging technical problem. For better or worse, precise specification opens providers up to legal challenges and litigation should their applications not work as advertised.
Many of these issues are giving rise to growing calls for AI regulation. The contextual view this article advocates argues strongly not for broad regulation of AI technology but for (sectoral) regulation of its applications. As examples, autonomous vehicles would be regulated primarily within the regime of motor vehicles, whereas AI-generated prose or imagery would be regulated, in part, under libel, property rights, fraud, obscenity, and related laws. Consistent with this view, a recent MIT policy brief states, “The development, sale, and use of AI systems should, whenever possible, be governed by the same standards and procedures as humans” (Huttenlocher et al., 2023).
Generative AI has vastly increased power and cooling requirements over previous approaches, increasing environmental concerns, though near-term efficiency optimizations are likely due to better modeling techniques and new hardware architectures.
Concerns have increased over the potential impact on employment and over benefits that accrue to some groups and not others. Some worry that the new data-driven technologies will result in long-term unemployment. However, these technologies could beneficially counter demographic declines in the workforce or catalyze a productivity bonanza that meets the long tail of human demand.
Many are concerned with the ability of data-driven technologies to create content. Some might consider this content as impinging on the intellectual property of others. The content might also be highly manipulative in ways that deceive individuals and societies. While the ability of data to misinform is not new (Huff, 1993), today’s scale may well be different and create new cat-and-mouse games with implications for criminal behavior, politics, and more. There is now a vast literature on this; two surveys are Manheim and Kaplan (2018) and Perrault and Clark (2024).
Finally, the impact of AI on nation-state governance, sovereignty, and security is receiving much more attention (Kissinger et al., 2021). This has already catalyzed new regulations and economic policies. However, AI’s importance also amplifies the need for nation-states to innovate rapidly, which conversely tempers the pace of regulation (Larsen, 2022).
The recent concerns arising from generative AI are prominent in comments from signatories of the March 2023 “AI Pause Petition” (Bengio et al., 2023). Struckman and Kupiec (2023) interviewed some of them and concluded that most interviewees did not expect an actual pause to occur, perhaps because research and development are proceeding on a broad, global front. However, the signatories were undoubtedly concerned about a lack of focus on AI safety and AI’s effect on employment and misinformation.
Data Science in Context makes several recommendations to address societal concerns. In this article, I update the discussion of two of them because of the increasing societal discourse surrounding them.
The first recommendation is to ‘Broaden educational opportunities in data science.’ This seemed evident given the need for both practitioner and public understanding of the power and pitfalls of data and data-driven solutions. If anything, the growth in the value and challenges associated with generative AI makes this recommendation even more critical today, even if a more contemporary recommendation might now be phrased to ‘Broaden educational opportunities in data science and AI.’
In addition to discussing postsecondary education, DSiC addressed the issue of data science education in K-12—a complex problem for many reasons, but mainly because it is hard to fit new material into an already crowded curriculum. One promising approach is to infuse statistics, visualization, and at least simple programming into other classes (e.g., biology or civics) with likely bidirectional benefits to those classes and to teaching data science.
More controversially, DSiC also recommends that “many students should take a specific, rigorous data science class, which should replace calculus (despite its preeminence as one of the most beautiful modeling tools) or possibly some other parts of the high school mathematics curriculum” (DSiC, 2022, Section 6.1.1). Such a class would presumably include an introduction to models; some introductory probability and statistics; experience with easy-to-use programming tools; and case studies—both to enable hands-on learning and to educate students on the challenges in gaining understanding from data. New AI-enabled programming techniques and tools may contribute to both pedagogy and AI education itself. Experts have much more to add; for example, see National Academies of Sciences, Engineering, and Medicine (2023) as a starting point.
There have been disagreements on replacing calculus with such a course for at least two reasons: (1) calculus and other parts of the traditional high school math curriculum are needed for further advancement in some college programs; (2) students will be harmed without the rigor of mathematical discipline afforded by established courses. Countering the first objection, students likely to need calculus later (e.g., students who plan degrees in many STEM fields) could continue in the current curriculum and study data science in college. Countering the second, there is every reason to believe that introductory data science courses can be both rigorous and engaging. Thus, despite the give-and-take, many students will gain more from a better understanding of statistics, probability, and computational thinking than from high school’s more traditional upper-math courses. Further, they will be better prepared for some college degrees—and life. See Levitt and Conrad’s discussions for more (Conrad, 2023; Levitt, 2023). The reader can gain even more perspective on the challenges raised by this topic by noting that discussions on replacing calculus go back many decades; see, for example, Browne (1978).
With all the headlines concerning the regulation of AI, DSiC’s recommendation to “Regulate uses, not technology” warrants discussion, yet still seems primarily correct despite the rapidity and breadth of AI’s most recent impacts. For example, while hallucinations in generative AI are not desirable, their effects in decision-making systems powering autonomous driving are of a different magnitude than they are in a creative writing assistant. Thus, it makes sense that the enormous body of existing automotive regulations and liability laws will tightly regulate driving applications. On the other hand, any regulation of writing assistants would occur under a very different regime, given complex protections on freedom of speech, the press, and so on.
Despite the admonition to regulate data science and AI in application-specific ways, there may be some situations where a system’s internal use of artificial, rather than human, intelligence might need to factor into that regulation. This may be because of the economic impact of a discontinuous decrease in the cost of doing something. For example, courts may rule that machine learning cannot learn an artist’s style, but that it is okay for humans to do so, because automated systems can learn and duplicate that style so cheaply. Also, societies are likely to apply higher standards (e.g., for driving safety) to automated systems than to human ones. There will also need to be clarity on where liability for harm rests: with the creator of an AI application, the providers of the data on which it is trained, its user, and so on.
As a final note on this topic, the European Union has proposed the use of auditing to enable regulators to evaluate data-driven systems, and its Digital Services Act requires compliance from service providers with more than 45 million EU users (European Commission, 2023). While this might seem beneficial initially, it has many risks: (1) the EU act applies only to large firms, so it could encourage the shift of problematic behavior to smaller ones; (2) auditing creates security risks on potentially vast amounts of collected data; (3) the EU act also adds significant complexity to systems, particularly if the law is expanded to smaller companies, as might well be needed; and (4) auditing opens AI and data science applications to regulatory overreach, including post facto rule-making and political manipulation. At a minimum, care is needed before rushing to create additional audit regulations.
This article has defined data science, respecting its twin objectives of providing insight and generating conclusions. Centrally, the article summarized a rubric whose seven elements enumerate the breadth of topics that can help us apply data science effectively. It has discussed the overlap between AI and data science and argued that the rubric also applies to AI. The rubric does not itself resolve all issues (e.g., debates on privacy vs. security or how much resilience is enough), so this article argues that the path forward also requires careful attention to integrity and then a careful analysis informed by many fields of study. Table 3 in the previous section provides a taxonomy of societal concerns and examples of how to address them.
I close with three remaining topics that may be of particular interest to the HDSR readership:
As the field of data science addresses both its more statistical- and artificial intelligence–oriented personalities, those who consider themselves statisticians and data scientists should benefit from the progress in the data-driven AI world, and vice versa. Arbitrary field partitions should not impede education, collaboration, or progress.
Data-driven methods, whether in data science or AI, require broad consideration of everything from creating project objectives (of all forms) to collecting data, developing models, and evaluating the models’ outputs. Our students and practitioners need this breadth, but finding instructors with the requisite background is challenging. I note that our need to educate goes beyond the students who major in data science or related fields; it also includes policymakers, product managers, business majors, and so on. As educators, we need students to have a holistic grounding in the transdisciplinary nature of the field, while understanding that students will inevitably need to specialize in particular aspects of it. Some will focus more on engineering, some on statistics and data analysis, some on other AI techniques, and many on hybridizing these fields and others.
As we wrote in our book, we must be humble and remember that data science does not automatically lead to more understanding. Harvard history professor Jill Lepore has expressed a view that “Data Killed Facts,” implying that data may obscure truth (Lepore, 2018). Instead,
To provide light and not just heat, data scientists need to be very careful to explain data’s provenance, limitations, and relationship to other data, to distinguish between correlation and causation, and to present our insight and conclusions in a way that will not prey upon the cognitive biases of its intended recipients.
We need to address the facts that (1) the sheer availability of data as well as (2) data science and AI’s insights and conclusions may result in misinterpretation or misuse—whether because of carelessness, negligence, or malfeasance. We also must educate the general public to be highly skeptical even when information is purportedly backed up by data. Educating practitioners to abide by the highest standards of integrity and quality is absolutely necessary.
The final essay in Data Science in Context, “Post-modern Prometheus,” suggested that data-driven methods are metaphorically like fire in the sense that both are very useful, yet both have enormous potential for misuse (DSiC, 2022, Section 20.6). The essay contends that if applications of data-driven techniques grow evolutionarily (say, over reasonably long periods), existing societal processes will mitigate the downsides, and benefits will be largely positive. However, if there is discontinuous change, we may need new processes and mechanisms to design our future societies. Such speed-ups and breakthroughs may happen, so coalitions of experts in data science, artificial intelligence, and the liberal arts should form to consider that future.
Chris Wiggins, Julia Lane, Cindy Bloch, and the HDSR reviewers and editors greatly added to this paper. I also acknowledge the enormous contributions to Data Science in Context from coauthors Peter Norvig, Chris Wiggins, and Jeannette Wing.
Alfred Z. Spector has no financial or non-financial disclosures to share for this article.
Appel, G., Neelbauer, J., & Schweidel, D. A. (2023, April 7). Generative AI has an intellectual property problem. Harvard Business Review. https://hbr.org/2023/04/generative-ai-has-an-intellectual-property-problem
Bengio, Y., Russell, S., Musk, E., & Wozniak, S. (2023). Pause giant AI experiments: An open letter. https://futureoflife.org/open-letter/pause-giant-ai-experiments/
Brodhead, R. H. (2023). On the tenth anniversary of The Heart of the Matter. Bulletin of the American Academy of Arts and Sciences, 76(3), 23–29.
Browne, M. W. (1978, April 23). Scientist suggests replacing calculus. The New York Times. https://www.nytimes.com/1978/04/23/archives/scientist-suggests-replacing-calculus-he-asserts-new-methods-based.html
Caldwell, S., Sweetser, P., O’Donnell, N., Knight, M. J., Aitchison, M., Gedeon, T., Johnson, D., Brereton, M., Gallagher, M., & Conroy, D. (2022). An agile new research framework for hybrid human-AI teaming: Trust, transparency, and transferability. ACM Transactions on Interactive Intelligent Systems, 12(3), Article 17. https://dl.acm.org/doi/10.1145/3514257
Chang, W. L., & Grady, N. (2019). NIST Big data interoperability framework: Volume 1, Definitions. https://www.nist.gov/publications/nist-big-data-interoperability-framework-volume-1-definitions?pub_id=918927
Chen, L., Wu, P., Chitta, K., Jaeger, B., Geiger, A., & Li, H. (2023). End-to-end autonomous driving: Challenges and frontiers. ArXiv. https://doi.org/10.48550/arXiv.2306.16927
Churchman, C. (1967). Free for all — Wicked problems. Management Science, 14(4), B141–B146. https://doi.org/10.1287/mnsc.14.4.B141
Clement-Jones, T. F. (2021). Data scientists should be value-driven, not neutral. Harvard Data Science Review, 3(1). https://doi.org/10.1162/99608f92.39876cea
Conrad, B. (2023, October 2). California’s math misadventure is about to go national. The Atlantic. https://www.theatlantic.com/ideas/archive/2023/10/california-math-framework-algebra/675509/
Coursera. (2023, June 16). Data science vs. machine learning: What’s the difference? https://www.coursera.org/articles/data-science-vs-machine-learning
Duan, N., Norman, D., Schmid, C., Sim, I., & Kravitz, R. L. (2022). Personalized data science and personalized (N-of-1) trials: Promising paradigms for individualized health care. Harvard Data Science Review, (Special Issue 3). https://doi.org/10.1162/99608f92.8439a336
European Commission. (2023). The Digital Services Act package. https://digital-strategy.ec.europa.eu/en/policies/digital-services-act-package
Fayyad, U., & Hamutcu, H. (2020). Analytics and data science standardization and assessment framework. Harvard Data Science Review, 2(2). https://doi.org/10.1162/99608f92.1a99e67a
Gawande, A. (2009). The checklist manifesto: How to get things right. Picador.
Gong, R., Groshen, E. L., & Vadhan, S. (2022). Harnessing the known unknowns: Differential privacy and the 2020 Census. Harvard Data Science Review, (Special Issue 2). https://doi.org/10.1162/99608f92.cb06b469
Gozalo-Brizuela, R., & Garrido-Merchan, E. C. (2023). ChatGPT Is not all you need: A state of the art review of large generative AI models. ArXiv. https://doi.org/10.48550/arXiv.2301.04655
Huff, D. (1993). How to lie with statistics. W. W. Norton & Company.
Huttenlocher, D., Ozdaglar, A., & Goldston, D. (2023). A framework for U.S. AI governance: Creating a safe and thriving AI sector. https://computing.mit.edu/wp-content/uploads/2023/11/AIPolicyBrief.pdf
Kissinger, H. A., Schmidt, E., & Huttenlocher, D. (2021). The age of AI: And our human future. Little, Brown and Company.
Kleinberg, J., Mullainathan, S., & Raghavan, M. (2017). Inherent trade-offs in the fair determination of risk scores. In C. H. Papadimitriou (Ed.), 8th Innovations in Theoretical Computer Science Conference (ITCS 2017) (Vol. 67, pp. 43:1–43:23). Leibniz International Proceedings in Informatics (LIPIcs). Schloss Dagstuhl—Leibniz-Zentrum fuer Informatik.
Knutson, C. L. (2022). Can we detect a change in Atlantic hurricanes today due to human-caused climate change? https://www.climate.gov/news-features/blogs/beyond-data/can-we-detect-change-atlantic-hurricanes-today-due-human-caused
Larsen, B. C. (2022). The geopolitics of AI and the rise of digital sovereignty. Brookings Research. https://www.brookings.edu/articles/the-geopolitics-of-ai-and-the-rise-of-digital-sovereignty/
Lepore, J. (2018). MLTalks: How data killed facts. Jill Lepore in Conversation with Andrew Lippman. MIT Media Lab.
Leslie, D. (2020). Tackling COVID-19 through responsible AI innovation: Five steps in the right direction. Harvard Data Science Review, (Special Issue 1). https://doi.org/10.1162/99608f92.4bb9d7a7
Levitt, S. (2023, October 13). Freakonomics author: “Objections to data science in K-12 education make no sense.” Fortune Magazine. https://fortune.com/2023/10/13/freakonomics-author-objections-data-science-k-12-education-math-tech-steven-levitt/
Manheim, K., & Kaplan, L. (2018). Artificial intelligence: Risks to privacy and democracy. Yale Journal of Law and Technology, 21, 106–188. https://yjolt.org/artificial-intelligence-risks-privacy-and-democracy
McCarthy, J., Minsky, M. L., Rochester, N., & Shannon, C. E. (1955). A proposal for the Dartmouth Summer Research Project on Artificial Intelligence. https://raysolomonoff.com/dartmouth/boxa/dart564props.pdf
Meng, X. L. (2023). AI (agnostic and independent) data science: Is it possible? Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.8a5f2975
Mudelsee, M. (2020). Statistical analysis of climate extremes. Cambridge University Press.
National Academies of Sciences, Engineering, and Medicine. (2023). Foundations of data science for students in grades K-12: Proceedings of a workshop. https://nap.nationalacademies.org/catalog/26852/foundations-of-data-science-for-students-in-grades-k-12
National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. (1978). The Belmont report: Ethical principles and guidelines for the protection of human subjects of research. Department of Health, Education, and Welfare.
Pareto front. (2023, November 8). In Wikipedia. https://en.wikipedia.org/w/index.php?title=Pareto_front&oldid=1184190980
Perlmutter, S., Campbell, J., & MacCoun, R. (2024). Third millennium thinking: Creating sense in a world of nonsense. Little, Brown.
Perrault, R., & Clark, J. (2024). Artificial intelligence index report 2024. https://policycommons.net/artifacts/12089781/hai_ai-index-report-2024/12983534/
Rosenbaum, P. R. (2023). Causal inference. MIT Press.
Seneviratne, S., Nicholls, N., Easterling, D., Goodess, C., Kanae, S., Kossin, J., Luo, Y., Marengo, J. A., McInnes, K. L., Rahimi, M., Reichstein, M., Sorteberg, A., Vera, C. S., Alexander, L. V., Allen, S. J., Benito, G., & Cavazos, T. (2012). Changes in climate extremes and their impacts on the natural physical environment. Columbia University. https://doi.org/10.7916/D8-6NBT-S431
Siliezar, J. (2021, April 5). Data Science Review wins PROSE award. The Harvard Gazette. https://news.harvard.edu/gazette/story/2021/04/harvard-data-science-review-wins-prose-award-for-best-new-journal-in-science/
Spector, A. Z. (2024). Gaining benefit from computing and data science: A three-part framework. Communications of the ACM, 67(2), 35–38. https://doi.org/10.1145/3624726
Spector, A. Z., Norvig, P., Wiggins, C., & Wing, J. M. (2022). Data science in context: Foundations, challenges, and opportunities. Cambridge University Press.
Struckman, I., & Kupiec, S. (2023). Why they’re worried: Examining experts' motivations for signing the “pause letter.” ArXiv. https://doi.org/10.48550/arXiv.2306.00891
Tanweer, A., Gade, E. K., Krafft, P. M., & Dreier, S. (2021). Why the data revolution needs qualitative thinking. Harvard Data Science Review, 3(3). https://doi.org/10.1162/99608f92.eee0b0da
Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Graphics Press.
Wing, J. M. (2019). The data life cycle. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.e26845b4
©2024 Alfred Spector. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.