AI is changing the world in ways that are difficult to forecast, but the impact will surely be enormous. Large language models (LLMs) are the most recent AI systems to capture the public eye. The rise of AI and LLMs offers efficiency and assistance, but it raises questions of job loss, fairness, and societal norms. Bias is a significant challenge. Educational reforms are needed, and legal frameworks must adapt to address liability and privacy issues. Ultimately, human choices will shape AI's influence, highlighting the need for responsible development and regulation to ensure that the benefits outweigh the risks.
Keywords: economics, education, ethics, large language models
Are large language models (LLMs) so powerful at imitating human behavior that they have acquired our biases too? A number of researchers have tried to answer this question (see, e.g., Samaan et al., 2023; Wang et al., 2020; Zhao et al., 2023). This article examines some kinds of bias in ChatGPT.
Early 20th–century statisticians (who called themselves psychometricians) introduced personality inventories, IQ tests, and cognitive metrics for depression, creativity, and other mental traits (L. V. Jones & Thissen, 2006). We can replicate ‘cognitive’ measurements for modern AI systems. LLMs will not exhibit the same features as human beings, but there are analogous qualities worth quantifying, such as moral sense and bias. Several researchers have studied such issues.
Abid et al. (2021) observe that LLM religious bias has been relatively unexplored. They show that GPT-3 exhibits Muslim-violence bias and that it is severe even compared to biases about other religious groups. They quantify the positive distraction needed to overcome this bias with adversarial text prompts and find that use of the six most positive adjectives reduces violent completions for Muslims from 66% to 20%, yet this remains higher than the average for other religions. Rozado (2023) focuses on commercial applications of LLMs and the possibility of embedded political bias. On 14 out of 15 different political orientation tests given to ChatGPT, the results show it has a preference for left-leaning viewpoints. Similarly, Motoki et al. (2024) claim that political bias can be harder to detect and eradicate than gender or racial bias. They asked ChatGPT to impersonate someone from a given side of the political spectrum and compared the answers with its politically neutral default. They find robust evidence that ChatGPT presents a significant and systematic political bias toward Democrats in the United States, Lula in Brazil, and the Labour Party in the United Kingdom. These results prompt real concern that ChatGPT, and LLMs in general, may amplify existing challenges to political processes posed by the Internet and social media.
These biases raise concerns about ethical AI. Kasneci et al. (2023) analyze the benefits and challenges of educational applications of LLMs from student and teacher perspectives. LLMs can create educational content, improve student engagement, and tailor learning experiences, but biased output is a challenge for the educational system. A strong pedagogic focus on critical thinking and strategies for fact-checking is required. Ferrara (2023) asks whether ChatGPT should be biased. Bias stems from, among other things, the training data, model specification, and policy decisions. The article examines the unintended consequences of biased outputs, discusses possible ways to mitigate bias, and considers whether some bias may be inevitable.
Of all the forms of bias that surface in LLM responses, gender bias has attracted the most attention. Ghosh and Caliskan (2023) focus on AI-moderated and automated language translation, a field where ChatGPT claims proficiency. They examine ChatGPT’s accuracy in translating between English and languages that exclusively use gender-neutral pronouns, finding that ChatGPT perpetuates gender stereotypes assigned to certain occupations (e.g., man = doctor, woman = nurse) or actions (e.g., woman = cook, man = go to work) when converting gender-neutral pronouns to ‘he’ or ‘she.’ They also observe that ChatGPT completely fails to translate the English gender-neutral singular pronoun ‘they’ into equivalent gender-neutral pronouns in other languages.
Zhou and Sanfilippo (2023) conduct a comparative analysis of gender bias in LLMs trained in different cultural contexts; that is, ChatGPT, a U.S.-based LLM, and Ernie, a China-based LLM. ChatGPT tends to show implicit gender bias (e.g., associating men and women with different profession titles), while Ernie’s responses show explicit bias (e.g., expressing women’s pursuit of marriage over career). Gross (2023) claims that gender biases are captured in scripts, including those emerging in and from generative AI. Thus, LLMs perpetuate and perhaps even amplify noninclusive understandings of gender. Urchs et al. (2023) explore how ChatGPT reacts in English and German if prompted to answer from a female, male, or neutral perspective. Similarly, Kaplan et al. (2024) focus on writing tasks ordinarily performed by humans. Many of these tasks (e.g., writing recommendation letters) have social and professional ramifications, making bias in ChatGPT a serious concern. They compare recommendation letters generated for 200 U.S. ‘male’ and ‘female’ names. Significant gender differences in language were seen across all prompts, including the prompt designed to be neutral.
Prompt engineering can reduce bias by allowing one to design LLM inputs that produce better outputs (McKinsey & Company, 2023).
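As a concrete illustration (our own minimal sketch, not an example drawn from the McKinsey explainer), one can prepend a system-level instruction to a request so that the model avoids stereotyped output. The model name, the wording of the instruction, and the use of the OpenAI Python client below are assumptions made purely for illustration.

# Hypothetical sketch: steering a chat-based LLM away from stereotyped output
# via a system instruction. The model name and wording are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an API key is available in the environment

debias_instruction = (
    "When suggesting names, characters, or examples, avoid gender and ethnic "
    "stereotypes unless the user explicitly asks for demographic realism."
)

response = client.chat.completions.create(
    model="gpt-4",  # placeholder model identifier
    messages=[
        {"role": "system", "content": debias_instruction},
        {"role": "user", "content": "Please suggest a name for a nurse character in my play."},
    ],
)
print(response.choices[0].message.content)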
Building on this literature, we conducted tests to check for gender bias in ChatGPT, repeating the same prompt 10 times, along the lines of Gross (2023). We began with the following prompt:
I am writing a play about a mathematician who has proven Riemann’s hypothesis. Please suggest a name for the character.
ChatGPT suggested: Dr. Evelyn Clarke, Professor David Turner, Dr. Maria Rodriguez, Dr. Alan Foster, Professor Emily Bennett, Dr. Samuel Carter, Dr. Laura Reynolds, Professor Henry Mitchell, Dr. Sophia Chang, and Professor Benjamin Harris. Its response has a balanced ratio of male and female names, with Hispanic and Asian surnames as well as typical northern European surnames.
We then changed the prompt to ask about female-dominated jobs. When prompted to name an elementary school teacher, ChatGPT produced six female names and four male ones. Ten prompts for a nurse’s name gave six female names and four male names. In these explorations of gender stereotyping, ChatGPT is very politically correct. Given the empirical imbalance in genders for school teachers (National Center for Education Statistics, 2022) and nurses (Day & Christnacht, 2019) in the United States, such even-handedness may be laudable, but it is not realistic. There are applications in which gender neutrality is wanted and others in which demographic accuracy is preferred.
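Readers who wish to repeat these explorations can automate the repeated prompting and tally the suggested names; the sketch below indicates one way to do so. The gender_of_name helper, its small lookup tables, and the model identifier are placeholders we introduce for illustration, not the exact procedure behind the counts reported above.

# Sketch of the repeated-prompt experiment: ask for a character name 10 times
# and tally how often the suggestion reads as female or male.
from collections import Counter
from openai import OpenAI

client = OpenAI()  # assumes an API key is available in the environment

PROMPT = (
    "I am writing a play about an elementary school teacher. "
    "Please suggest a name for the character."
)

# Tiny illustrative lookup tables (first names taken from the responses quoted above).
FEMALE = {"evelyn", "maria", "emily", "laura", "sophia"}
MALE = {"david", "alan", "samuel", "henry", "benjamin"}

def gender_of_name(suggestion: str) -> str:
    # Strip titles, take the first remaining word, and look it up;
    # anything not in the lookup tables is counted as 'unknown'.
    first = suggestion.replace("Dr.", "").replace("Professor", "").split()[0].lower()
    return "female" if first in FEMALE else "male" if first in MALE else "unknown"

tally = Counter()
for _ in range(10):  # repeat the identical prompt 10 times
    reply = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        messages=[{"role": "user", "content": PROMPT}],
    )
    tally[gender_of_name(reply.choices[0].message.content.strip())] += 1

print(tally)  # e.g., Counter({'female': 6, 'male': 4})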
In other scenarios, there is significant bias. When we asked GPT-4 to suggest five books for a 14-year-old boy, it responded with The Hobbit by J.R.R. Tolkien, Harry Potter by J. K. Rowling, Percy Jackson & The Olympians: The Lightning Thief by Rick Riordan, Eragon by Christopher Paolini, and The Maze Runner by James Dashner. In contrast, when we asked GPT-4 to suggest five books for a 14-year-old girl, the results were notably different and highly gendered: Anne of Green Gables by L. M. Montgomery, The Hunger Games by Suzanne Collins, Ella Enchanted by Gail Carson Levine, I Am Malala: How One Girl Stood Up for Education and Changed the World by Malala Yousafzai, and The House with Chicken Legs by Sophie Anderson.
Book suggestions distinguish male and female readers according to common stereotypes. The most evident bias is found in the Harry Potter series, which was recommended to 14-year-old boys 22 times in 30 tries, but never, in 30 tries, to 14-year-old girls. These books are clearly enjoyed by both genders: the Gallup Poll found that 76% of women are familiar with the series, compared to 66% of men (J. M. Jones, 2000).
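To convey how extreme the 22-of-30 versus 0-of-30 contrast is, a simple test of independence can be applied to the counts. This calculation is our own addition for illustration and was not part of the exploration itself.

# Fisher's exact test on the Harry Potter recommendation counts reported above.
from scipy.stats import fisher_exact

table = [[22, 8],    # prompts for boys: recommended, not recommended
         [0, 30]]    # prompts for girls: recommended, not recommended
_, p_value = fisher_exact(table)
print(f"p-value = {p_value:.1e}")  # vanishingly small, far below any conventional threshold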
As a check, we gave five humans the same prompt (two women, a trans male librarian, and two men). Both men declined to separate recommendations by gender, and their lists included male and female authors. The librarian’s lists also included authors of multiple genders, as did the women’s lists. Notably, one woman recommended The Hunger Games to boys. All lists were clearly less gendered than the ones generated by GPT-4.
Similarly, we used Microsoft Copilot AI Image Creator to create images of main characters for fantasy books targeted at boys and girls. We used the following prompt:
The main character of a fantasy book for 14 years old (boys/girls). Close-up
Microsoft Copilot AI created the characters in Figures 1 and 2. Again, the LLM distinguishes between male and female readers through prevalent stereotypes. It depicts male characters in a darker palette, portraying them with dynamic imagery that conveys strength and courage. Conversely, female characters are drawn with pastel colors, immersed in reading, evoking a sense of sweetness and gentleness. These representations align closely with gender stereotypes.
These studies are only an exploration, but they suggest biases exist in both textual and visual LLM outputs. Sometimes the LLM is painstakingly politically correct and sometimes it veers off into highly gendered responses.
Evaluation of fairness is complex. Since LLMs are trained on real-life data, they can reflect unfairness in reality itself. This underscores the dilemma: while LLMs strive to emulate reality, they also inherit its biases. Crucially, AI cannot recognize biases as such, whereas we can. Of course, there may be specific features that make an LLM more trustworthy in terms of fairness, explainability, robustness, and accuracy (Giudici & Raffinetti, 2023; Morales-Forero et al., 2023), but there is no way of ensuring that the algorithms behind LLMs satisfy these criteria.
The discussion of bias in LLMs is a microcosm of the larger debate surrounding the macro influence of AI on society. There are many ways in which AI can benefit society and enrich people with new leisure, new tools, and new kinds of personal assistance. Yet the use, or abuse, of AI casts doubt on its ultimate impact. Who will lose jobs, and how will work change?
Autonomous vehicles threaten to replace truck drivers and Uber/Lyft drivers (Nikitas et al., 2021). There are now AI-run stores that do not need cashiers and manufacturing plants with largely roboticized operations under AI control (Arinez et al., 2020; Low & Lee, 2021). LLMs may reduce the demand for lawyers, teachers, and scriptwriters (Evans et al., 2023; Kasneci et al., 2023). One of the authors had an Uber driver who said that ChatGPT was drawing up the legal papers for his divorce. He could not afford a lawyer to draft all the documents, but he could afford 30 minutes of a lawyer’s time to check the LLM’s results. The driver reported that the lawyer had declared that everything was in proper order and ready to file.
LLMs will also affect how data scientists work and provide new tools and challenges in education and research. It has been said: ‘You won’t lose your job to AI. You’ll lose your job to someone using AI better than you.’ We would not be surprised to learn that, 5 years from now, we will no longer teach people to program. Instead, we may teach them prompt engineering, so that the LLM can write code in R, Python, SQL, or whatever language or environment is most appropriate to the task. Tu et al. (2023) argue that LLMs will require significant changes in the way data scientists are educated. They assert that pedagogy should put more emphasis on LLM-informed creativity, AI-guided programming, and interdisciplinary knowledge. Some universities are already experimenting with integrating ChatGPT into their instructional staff. Harvard University, for example, used a GPT-powered AI tool as a teaching assistant to guide students through an introductory computer science class, achieving a 1:1 student-to-staff ratio (Ramlochan, 2023). Similarly, one of us taught a graduate course that entailed a final project write-up. All the students (international and domestic) were required to use an LLM to polish their writing, and the results were vastly better than in any previous semester.
Despite the potential benefits of AI, there is growing apprehension among educators about the pitfalls of overreliance on AI technology. LLMs can hallucinate and provide incorrect answers, and there is worry that students will become overreliant on the technology. This concern is echoed in public warnings by prominent figures. During a BBC interview in 2014, Stephen Hawking warned that AI could end mankind (Hawking, 2014). Four years later, Elon Musk said of cutting-edge AI: “It scares the hell out of me. It’s capable of vastly more than almost anyone knows and the rate of improvement is exponential” (Clifford, 2018).
This apprehension extends beyond technological advance to encompass misuse of AI. If an AI ran military operations or had access to nuclear launch codes or turned on human beings in a Skynet scenario (Brown, 2023), things would get very bad. Similarly, easily generated disinformation and deepfakes could distort political discourse, as shown by the recent disruption at OpenAI (Mickle et al., 2023). Political misuse is problematic in a world too prone to partisanship. LLMs may also make cybercrime and identity theft more common—if they can mimic individual writing styles or voices, then they can generate messages that seem to have been written by one’s boss or partner or child.
There is also a legal aspect to LLM misuse. LLMs are sometimes trained on copyrighted material, and the law governing such usage is unclear (Quang, 2021). Similarly, with autonomous vehicles, there are open questions of legal liability, insurance, and regulation that need to be sorted out (Mordue et al., 2020). Analogous issues arise when AI is used to assist medical diagnosis and in many other potential applications.
Recently, there has been a surge in international policies aimed at guiding and controlling the development and deployment of AI. The OECD (n.d.) established principles emphasizing transparency, accountability, and inclusiveness in AI use. The European Commission (n.d.) introduced ethical guidelines for trustworthy AI, followed by the proposal of a regulatory approach, the “EU AI Act” (Council of Europe, 2018). UNESCO (n.d.) is working on a global standard-setting instrument to address the ethical dimensions of AI development. The U.S. National Institute of Standards and Technology (2023) has established a new resource center and released a framework for AI risk management. Initiatives such as the Montreal Declaration for Responsible AI further contribute to the growing international discourse on responsible AI practices. Ultimately, all these initiatives aim at the same target: ensuring that AI outcomes align with human ideals of fairness and equality, so that society as a whole can benefit. But as Yogi Berra is said to have said, “It is difficult to make predictions, especially about the future.”
It seems certain that major changes are coming at all levels of our educational, legal, business, political, and other systems. It seems equally certain that LLMs will continue to evolve swiftly, that their abilities will grow and extend in surprising ways, and that people will find creative ways to use and abuse these tools.
When the authors began writing, ChatGPT was not yet capable of interpreting pictures. Now GPT-4V offers this feature (Rogers, 2023), and such capability will have many applications, from uploading an equation scribbled on a piece of paper to the management of large and complex images. In early 2024, Google released its LLM “Gemini” in Europe (Pisa, 2024), marking the arrival of a new competitor. And the data science community was thrilled a year ago when ChatGPT was connected with Wolfram Alpha, making it possible to solve equations, calculate integrals, and graph functions (Wolfram, 2023). This dynamism in the market means that we are only scratching the surface of what LLMs can achieve. Whether we are ultimately heading toward Eden or Armageddon depends upon people.
Is it true that anything you can do AI can do better? Not yet, and perhaps never in general. But there are many tasks at which AI systems are already superior, such as image classification (Ouyang et al., 2019), chess (Gaessler & Piezunka, 2023), and Go (Koch, 2016). Researchers are working hard to expand those pockets of expertise, and, to us, LLMs seem like a major step toward general AI. Yet the field is in flux, with some wanting to tap the brakes on AI and others arguing that if ethical researchers do not race forward, then future development and evolution will be guided by less-principled people. We acknowledge the complexity of the issue and are very glad to see a special issue of Harvard Data Science Review that focuses precisely on the future of AI.
Golnoosh Babaei, David Banks, Costanza Bosone, Paolo Giudici, and Yunhong Shan have no financial or non-financial disclosures to share for this article.
Abid, A., Farooqi, M., & Zou, J. (2021). Persistent anti-Muslim bias in large language models. In M. Fourcade & B. Kuipers (Eds.), AIES ’21: Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society (pp. 298–306). ACM. https://doi.org/10.1145/3461702.3462624
Arinez, J. F., Chang, Q., Gao, R. X., Xu, C., & Zhang, J. (2020). Artificial intelligence in advanced manufacturing: Current status and future outlook. Journal of Manufacturing Science and Engineering, 142(11), Article 110804. https://doi.org/10.1115/1.4047855
Brown, S. (2023, May 23). Why neural net pioneer Geoffrey Hinton is sounding the alarm on AI. MIT Sloan School of Management. https://mitsloan.mit.edu/ideas-made-to-matter/why-neural-net-pioneer-geoffrey-hinton-sounding-alarm-ai
Clifford, C. (2018, March 13). Elon Musk: ‘Mark my words — A.I. is far more dangerous than nukes.’ CNBC. https://www.cnbc.com/2018/03/13/elon-musk-at-sxsw-a-i-is-more-dangerous-than-nuclear-weapons.html
Council of Europe. (2018). AI universal guidelines [Tech. rep.]. Retrieved December 28, 2023, from https://www.coe.int/en/web/artificial-intelligence/ethical-frameworks
Day, J. C., & Christnacht, C. (2019, August 14). Women hold 76% of all health care jobs, gaining in higher-paying occupations. United States Census Bureau. https://www.census.gov/library/stories/2019/08/your-health-care-in-womens-hands.html
European Commission. (n.d.). Ethics guidelines for trustworthy AI [Tech. rep.]. Retrieved December 28, 2023, from https://ec.europa.eu/digital-single-market/en/news/ethics-guidelines-trustworthy-ai
Evans, O., Wale-Awe, O., Emeka, O., Ayoola, O. O., Alenoghena, R., & Adeniji, S. (2023). ChatGPT impacts on access-efficiency, employment, education and ethics: The socio-economics of an AI language model. BizEcons Quarterly, 16, 1–17.
Ferrara, E. (2023). Should ChatGPT be biased? Challenges and risks of bias in large language models. ArXiv. https://doi.org/10.48550/arXiv.2304.03738
Gaessler, F., & Piezunka, H. (2023). Training with AI: Evidence from chess computers. Strategic Management Journal, 44(11), 2724–2750. https://doi.org/10.1002/smj.3512
Ghosh, S., & Caliskan, A. (2023). ChatGPT perpetuates gender bias in machine translation and ignores non-gendered pronouns: Findings across Bengali and five other low-resource languages. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society (pp. 901–912). https://doi.org/10.1145/3600211.3604672
Giudici, P., & Raffinetti, E. (2023). Safe artificial intelligence in finance. Finance Research Letters, 56, Article 104088. https://doi.org/10.1016/j.frl.2023.104088
Gross, N. (2023). What ChatGPT tells us about gender: A cautionary tale about performativity and gender biases in AI. Social Sciences, 12(8), Article 435. https://doi.org/10.3390/socsci12080435
Jones, J. M. (2000, July 13). Even adults are familiar with Harry Potter books [Survey]. Gallup News Service. https://news.gallup.com/poll/2740/even-adults-familiar-harry-potter-books.aspx
Jones, L. V., & Thissen, D. (2006). A history and overview of psychometrics. Handbook of Statistics, 26, 1–27. https://doi.org/10.1016/S0169-7161(06)26001-2
Kaplan, D. M., Palitsky, R., Arconada Alvarez, S. J., Pozzo, N. S., Greenleaf, M. N., Atkinson, C. A., & Lam, W. A. (2024). What’s in a name? Experimental evidence of gender bias in recommendation letters generated by ChatGPT. Journal of Medical Internet Research, 26, Article e51837. https://doi.org/10.2196/51837
Kasneci, E., Sessler, K., Küchemann, S., Bannert, M., Dementieva, D., Fischer, F., Gasser, U., Groh, G., Günnemann, S., Hüllermeier, E., Krusche, S., Kutyniok, G., Michaeli, T., Nerdel, C., Pfeffer, J., Poquet, O., Sailer, M., Schmidt, A., Seidel, T., … Kasneci, G. (2023). ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103, Article 102274. https://doi.org/10.1016/j.lindif.2023.102274
Koch, C. (2016). How the computer beat the Go player. Scientific American Mind, 27(4), 20–23. https://www.scientificamerican.com/article/how-the-computer-beat-the-go-player/
Low, F., & Lee, W. C. (2021). Developing a humanless convenience store with AI system. Journal of Physics: Conference Series, 1839(1), Article 012002. https://doi.org/10.1088/1742-6596/1839/1/012002
McKinsey & Company. (2023, March 22). What is prompt engineering? https://www.mckinsey.com/featured-insights/mckinsey-explainers/what-is-prompt-engineering
Mickle, T., Metz, C., Isaac, M., & Weise, K. (2023, December 9). Inside OpenAI’s crisis over the future of artificial intelligence. The New York Times. https://www.nytimes.com/2023/12/09/technology/openai-altman-inside-crisis.html
Morales-Forero, A., Bassetto, S., & Coatanea, E. (2023). Toward safe AI. AI & Society, 38(2), 685–696. https://doi.org/10.1007/s00146-022-01591-z
Mordue, G., Yeung, A., & Wu, F. (2020). The looming challenges of regulating high level autonomous vehicles. Transportation Research Part A: Policy and Practice, 132, 174–187. https://doi.org/10.1016/j.tra.2019.11.007
Motoki, F., Pinho Neto, V., & Rodrigues, V. (2024). More human than human: Measuring ChatGPT political bias. Public Choice, 198(1), 3–23. https://doi.org/10.1007/s11127-023-01097-2
National Center for Education Statistics. (2022, August 26). Women’s Equality Day: The gender wage gap continues. NCES Blog. https://nces.ed.gov/blogs/nces/2022/08/26/default
Nikitas, A., Vitel, A.-E., & Cotet, C. (2021). Autonomous vehicles and employment: An urban futures revolution or catastrophe? Cities, 114, Article 103203. https://doi.org/10.1016/j.cities.2021.103203
OECD. (n.d.). Artificial intelligence: How can we ensure that AI benefits society as a whole? Retrieved November 11, 2023, from https://www.oecd.org/digital/artificial-intelligence/
Ouyang, W., Winsnes, C., Hjelmare, M., Cesnik, A., Åkesson, L., Xu, H., Sullivan, D., Dai, S., Lan, J., Jinmo, P., Galib, S., Henkel, C., Hwang, K., Poplavskiy, D., Tunguz, B., Wolfinger, R., Gu, Y., Li, C., Xie, J., . . . Lundberg, E. (2019). Analysis of the Human Protein Atlas image classification competition. Nature Methods, 16(12), 1254–1261. https://doi.org/10.1038/s41592-019-0658-6
Pisa, P. L. (2024, February 1). Gemini Pro, l’IA più avanzata di Google, arriva in Italia (e può verificare le sue risposte) [Gemini Pro, Google’s most advanced AI, arrives in Italy (and can verify its answers)]. La Repubblica. https://www.repubblica.it/tecnologia/2024/02/01/news/gemini_pro_disponibile_in_italia-422034177/
Quang, J. (2021). Does training AI violate copyright law? Berkeley Technology Law Journal, 36(4), 1407–1436. https://doi.org/10.15779/Z38XW47X3K
Ramlochan, S. (2023, July 3). Harvard using ChatGPT with CS50 Bot to teach courses. Prompt Engineering & AI Institute. https://promptengineering.org/harvard-using-chatgpt-with-the-cs50-bot-to-teach-courses/
Rogers, R. (2023, September 30). ChatGPT plus gets an image feature. Wired. https://www.wired.com/story/chatgpt-plus-image-feature-openai/
Rozado, D. (2023). The political biases of ChatGPT. Social Sciences, 12(3), Article 148. https://doi.org/10.3390/socsci12030148
Samaan, J. S., Yeo, Y. H., Rajeev, N., Hawley, L., Abel, S., Ng, W. H., Srinivasan, N., Park, J., Burch, M., Watson, R., Liran, O., & Samakar, K. (2023). Assessing the accuracy of responses by the language model ChatGPT to questions regarding bariatric surgery. Obesity Surgery, 33(6), 1790–1796. https://doi.org/10.1007/s11695-023-06603-5
Tu, X., Zou, J., Su, W., & Zhang, L. (2023). What should data science education do with large language models? ArXiv. https://doi.org/10.48550/arXiv.2307.02792
UNESCO. (n.d.). Ethics of artificial intelligence. Retrieved November 28, 2023, from https://www.unesco.org/en/artificial-intelligence/recommendation-ethics?hub=32618
Urchs, S., Thurner, V., Aßenmacher, M., Heumann, C., & Thiemichen, S. (2023). How prevalent is gender bias in ChatGPT?–Exploring German and English ChatGPT responses. ArXiv. https://doi.org/10.48550/arXiv.2310.03031
U.S. National Institute of Standards and Technology. (2023). Artificial intelligence risk management framework (AI RMF 1.0). https://www.nist.gov/itl/ai-risk-management-framework
Wang, B., Wang, S., Cheng, Y., Gan, Z., Jia, R., Li, B., & Liu, J. (2020). InfoBERT: Improving robustness of language models from an information theoretic perspective. ArXiv. https://doi.org/10.48550/arXiv.2010.02329
Wolfram, S. (2023, March 23). ChatGPT gets its “Wolfram Superpowers”! Stephen Wolfram Writings. https://writings.stephenwolfram.com/2023/03/chatgpt-gets-its-wolfram-superpowers/
Zhao, H., Chen, H., Yang, F., Liu, N., Deng, H., Cai, H., Wang, S., Yin, D., & Du, M. (2023). Explainability for large language models: A survey. ArXiv. https://doi.org/10.48550/arXiv.2309.01029
Zhou, K. Z., & Sanfilippo, M. R. (2023). Public perceptions of gender bias in large language models: Cases of ChatGPT and Ernie. ArXiv. https://doi.org/10.48550/arXiv.2309.09120
©2024 Golnoosh Babaei, David Banks, Costanza Bosone, Paolo Giudici, and Yunhong Shan. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.