Red-teaming is an emergent strategy for governing large language models (LLMs) that borrows heavily from cybersecurity methods. Policymakers and developers alike have embraced this promising, yet largely unvalidated, approach for regulating generative AI. We argue that AI red-teaming efforts address a distinctive moderation need of LLM developers: scaling up human mischievousness by inviting a wide diversity of people to make the system misbehave in unsafe or dangerous ways. However, there are significant methodological challenges in connecting the practices of AI red-teaming to the broad range of AI harms that policymakers intend it to address. Caution is warranted as policymakers and developers invest significant resources into AI red-teaming.
Keywords: artificial intelligence, AI governance, red-teaming, large language models, generative AI, AI policy
©2023 Jacob Metcalf and Ranjit Singh. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.