Access to information about the data used to train foundation AI models is vital for many tasks. Despite progress made by sections of the AI community, there remains a general lack of transparency about the content and sources of training data sets. Whether through voluntary initiative by firms or through regulatory intervention, this has to change.
Keywords: artificial intelligence, machine learning, foundation models, training data, transparency, trust
Foundation models are trained on large collections of data, much of which is gathered from across the web.
The training of these models “depends on the availability of public, scrapable data that leverages the collective intelligence of humanity, including the painstakingly edited Wikipedia, millennia's worth of books, billions of Reddit comments, hundreds of terabytes’ worth of images, and more” (Huang & Siddarth, 2023). An investigation by the Allen Institute for AI and the Washington Post into the popular C4 training data set found that its content originated from 15 million different web domains (Schaul et al., 2023).
Knowing what is in the data sets used to train models and how they have been compiled is vitally important. Without this information, the work of developers, researchers, and ethicists to address biases or remove harmful content from the data is hampered. Information about training data is also vital to lawmakers’ attempts to assess whether foundation models have ingested personal data or copyrighted material. Further downstream, the intended operators of AI systems and those impacted by their use are far more likely to trust them if they understand how they have been developed.
However, in undertaking their analysis, Schaul et al. (2023) concluded that “many companies do not document the contents of their training data—even internally—for fear of finding personal information about identifiable individuals, copyrighted material and other data grabbed without consent.”
In public, companies have used different arguments to justify the lack of transparency around their training data. In documentation published at the launch of its GPT-4 model, OpenAI (2023) stated that it would not share detailed information about ‘data set construction’ and other aspects of the model’s development due to “the competitive landscape and the safety implications of large-scale models.” The decision not to disclose the data used to train the model was roundly criticized by a number of leading researchers (Xiang, 2023). A recent op-ed in the Guardian argued that companies are using ‘speculative fears’ to “stop people asking awkward questions about how this particular technological sausage has been made” (Naughton, 2023).
Even when companies have published the training data they have used, they have tended to publish only their ‘fine-tuning’ data (Mitchell, 2023). This matters because it is the far larger, messier pretraining data sets that are most likely to include harmful content or copyrighted material. It is also a reminder not to see training data as a singular, static artefact. Going forward, we should expect access to information about the various types of data used to train AI systems, including data used for reinforcement learning or retrieval augmentation (where an AI model accesses a new pool of data before producing its output), as well as data used to evaluate a model’s performance.
Some parts of the AI community have made progress on training data transparency. Inspired by the work of AI ethics researchers Emily Bender, Batya Friedman, and Timnit Gebru, the Hugging Face platform promotes the use of model cards and dataset cards to its community of developers (Ozoani et al., 2023). Dataset cards document how a data set was created and what it contains, as well as potential legal or ethical issues to consider when working with it. Dataset Nutrition Labels take inspiration from nutritional labels for food to “highlight the key ingredients in a dataset such as… demographic representation, as well as unique or anomalous features regarding distributions, [and] missing data” (Data Nutrition Project, 2023). Google’s Know Your Data tool similarly helps researchers, engineers, and product teams understand data sets, with the goal of improving data quality and mitigating fairness and bias issues (Know Your Data, 2023).
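As an illustration, the following Python sketch shows the kind of structured record such documentation tools capture. The class, field names, and values are hypothetical; they do not reproduce Hugging Face’s dataset card format or the Data Nutrition Project’s actual schema. The point is simply that key facts about a data set’s sources, contents, and risks can be recorded in a lightweight, machine-readable form.

```python
# Illustrative sketch only: the schema below is hypothetical and does not
# reproduce any existing dataset card or nutrition label standard.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DatasetDocumentation:
    name: str
    sources: list[str]                # where the data was gathered from
    collection_method: str            # e.g., web crawl, licensed corpus
    languages: list[str]
    license: str
    contains_personal_data: bool
    known_copyright_issues: str
    demographic_representation: str   # who is (under)represented in the data
    known_gaps_or_anomalies: list[str] = field(default_factory=list)


# A hypothetical record for an imagined web-scraped corpus.
card = DatasetDocumentation(
    name="example-web-corpus",
    sources=["web crawl snapshot (hypothetical)"],
    collection_method="web crawl, filtered for English and deduplicated",
    languages=["en"],
    license="varies by source document; unresolved",
    contains_personal_data=True,
    known_copyright_issues="includes material scraped without explicit consent",
    demographic_representation="skewed towards English-language domains",
    known_gaps_or_anomalies=["boilerplate navigation text", "duplicated pages"],
)

# Serialize the record so it can be published alongside the data set.
print(json.dumps(asdict(card), indent=2))
```

Published alongside the data set itself, even a simple record like this gives developers, researchers, and regulators a starting point for the kinds of audits and risk assessments discussed above.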
In July 2023, the White House announced that seven large AI firms had committed to “develop robust technical measures to ensure that users know when content is AI-generated, such as watermarking” (Leibowicz, 2023). Given that foundation AI models have started to be trained on AI-generated data, such tools will have a role to play in documenting the provenance of training data as well as the integrity of downstream AI outputs.
Decisions about what to document about training data might eventually be taken out of developers’ hands. In the United States, the Federal Trade Commission has recently ordered OpenAI to document all sources of data used to train its large language models (Zakrzewski, 2023). A group of large media organizations have published an open letter urging lawmakers around the world to introduce new regulations to require transparency of training data sets (Agence France-Presse et al., 2023). We should see demands for information about training data as but the latest wave in an ongoing push for corporate transparency. In the United Kingdom, laws around the mandatory registration and publication of information by companies go back to the 1800s, and over this time, regulators have developed standardized approaches to avoid each company choosing its own way to report on its finances and other activity. Perhaps we need the same for disclosures about the data on which foundation AI models have been trained (O’Reilly, 2023).
Whether companies step up or our governments intervene, we must ensure that the data used to train AI systems is not shrouded in secrecy. Public trust, our ability to mitigate the potential harms of these systems, and the effectiveness of our regulatory regime depend on it.
Jack Hardinges, Elena Simperl, and Nigel Shadbolt have no financial or non-financial disclosures to share for this article.
Agence France-Presse, European Pressphoto Agency, European Publishers’ Council, Gannett USA TODAY Network, Getty Images, National Press Photographers Association, National Writers Union, News Media Alliance, The Associated Press, & The Authors Guild. (2023, August 9). Preserving public trust in media through unified AI regulation and practices [Open letter]. https://drive.google.com/file/d/1jONWdRbwbS50hd1-x4fDvSyARJMCgRTY/view?pli=1
Data Nutrition Project. (2023). The Data Nutrition Project: Empowering data scientists and policymakers with practical tools to improve AI outcomes. Retrieved November 2023 from https://datanutrition.org/
Huang, S., & Siddarth, D. (2023, February 6). Generative AI and the digital commons [Working paper]. Collective Intelligence Project. https://cip.org/research/generative-ai-digital-commons
Know Your Data. (2023). Know Your Data helps researchers, engineers, product teams, and decision makers understand datasets with the goal of improving data quality, and helping mitigate fairness and bias issues. Retrieved November 2023 from https://knowyourdata.withgoogle.com/
Leibowicz, C. (2023, August 9). Why watermarking AI-generated content won’t guarantee trust online. MIT Technology Review. https://www.technologyreview.com/2023/08/09/1077516/watermarking-ai-trust-online/
Mitchell, M. (2023, April 12). Okay. Inspired by news & @Stealcase, let me clarify something. When AI companies release “open training data” for a model [Image attached] [Post]. Mastodon. https://mastodon.social/@mmitchell_ai/110187818225660060
Naughton, J. (2023, August 19). The world has a big appetite for AI—but we really need to know the ingredients. The Guardian. https://www.theguardian.com/commentisfree/2023/aug/19/the-world-has-a-big-appetite-for-ai-but-we-really-need-to-know-the-ingredients
OpenAI. (2023). GPT-4. Retrieved November 2023 from https://openai.com/research/gpt-4
O’Reilly, T. (2023, April 14). You can’t regulate what you don’t understand. O’Reilly. https://www.oreilly.com/content/you-cant-regulate-what-you-dont-understand-2/
Ozoani, E., Gerchick, M., & Mitchell, M. (2023). Model cards. Hugging Face. https://huggingface.co/blog/model-cards
Schaul, K., Chen, S. Y., & Tiku, N. (2023, April 19). Inside the secret list of websites that make AI like ChatGPT sound smart. The Washington Post. https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/
Xiang, C. (2023, March 16). OpenAI’s GPT-4 is closed source and shrouded in secrecy. Vice. https://www.vice.com/en/article/ak3w5a/openais-gpt-4-is-closed-source-and-shrouded-in-secrecy
Zakrzewski, C. (2023, July 13). FTC investigates OpenAI over data leak and ChatGPT’s inaccuracy. The Washington Post. https://www.washingtonpost.com/technology/2023/07/13/ftc-openai-chatgpt-sam-altman-lina-khan/
©2024 Jack Hardinges, Elena Simperl, and Nigel Shadbolt. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.