There is no doubt that Generative AI (GAI) will affect the unconnected more deeply than current AI debates acknowledge. Being left behind in terms of internet access prevents them not only from accessing online information, but also from using GAI. GAI generates new material from online data, material that will in turn be used in the development of new GAI systems. GAI thus perpetuates the philosophy and picture of reality conveyed by online data, which do not always hold, especially among unconnected communities. The expanding use of GAI will therefore perpetuate and create further harm to the unconnected, harm that will continue to grow as long as they are left behind in terms of access. There is a pressing need to connect the unconnected if we do not want them to suffer the pain of being left out of the development of AI systems due to the lack of data.
Keywords: generative AI, unconnected, Internet access, discrimination, AI ethics, data
The term ‘unconnected’ typically refers to people who do not have access to the internet. The most recent International Telecommunication Union (ITU) figures show that 2.7 billion people are still offline (ITU, 2022). While these figures represent one-third of the world's population, the proportion is nearly reversed in the least developed countries (LDCs) and landlocked developing countries (LLDCs), where only 36% of the population is currently online and nearly two-thirds remain unconnected. Moreover, a sizable portion of the people in LDCs and LLDCs who are deemed to be connected only have Level 1 access to the internet, meaning limited access (poor connectivity), resource constraints, low literacy levels, and a lack of relevant content.
Being left behind in terms of internet access prevents people not only from accessing online information, but also from using GAI systems, which are mainly accessed online. In addition, while some countries have banned tools such as ChatGPT, others have not (yet?) officially been granted access. According to OpenAI, the reasons for restricting access to ChatGPT are diverse and include technical challenges and resource limitations. Therefore, OpenAI decided to ‘prioritize’ some countries. The list of supported countries (OpenAI, 2023) excludes 16 African countries, the majority of which have the lowest internet penetration rates on the continent, such as Burundi with only 14.60% in 2022. This goes against the nondiscrimination principle promoted in the United Nations Educational, Scientific and Cultural Organization (UNESCO) Recommendation on the Ethics of AI (UNESCO, 2021). But the pain is just beginning.
Successful AI models depend on the availability of training data. As datafication dramatically expands, due mainly to online activities, the process will be exponentially accelerated by the introduction of GAI. In fact, GAI models are trained on huge data sets. For example, GPT-3 was trained on 570GB (300 billion words) of data obtained from the internet, including books, web texts, Wikipedia pages, and other pieces of writing (Gonsalves, 2023). After their training, GAI models are used to generate new content, comprising text, images, audio, or video, which resembles or draws inspiration from existing data, thereby contributing to the amplification of the datafication process.
Who uses GAI systems to generate new content? Surely not the unconnected, since they do not have access to the internet. Where is that new content stored? Surely, again, on the internet! So it will be added to existing online content. As GAI models become ever larger and more data-hungry, data dependency will increase, fostering the reuse of previously GAI-generated content. This process will be facilitated by the release of GPTBot, which has been designed to collect text data from the internet to improve the language models, including the upcoming GPT-5. This will perpetuate the limited picture of reality conveyed by online data, which does not always hold for everyone, especially for members of unconnected communities, whose lived experience, cultural artefacts, and values have not been registered digitally and have thus been excluded from the synthetic world images generated by GAI systems.
“Does a goat eat paper?” This question arises in a marking room during a national exam in Cameroon. A teacher bursts out laughing after reading a passage from a student's paper that stated: “The goat eats paper.” Several other teachers join in, but one becomes indignant and says: “In my village, that is true!” Total silence falls over the marking room, and the student’s paper is reconsidered. These teachers live in the same country, but this truth is known only to a minority. What would happen if a GAI system designed for marking papers were developed without this local knowledge? The resulting application would be harmful and painful for those who are not represented in terms of digital data.
In early 2023, we asked ChatGPT: “Do goats eat paper?” The answer was “No, goats do not eat paper. Goats are herbivores that mainly consume grass, hay, and other vegetation.” But later, with ChatGPT-3.5, the answer was more reasonable, arguing that “the goat can eat paper, even if it is not good for his health.” We asked the same question to Google Bard using the Nova mobile application and the answer was: “Goats are herbivores and eat a variety of plants, including leaves, grass, bark, and fruit. They do not eat paper, as it is not a natural part of their diet.” The system goes even further and states that “if a goat eats paper, it is important to take it to the vet immediately, as they may need to be treated for an intestinal blockage.” But in some villages in sub-Saharan Africa, goats are used to eating paper without being harmed. Such a truth is known only by a minority and may not be online, since those who know it are mainly offline or, even when online, have a digital literacy level too low to produce content that can later be used in the development of AI systems.
So, no matter how large, large language models can be meaningless or even harmful to communities that are not represented in data sets. As Andrew Ng states, we need better data, not bigger data (Baeza-Yates, 2023). While current debates around data justice and AI ethics are ongoing, we should bear in mind that we first need data. The representation of minorities in data sets, as called for by data justice thinkers, can only be achieved if there are sufficient data from minorities. So, the way to not “take even what they have” is to give them the opportunity to produce data, and better data. They should no longer be ‘left behind’ in terms of access if we do not want them to be ‘left out’ of the development of AI systems.
Jean Louis Kedieng Ebongue Fendji has no financial or non-financial disclosures to share for this article.
Baeza-Yates, R. (2023, January 19). The size doesn’t matter: Data just has to be right. Forbes Technology Council. https://www.forbes.com/sites/forbestechcouncil/2023/01/19/the-size-doesnt-matter-data-just-has-to-be-right/?sh=2e1bbb797031
Gonsalves, C. (2023). On ChatGPT: What promise remains for multiple choice assessment? Journal of Learning Development in Higher Education, (27). https://doi.org/10.47408/jldhe.vi27.1009
International Telecommunication Union. (2022, August 7). Measuring digital development: Facts and Figures 2022. ITU. https://www.itu.int/hub/publication/d-ind-ict_mdd-2022/
OpenAI. (2023, August 10). Supported countries and territories. https://platform.openai.com/docs/supported-countries
United Nations Educational, Scientific and Cultural Organization. (2021, August 2). Recommendation on the ethics of artificial intelligence. UNESCO. https://unesdoc.unesco.org/ark:/48223/pf0000381137
©2024 Jean Louis Kedieng Ebongue Fendji. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.