Data democratization enhances trust through transparency, inclusion through availability, and diversity and growth through productivity. The democratization of data, however, must be systematic and cautious to avoid inflicting harm.
Keywords: artificial intelligence (AI), data democratization, data science, deep learning, large language model (LLM), pretrained language model (PLM)
Many bicker about whether data are actual knowledge or simply a means to yield knowledge. Independent of the position taken, clearly those with data are those with knowledge, and knowledge is power. Years ago, an article in the Economist touted data as the world’s most valuable resource (“The World’s Most Valuable Resource,” 2017); now their value is even greater. Thus, if we wish to ‘level the playing field’ and enable many to participate in advancing society via science, democratizing data is a step toward getting there.
Neither the recognition of the need nor the efforts to provide data to the public are new. The research community has long pursued data availability in diverse areas. For instance, in 1992, the information retrieval community introduced a still-ongoing forum, the Text REtrieval Conference (National Institute of Standards and Technology, 2023), to generate publicly available benchmark data for repeatable scientific experimentation. By the early 2000s, the information retrieval community was predominantly focused on the design, development, and improvement of web search engines. To foster research in that domain, in 2006, America Online (AOL) released a sanitized snapshot of its query log (Pass et al., 2006). Due to privacy concerns, this release quickly became controversial; yet the use of that query log persists. The availability of this and other such logs directly improved search engine technology; that is, the efficiency, accuracy, and usability of today’s web search engines all benefit from research conducted using query logs, affecting the daily lives of much of the world’s population.
In a different domain, to drive innovations in clinical informatics and epidemiology, over the past two decades a family of de-identified patient record data sets was made publicly available. Ranging from a relatively small number of records to the tens of thousands, the Medical Information Mart for Intensive Care (MIMIC) collections (Johnson et al., 2016, 2023; Lee et al., 2011; Moody & Mark, 1996) provide a sampling of extracted electronic medical records. Other publicly available medical data sets likewise exist. These varied collections enabled novel clinical protocol improvements that are now deployed.
As illustrated, data availability has yielded technology improvements that have altered the daily lives of many if not all of us. These examples, however, are only retrospective; what are the current and future implications of data availability?
Today, one cannot hide from artificial intelligence (AI). From self-driving cars to robotic humanoids, from logistics analysis to storytelling, AI is everywhere and in everything. The typically unspoken question is: ‘Where do these AI-driven systems get the training data that they rely on?’ The answer is not always known.
When it comes to social interaction, the focus often is on conversations; namely, question answering, dialogue systems, and storytelling. The foundation behind most, if not all, such systems is deep learning. As deep learning nomenclature is unsettled, at times nuanced or simply misused, for simplicity, I adopt the designation foundational model (FM) (Bommasani et al., 2021), an interpretation that includes, without necessarily distinguishing, the common pretrained language models (PLMs) and recent large language models (LLMs). For our purposes, the architectural and training differences between PLMs and LLMs are immaterial. Suffice it to say that training any FM requires vast amounts of training data, and generally, training is based on either masked language modeling or next token prediction. In masked language modeling, a random word within a word sequence is masked, namely hidden, and the model is trained to predict that word. Next token prediction, as its name implies, requires the model to predict the next token based on previously seen sequences of tokens.
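To make these two training objectives concrete, consider the following minimal sketch. It assumes the Hugging Face transformers library and its publicly released bert-base-uncased and gpt2 checkpoints; the prompts and model choices are illustrative, not a prescription:

```python
# Minimal sketch contrasting the two objectives; assumes the Hugging Face
# `transformers` library and publicly released checkpoints (illustrative only).
from transformers import pipeline

# Masked language modeling: BERT predicts a hidden word using both directions.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for guess in fill_mask("Publicly available [MASK] drive modern language models."):
    print(guess["token_str"], round(guess["score"], 3))

# Next token prediction: GPT-2 extends a sequence one token at a time.
generate = pipeline("text-generation", model="gpt2")
print(generate("Publicly available data enable", max_new_tokens=10)[0]["generated_text"])
```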
Foundational models are essential tools in understanding and generating humanlike text. They perform many key natural language tasks such as summarization, translation, classification, storytelling, dialogue, and so on. However, these capabilities are only possible if the models are trained on vast volumes of high-quality, diverse training data. Such data ensure that the models can understand and respond to a wide array of linguistic inputs, patterns, and nuances, resulting in more accurate and sophisticated natural language interactions.
Such FMs abound. For instance, Google’s Bidirectional Encoder Representations from Transformers (BERT) (Devlin et al., 2018) is trained by masked language modeling using the entire English Wikipedia corpus as well as a book corpus, roughly 3.3 billion tokens in total. Meta’s recent Large Language Model Meta AI (LLaMA) (Touvron, Lavril et al., 2023) is trained strictly on public data that are orders of magnitude larger (1.4 trillion tokens). Its successor, LLaMA 2 (Touvron, Martin et al., 2023), expands and scales the training data to two trillion tokens. Finally, OpenAI’s Generative Pre-trained Transformer 4 (GPT-4) (OpenAI, 2023) avoids public specification of its training data. Although training data source specificity is needed for reproducibility and some level of transparency, reluctance to specify the precise data sources or training methods used might reflect concerns about copyright infringement cases working their way through the courts (Samuelson, 2023).
Legal issues aside, these publicly available large language models are providing tools to previously resource-constrained, ‘locked out’ developers and researchers, enabling their participation and contribution. Moreover, with techniques such as model distillation, namely, the transfer of knowledge from larger models to smaller ones, and parameter-efficient training, even resource-constrained organizations can potentially meet the resource demands. Why, then, does the availability of FMs further necessitate democratizing data? After all, FMs capture the essence of the data without requiring users to access the data themselves. Would the availability of FMs not eliminate the need for public data by capturing data content in a sharable, encoded manner? The answer rests in the differing ways FMs are used.
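To illustrate distillation concretely, the following minimal sketch, assuming PyTorch, trains a small ‘student’ to match the softened output distribution of a large ‘teacher’; the toy networks, temperature, and random data are hypothetical choices of mine, not any vendor’s recipe:

```python
# Minimal knowledge distillation sketch (assumes PyTorch; toy models and data).
import torch
import torch.nn.functional as F

teacher = torch.nn.Sequential(torch.nn.Linear(128, 512), torch.nn.ReLU(),
                              torch.nn.Linear(512, 10))   # large, pretrained stand-in
student = torch.nn.Sequential(torch.nn.Linear(128, 32), torch.nn.ReLU(),
                              torch.nn.Linear(32, 10))    # small, cheap-to-serve model
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
temperature = 2.0  # softens the teacher's distribution, exposing relative preferences

for _ in range(100):                      # toy loop over random stand-in inputs
    x = torch.randn(64, 128)
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(x) / temperature, dim=-1)
    student_log_probs = F.log_softmax(student(x) / temperature, dim=-1)
    # KL divergence pulls the student's predictions toward the teacher's.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```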
Common FMs are general in nature. They address a wide scope of topics; colloquially, they are generalists. To support this breadth, they are trained using data covering a wide diversity of topics. They perform a wide diversity of tasks; they answer a wide diversity of questions.
There are, however, specialists. Typically, these specialist FMs start with pretrained generalist models and are further trained using domain-specific data. For example, BERT can be viewed as a generalist FM, while astroBERT (Grezes et al., 2021), ClinicalBERT (Huang et al., 2019), BioBERT (Lee et al., 2020), FinBERT (Araci, 2019), and SciBERT (Beltagy et al., 2019), all BERT derivatives, are specialists, having been additionally trained on astronomy documents, medical notes, general and biomedical domain corpora, a financial corpus, and scientific texts, respectively.
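The recipe behind such specialists is conceptually simple: continue the generalist’s pretraining objective on domain text. The following minimal sketch assumes the Hugging Face transformers and datasets libraries and a hypothetical domain_corpus.txt of, say, clinical notes; the file name and hyperparameters are illustrative, not those of the cited systems:

```python
# Minimal continued-pretraining sketch: generalist -> specialist via masked
# language modeling on a domain corpus (assumes `transformers` and `datasets`;
# the corpus file and hyperparameters are hypothetical).
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # generalist start

corpus = load_dataset("text", data_files={"train": "domain_corpus.txt"})["train"]
corpus = corpus.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
                    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="specialist-bert", num_train_epochs=1),
    train_dataset=corpus,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()  # further domain training yields a specialist FM
```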
There is no magic behind the desire for specialists; specialists are simply better equipped to handle concerns within their specialized field than are generalists. We, as humans, make such distinctions daily when we seek medical advice. One readily goes to a primary care physician (generalist) when suffering from the presumed common cold; however, one seeks advice from a cardiologist or pulmonologist (specialists) when suffering from heart or lung ailments. Foundational models are no different; studies have shown accuracy gains by specialists for question answering and prediction within their given specialty. However, to train specialists, domain data are needed, and obtaining such data motivates data democratization.
Efforts that democratize FM encodings exist. That is, rather than making the data available, the encodings that drive the language models are distributed. Specifically, companies are freely providing their language models to the general public, at times restricted to noncommercial use. While the availability of these encodings, and hence models, does enable experimentation by others, as the underlying data actually driving the encodings are unavailable, it is not possible to validate the generated results. Science-based decision-making requires the ability to validate, necessitating data availability.
Another motivator for the availability of data is hallucination. Simply put, hallucination in the FM context is fabrication, namely, the fictitious creation of information. Fabrication is acceptable, and maybe even touted, in many contexts, for example, children’s storytelling and poetry. However, for domains such as scientific discovery and policy decision-making, areas where one seeks democratized data, reality (facts and figures) and transparency are key; fiction is not tolerated. Search engines and scientific predictive systems, for example, drug efficacy predictors, confront such issues regularly, and their typical solution is grounding, namely, source attribution. By identifying the information sources used to derive the answer or prediction, verification is possible. Without the possibility of verifying the inference, confidence in the response is and must be low. Grounding, however, requires the source data, yet another demand for data democratization.
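A minimal sketch of grounding follows, assuming scikit-learn; the three-document corpus, identifiers, and query are entirely hypothetical. The point is architectural: every response carries the identifiers of the stored sources that support it, so a reader can verify the inference:

```python
# Minimal source-attribution sketch (assumes scikit-learn; toy corpus and query).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = {  # hypothetical democratized source documents
    "doc-001": "Drug A reduced readmission rates among trial patients over 65.",
    "doc-002": "Drug B showed no significant effect on readmission rates.",
    "doc-003": "Query logs improved ranking quality in web search engines.",
}
doc_ids = list(corpus)
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus.values())

def grounded_evidence(query, k=2):
    """Return the top-k supporting sources so a claim can be verified."""
    scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: -pair[1])[:k]
    return [(doc_id, corpus[doc_id]) for doc_id, score in ranked if score > 0]

for doc_id, text in grounded_evidence("Does Drug A affect readmission?"):
    print(f"[{doc_id}] {text}")  # citation plus evidence enables verification
```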
A word of caution is now warranted. Not all data are good data; some data sources are flawed; some data inadvertently propagate bias. Thus, data democratization is valuable only if the data themselves are credible and verified. Unfortunately, the wide availability of FMs is impacting the validity of some of the available data used for evaluation. For example, human rater data are often used in evaluation, for both research and commercial purposes, and, at least in research practice, crowdsourcing is widely used for labeling and annotating data. As described in Veselovsky et al. (2023), however, a significant number of the crowd-sourced laborers who generate these ratings rely on LLMs to label the data. This practice casts doubt on the quality of the derived, possibly hallucinated, data, necessitating the availability of authenticated and validated foundational data.
An ongoing example of societal and scientific gains as a direct consequence of democratizing data relates to the release of Taiwan’s National Health Insurance Research Database (NHIRD) (Hsieh et al., 2019). NHIRD is a research database containing claims data from Taiwan’s National Health Insurance system, which covers more than 99% of Taiwan’s population. Nearly all outpatient and inpatient encounters are included, now spanning more than two decades. Given proper cause and authorization, researchers can obtain de-identified samples of the NHIRD (which still contain some demographic information), selected either by patient count (say, roughly two million) or by specific disease. Full-population data sets are likewise possible.
Many studies have used NHIRD data, and clinical advancements have been derived from them. For example, a prescription efficacy predictor was invented (Frieder et al., 2022). Using a graph-based representation of patient records, a multigraph kernel approach was developed. Via deep learning, an inference engine relying on digital twins predicts the likely efficacy of a given medication for a given patient suffering from a particular ailment (Yao et al., 2019). The availability of NHIRD data enabled nonmedical practitioners to develop treatments for clinical use. Additional NHIRD studies focus on dementia (Fan et al., 2023), pancreatic cancer (Lee et al., 2022), and other ailments and treatments (Sung et al., 2020).
Finally, let me dispel a periodically touted misconception: ‘one needs to integrate data to earn greater benefit from them, and to integrate them implies their centralization.’ The first clause is correct; however, integrating multiple sources of data need not imply their centralization. While integration of multiple sources potentially yields greater inference capability, with available data, techniques such as federated (collaborative) learning can be deployed. Federated learning is a machine learning technique where each data source is used to train a model independently and the learned models are then merged to yield global inferencing capability. Thus, democratized data distributed across multiple sites, each retaining independent control, can still enable global learning.
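A minimal sketch of this process, federated averaging, follows, assuming PyTorch; the linear model, three synthetic ‘sites,’ and hyperparameters are all hypothetical. Note that only model weights, never raw records, leave each site:

```python
# Minimal federated averaging (FedAvg) sketch: sites train locally; only
# weights are shared and merged (assumes PyTorch; toy model and data).
import copy
import torch
import torch.nn.functional as F

def local_update(global_model, x, y, epochs=5):
    """Train a copy of the global model on one site's private data."""
    model = copy.deepcopy(global_model)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(epochs):
        opt.zero_grad()
        F.mse_loss(model(x), y).backward()
        opt.step()
    return model.state_dict()

def federated_average(states):
    """Merge per-site models by averaging their parameters."""
    merged = copy.deepcopy(states[0])
    for key in merged:
        merged[key] = torch.stack([s[key] for s in states]).mean(dim=0)
    return merged

global_model = torch.nn.Linear(10, 1)                      # illustrative model
sites = [(torch.randn(32, 10), torch.randn(32, 1)) for _ in range(3)]  # 3 sites
for _ in range(10):                                        # communication rounds
    states = [local_update(global_model, x, y) for x, y in sites]
    global_model.load_state_dict(federated_average(states))
```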
As artificial intelligence further permeates daily activities, society’s anxieties related to it increase. With this increasing angst, society will continue to demand greater understandability, fairness, and accountability in AI-driven processes. Policies that promote fair and appropriate use of data will be proposed, discussed, and, eventually, enacted; by necessity, these policies will need to address data curation, storage, access, usage, duration, and privacy concerns.
To improve decision-making, science, and generally society while ensuring privacy, reliable data should be democratized. Democratizing such data would enhance trust in decision-making as transparency typically exposes, and hence reduces, the possibility of bias. The availability of data increases the potential for open scientific processes by enabling initial discovery and then repeatability for verification; it likewise enables others, namely, those besides the current data owners, to participate and contribute.
Data availability enables more targeted use of AI via specialized FMs. As previously discussed, specialized FMs serve to better explore and understand specific domains, and in some cases, provide explainable, namely justified, answers.
Finally, with data availability comes innovation. Democratizing safe, accurate, privacy-protecting data potentially opens many possibilities that we have not yet begun to comprehend. Bluntly stated, democratizing data furthers research, science, and the innovation that builds on both.
Notwithstanding the vast benefits of data democratization, I must conclude with a word of caution. Data must be democratized systematically and with care; haphazard release can lead to harm. Data might contain explicit or implicit personally identifiable information (PII); privacy-preserving techniques such as differential privacy (Dwork, 2006) can be considered a means to reduce PII leakage. Data might likewise be toxic or propagate bias; semantic analysis and hate speech detection techniques (MacAvaney et al., 2019) can be deployed to combat such harms. Other harms, such as, but not limited to, copyright violations, are similarly possible. So, democratize data, but do so carefully.
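As one concrete instance of such a safeguard, the following minimal sketch, assuming NumPy, applies the Laplace mechanism from differential privacy (Dwork, 2006) to a counting query; the count, epsilon, and seed are illustrative:

```python
# Minimal Laplace-mechanism sketch (assumes NumPy; values are illustrative).
import numpy as np

rng = np.random.default_rng(seed=0)

def laplace_count(true_count, epsilon=1.0, sensitivity=1.0):
    """Release a count with epsilon-differential privacy.

    A counting query changes by at most 1 when one person is added or
    removed (sensitivity = 1); smaller epsilon means stronger privacy.
    """
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Hypothetical example: number of patients with a given diagnosis.
print(laplace_count(12_345, epsilon=0.5))
```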
I am thankful for the insightful comments made by: Arman Cohan, Nazli Goharian, Sean MacAvaney, Eugene Yang, Hao-Ren Yao, Andrew Yates, and the anonymous reviewers. Their suggestions vastly improved this article.
Ophir Frieder has no financial or non-financial disclosures to share for this article.
Araci, D. (2019). FinBERT: Financial sentiment analysis with pre-trained language models. ArXiv. https://arxiv.org/abs/1908.10063
Beltagy, I., Lo, K., & Cohan, A. (2019). SciBERT: A pretrained language model for scientific text. ArXiv. https://arxiv.org/abs/1903.10676
Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., Brynjolfsson, E., Buch, S., Card, D., Castellon, R., Chatterji, N., Chen, A., Creel, K., Quincy Davis, J., Demszky, D., ... Liang, P. (2021). On the opportunities and risks of foundation models. ArXiv. https://arxiv.org/abs/2108.07258
Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv. https://arxiv.org/abs/1810.04805
Dwork, C. (2006). Differential privacy. In M. Bugliesi et al. (Eds.), International Colloquium on Automata, Languages, and Programming: ICALP 2006 (pp. 1–12). Springer Berlin Heidelberg.
Fan, Y. C., Lin, S. F., Chou, C. C., & Bai, C. H. (2023). Developmental trajectories and predictors of incident dementia among elderly Taiwanese people: A 14-year longitudinal study. International Journal of Environmental Research and Public Health, 20(4), Article 3065. https://doi.org/10.3390/ijerph20043065
Frieder, O., Yao, H.-R., & Chang, D.-C. (2022). Method and system for assessing drug efficacy using multiple graph kernel fusion (US Patent No. 11,238,966). U.S. Patent and Trademark Office. https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/11238966
Grezes, F., Blanco-Cuaresma, S., Accomazzi, A., Kurtz, M. J., Shapurian, G., Henneken, E., Grant, C. S., Thompson, D. M., Chyla, R., McDonald, S., Hostetler, T. W., Lockhart, K. E., Martinovic, N., Chen, S., Tanner, C., & Protopapas, P. (2021). Building astroBERT, a language model for astronomy & astrophysics. ArXiv. https://arxiv.org/abs/2112.00590
Hsieh, C. Y., Su, C. C., Shao, S. C., Sung, S. F., Lin, S. J., Kao Yang, Y. H., & Lai, E. C. (2019). Taiwan's National Health Insurance Research Database: Past and future. Clinical Epidemiology, 11, 349–358. https://doi.org/10.2147/CLEP.S196293
Huang, K., Altosaar, J., & Ranganath, R. (2019). ClinicalBERT: Modeling clinical notes and predicting hospital readmission. ArXiv. https://arxiv.org/abs/1904.05342
Johnson, A. E. W., Bulgarelli, L., Shen, L., Gayles, A., Shammout, A., Horng, S., Pollard, T. J., Hao, S., Moody, B., Gow, B., Lehman, L. H., Celi, L. A., & Mark, R. G. (2023). MIMIC-IV, a freely accessible electronic health record dataset. Scientific Data, 10, Article 1. https://doi.org/10.1038/s41597-022-01899-x
Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L. W., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Celi, L. A., & Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3, Article 160035. https://doi.org/10.1038/sdata.2016.35
Lee, H. A., Chen, K. W., & Hsu, C. Y. (2022). Prediction model for pancreatic cancer: A population-based study from NHIRD. Cancers, 14(4), Article 882. https://doi.org/10.3390/cancers14040882
Lee, J., Scott, D. J., Villarroel, M., Clifford, G. D., Saeed, M., & Mark, R. G. (2011). Open-access MIMIC-II database for intensive care research. In 2011 Annual International Conference of the IEEE Engineering in Medicine and Biology Society (pp. 8315–8318). IEEE. https://doi.org/10.1109/IEMBS.2011.6092050
Lee, J., Yoon, W., Kim, S., Kim, D., Kim, S., So, C. H., & Kang, J. (2020). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240. https://doi.org/10.1093/bioinformatics/btz682
MacAvaney, S., Yao, H. R., Yang, E., Russell, K., Goharian, N., & Frieder, O. (2019). Hate speech detection: Challenges and solutions. PLoS ONE, 14(8), Article e0221152. https://doi.org/10.1371/journal.pone.0221152
Moody, G. B., & Mark, R. G. (1996). A database to support development and evaluation of intelligent intensive care monitoring. Computers in Cardiology, 23, 657–660. https://doi.org/10.1109/CIC.1996.542622
National Institute of Standards and Technology. (2023). Text REtrieval Conference. U.S. Department of Commerce. https://trec.nist.gov/
OpenAI. (2023). GPT-4 technical report. ArXiv. https://arxiv.org/abs/2303.08774
Pass, G., Chowdhury, A., & Torgeson, C. (2006). A picture of search. In X. Jia (Ed.), InfoScale '06: Proceedings of the 1st International Conference on Scalable Information Systems (pp. 1–es). ACM. https://doi.org/10.1145/1146847.1146848
Samuelson, P. (2023). Legal challenges to generative AI, Part I. Communications of the ACM, 66(7), 20–23. https://doi.org/10.1145/3597151
Sung, S. F., Hsieh, C. Y., & Hu, Y. H. (2020). Two decades of research using Taiwan's National Health Insurance Claims Data: Bibliometric and text mining analysis on PubMed. Journal of Medical Internet Research, 22(6), Article e18457. https://doi.org/10.2196/18457
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., & Lample, G. (2023). LLaMA: Open and efficient foundation language models. ArXiv. https://arxiv.org/abs/2302.13971
Touvron, H., Martin, L., Stone, K. R., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D. M., Blecher, L., Ferrer, C. C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., ... Scialom, T. (2023). Llama 2: Open foundation and fine-tuned chat models. ArXiv. https://arxiv.org/abs/2307.09288
Veselovsky, V., Ribeiro, M. H., & West, R. (2023). Artificial artificial artificial intelligence: Crowd workers widely use large language models for text production tasks. ArXiv. https://arxiv.org/abs/2306.07899
The world’s most valuable resource is no longer oil, but data. (2017, May 6). The Economist. https://www.economist.com/leaders/2017/05/06/the-worlds-most-valuable-resource-is-no-longer-oil-but-data
Yao, H.-R., Chang, D.-C., Frieder, O., Huang, W., & Lee, T. (2019). Multiple graph kernel fusion prediction of drug prescription. In X. Shi et al. (Eds.), Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics (pp. 103–112). ACM. https://doi.org/10.1145/3307339.3342134
©2024 Ophir Frieder. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.