Like many others, I have been affected personally and professionally by the COVID pandemic in profound ways. Concerned that social isolation might lead to physical or mental atrophy, I paired my solitary walks with episodes of “History of Philosophy Without Any Gaps,” hosted by philosopher Peter Adamson. During an episode about Plato’s Republic, I had to stop and replay when I heard that democracy was considered the second worst form of government, ranked below timocracy (rule by the honored/valued) and oligarchy (rule by a few who love money-making), and, worse, that “the natural next step from democracy is tyranny” (Adamson, 2014, p. 150). For those of us who value and enjoy democracy, such thinking and assertions are shocking, or at least disturbing. Dismissing them as relics of the past would give many of us peace of mind, and it would have allowed my walk to continue with equanimity.
However, Adamson’s (2014) delineation of Plato’s reasoning remains on my mind to this day, even if I could conveniently ignore the increased “doomsday” feeling across the ideological spectrum that the 2024 U.S. election outcomes could jeopardize democracy itself, as discussed in my last editorial (Meng, 2024). For any ecosystem, whether human or digital, when many groups and individuals consider themselves stakeholders of one kind or another, there must be strong balancing forces operating at the individual or ground level to sustain an equilibrium. For the data science community, the phenomenon of regression toward the mean may serve as a particularly assimilable reminder of the existence of such balancing forces, whether or not we understand or can even name them. If no proper force exists to pull individuals or groups in the extreme tails ‘toward the mean,’ then their extreme tendencies will, by default, intensify progressively until the existing equilibrium breaks.
Breaking the present equilibrium may well be evolutionarily necessary for an ecosystem to survive, especially when disruptive forces emerge. Indeed, democratizing data has become essential because of the boom in data science and, most recently, generative AI. As Julia Lane and Nancy Potok (2024), two of the three co-editors of this special issue on “Democratizing Data: Discovering Data Use and Value for Research and Policy,” remind us in their preface, “Democratizing Data: Our Vision,” much of the data today are collected, controlled, and converted to valuable assets by private-sector companies “for their own purposes.” Most of these purposes include key elements that benefit human society, for the sake of sustaining the business; keeping consumers happy is in the self-interest of any legitimate for-profit business. Nevertheless, as an ecosystemic balancing force, we also need to build public counterparts that do the same but for our own purposes, which, again in the words of Lane and Potok (2024), is to help transform data produced by researchers and governments into “valuable assets for the public good.”
The essential role of balancing forces also underlies Christine L. Borgman and Amy Brand’s (2024) article, “The Future of Data in Research Publishing: From Nice to Have to Need to Have?” where they stress that the need to have open data is “at least partly in response to the opacity of artificial intelligence algorithms.” Similarly, Manish Parashar’s (2024) article, “Enabling Responsible Artificial Intelligence Research and Development Through the Democratization of Advanced Cyberinfrastructure,” reminds us that today “advances in AI R&D are very often tied to access to large amounts of computational power and data,” access that is currently possible only for a handful of large commercial entities. Concerns about privacy, civil rights, and civil liberties then naturally arise. Democratizing advanced cyberinfrastructure together with democratizing data can serve as a strong balancing force to mitigate these negative impacts and to “enable responsible AI R&D that benefits all” (Parashar, 2024).
Nancy Potok’s (2024) overview article, “Data Usage Information and Connecting With Data Users: U.S. Mandates and Guidance for Government Agency Evidence Building,” describes another emerging—and much welcomed—force for change. As Potok reminds us, the journey toward evidence-based policymaking can be traced back nearly a century, to the introduction of randomized trials and other principled methods for establishing scientific evidence that have directly informed public health and other policies. The Foundations for Evidence-Based Policymaking Act of 2018 (2019; often referred to as the Evidence Act), discussed in detail in Potok’s (2024) article, is a significant milestone in this journey, especially with its mandate for government agencies to directly engage with the users of their data assets and data sets.
The interview on “The View From Four Statistical Agencies” (on agriculture, economics, education, and science and engineering), also conducted by Nancy Potok (Potok et al., 2024), makes it clear that the Evidence Act already has a direct impact, and that the tools for engaging users and capturing interactions with them are generating great excitement.
To the many potential benefits of the agency-user engagement that Potok and the agencies’ leaders articulated (Potok et al., 2024), I’m eager to add one that has been on my wish list for three decades. To investigate problems with multiple imputation inference (Rubin, 1987) reported by researchers in several government agencies (e.g., Fay, 1992; Kott, 1992), I introduced the notion of uncongeniality (Meng, 1994) to study invalid statistical results caused by the incompatibility between the statistical model an agency (e.g., the U.S. Census Bureau) adopts for imputing missing data before release and the users’ procedures for analyzing the released data. This uncongeniality is inevitable in general because the agency must adopt some model to proceed, yet it has no control over how the data will be used or what kinds of approaches will be applied. However, the negative impact of uncongeniality can be reduced or even controlled if the agency has information on the models and procedures that are likely to be applied to the released data, and on which parts of the data are more likely to be analyzed (since not all users will analyze all the variables in the data).
But such usage information was, and still is, unavailable, and without it, correcting for uncongeniality is essentially impossible. The only ‘safe’ method available is a rather conservative “variance doubling” approach (Xie & Meng, 2017). As such, for decades I have been telling people that there is a whole missing area of study about collecting and analyzing metadata on how users analyze data. With the user engagement mandate from the Evidence Act, the pioneering Democratizing Data project, and all the data usage search and discovery methods and platforms being explored in this special issue, I now see the possibility of such a research area being developed before I take my permanent sabbatical!
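For readers who would like a concrete, if toy, illustration of what is at stake, below is a minimal numerical sketch in Python of Rubin’s (1987) rules for combining multiply imputed analyses, followed by the conservative ‘variance doubling’ safeguard; the estimates and variances are invented purely for illustration, and the doubling step is only a crude stand-in for the fuller analysis in Xie and Meng (2017).

```python
import numpy as np

def rubin_combine(estimates, variances):
    """Combine m completed-data estimates and variances via Rubin's (1987) rules."""
    m = len(estimates)
    theta_bar = np.mean(estimates)        # pooled point estimate
    u_bar = np.mean(variances)            # within-imputation variance
    b = np.var(estimates, ddof=1)         # between-imputation variance
    total = u_bar + (1 + 1 / m) * b       # Rubin's total variance estimate
    return theta_bar, total

# Invented numbers for m = 5 imputed data sets (purely illustrative).
estimates = [1.02, 0.97, 1.05, 0.99, 1.01]       # completed-data point estimates
variances = [0.040, 0.050, 0.045, 0.050, 0.042]  # completed-data variances

theta, t_rubin = rubin_combine(estimates, variances)
t_safe = 2 * t_rubin   # conservative 'variance doubling' under possible uncongeniality

print(f"MI estimate: {theta:.3f}")
print(f"Rubin's variance: {t_rubin:.4f}; doubled (conservative): {t_safe:.4f}")
```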
Achieving all these goals, however, will take immensely more than what is needed within a private company, precisely because the governance of a company is typically oligarchic or even timocratic. We can, of course, contemplate timocracy or oligarchy for achieving our goal: we can imagine giving the task to a central government or to a few centralized commercial entities. But as Ian Mulvany (2024) reminds us in his insightful and inspiring article, “Scale, Publishing, and Open Data,” the former suffers from (at least) the problem of scalability (he reminded us of the modern patent system). The latter can sustain themselves with resources generated from their profits, but then they would be “only able to allocate resources where profits can be made” (Mulvany, 2024).
A distributed system, involving many communities of experts, leaders, and builders, therefore becomes the viable alternative. Whereas Mulvany (2024) was speaking of the publishing world, where the shared medium is a journal, the pros and cons he articulated for such a system are rather general. This special issue, with 17 enlivening articles in addition to Lane and Potok’s preface, showcases the spectrum of expertise, perspectives, and workforces essential for democratizing data (and supporting enterprises such as cyberinfrastructure) via many distributed systems, and for executing it well, or at least for avoiding Plato’s ‘natural next step,’ metaphorically speaking.
Such a distributed enterprise forms a sub-ecosystem within the data science ecosystem, with its constituent communities co-evolving as we move deeper into the digital age, as discussed in my inaugural HDSR editorial, “Data Science: An Artificial Ecosystem” (Meng, 2019a). The articles in the second issue of HDSR (issue 1.2, Fall 2019) inspired me to explore at least five of what I called “Immersive 3D Surroundings” for this artificial ecosystem in the sequel editorial (Meng, 2019b), namely,
Sectors: academia, government, and industry;
Specialties: humanities, social sciences, and STEM;
Scales: spatial, temporal, and structural;
Sophistications: novice, apprentice, and expert;
Stances: philosophical, analytical, and practical.
Data democratization, being an ecosystem within the data science ecosystem, naturally inherits these surroundings, as revealed by various articles in this special issue. Collectively, however, the special issue reminds me of a missing, yet arguably the most immersive, surrounding for data democratization and, more broadly, for data science:
Stakeholders: funders, suppliers, and users.
This surrounding accentuates individuals, communities, and institutions, and its fundamentality lies in the fact that data democratization inherits the same creed as a democratic government; that is, it is of the people, by the people, and for the people. The remainder of this editorial will primarily be structured around the roles and value systems of the stakeholders involved. This approach aims to underscore the challenges, advancements, and opportunities unveiled in this special issue on “Democratizing Data: Discovering Data Use and Value for Research and Policy” as we delve into the complexity, realizability, and sustainability of data democratization.
This section title may sound platitudinous, but to me, as the editor-in-chief of HDSR, a diamond open-access platform, it also rings a daily bell—or rather a nightly alarm, if you ask what makes me lose sleep. The mission of HDSR is to feature “everything data science and data science for everyone,” an essential endeavor for democratizing data science. Creating no financial differentiation among authors or readers is therefore mandated by the mission itself. But there is no free lunch—someone must pay for it. Publishing with quality, in both content and delivery, is an expensive business. I am deeply grateful to hundreds of board members and thousands of anonymous reviewers for volunteering their time. However, production and marketing incur costs, and editorial staff need to be compensated, and compensated well, for their hard work, especially when everyone is overworked because the growth of the budget lags behind that of the platform.
Zooming out from my daily reminder to the monumental undertaking of democratizing data at a societal level, I am particularly delighted to see in this special issue the interview with officers from four philanthropic foundations by co-editor Julia Lane, which stresses “The Importance of Philanthropic Foundations in Democratizing Data” (Lane, Feldman, et al., 2024). The central questions Lane posed to each foundation reflect a democratizing fundraising strategy, which is the only sustainable approach for an ecosystemic project of this scale and with an ever-expanding horizon. The pairing of the questions “What are the foundation goals?” and “What attracted you to the Democratizing Data project?” reminds us of an obvious but easily overlooked fact: any funder is always a critical stakeholder. No matter how exciting or noble our project might be, how much a funder is willing to support it is governed by the funder’s goals (the first question) and their valuation of the project (the second question), not ours.
Being mindful that any funder is a stakeholder is not only critical for successful fundraising, but also essential for ensuring that our project will not be unduly influenced by any funder, whether intentionally or habitually. Lane’s third question, “What is your vision for the future?” offers the foundations an opportunity to be transparent about their desiderata moving forward (Lane, Feldman, et al., 2024). With an ecosystemic project like democratizing data, there should be subprojects that match funders’ goals and hence sustain the funding, while maintaining the integrity of the project and, most importantly, the independence of the data democratization process. In an era where disinformation and misinformation have become powerful weapons, keeping data independent is an ongoing struggle, as seen in the effort to independently monitor the U.S. statistical system (Auerbach, 2023). It is therefore prudent for us to build data quality and independence monitoring into data democratization as an integral part from the get-go.
Democratizing fundraising also means seeking all possible funding sources. Funding from governments will of course always be a large part of it. But as stakeholders, government agencies, just like philanthropic foundations, have their own mandates and value systems. While it is possible, at least in principle, to estimate the cost of creating a public-use data file by government agencies—say, the Landsat satellite image data by the National Aeronautics and Space Administration (NASA) and the United States Geological Survey (USGS)—it is an entirely different matter to assess the economic value and societal benefits of such data.
Indeed, in “A Mapping Lens for Estimating Data Value,” Abhishek Nagaraj (2024) argues that data have no inherent value because the value of a data set depends on what it is used for. This is a particularly salient feature of public data sets, which may be used by arbitrarily many users for arbitrarily many different purposes, some of which may not even be imaginable at the time of data creation (consider the many historically digitized texts now used for training large language models). Whereas I surmise few would disagree with Nagaraj’s assertion, not everyone has contemplated the value of a data set counterfactually. But that is exactly the insight that enables Nagaraj and other researchers to assess the value of a public-use data set for a particular purpose via a natural experiment.
For example, due to technical errors and cloud obscurations, the Landsat data coverage is not nearly as global as intended. Such data defects, however, effectively create a quasi-experiment for studying the impact of Landsat data on gold exploration by mining companies, because we can compare the discovery rates of gold deposits at locations without Landsat coverage (the control group) and at locations with coverage (the treatment group). The validity of such an approach depends, of course, on whether the mechanisms responsible for the technical errors or cloud formations are themselves indicative of gold deposits. In addition to this example from Nagaraj (2021), Nagaraj (2024) provides three more captivating examples. This example is also a golden illustration of the ecosystem’s sector surrounding, since it is about the value of a government data set to a private industry as assessed independently by an academic researcher. This ‘trinity’ is not coincidental, and indeed we shall see how it reveals itself naturally throughout the rest of this editorial, for example, as we leverage insights from business enterprises to build incentive systems for data suppliers for public consumption, for which both government agencies and academic institutions play multiple critical roles.
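To make the logic of this quasi-experiment concrete, here is a minimal sketch with entirely hypothetical coverage indicators and discovery rates (neither Nagaraj’s data nor his analysis), comparing gold discovery rates between covered and uncovered locations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)

# Hypothetical discovery indicators (1 = new deposit found) for 400 locations each;
# the 12% and 7% rates are made up solely to illustrate the comparison.
covered = rng.binomial(1, 0.12, size=400)    # 'treatment': Landsat imagery available
uncovered = rng.binomial(1, 0.07, size=400)  # 'control': coverage lost to errors or clouds

effect = covered.mean() - uncovered.mean()
t_stat, p_value = stats.ttest_ind(covered, uncovered)

print(f"Discovery rate with coverage:    {covered.mean():.3f}")
print(f"Discovery rate without coverage: {uncovered.mean():.3f}")
print(f"Estimated effect of coverage: {effect:.3f} (two-sample t test p = {p_value:.3g})")
```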
The article “A Practical Use Case: Lesson Learned From Social Science Research Data Centers,” by Stefan Bender, Jannick Blaschke, and Christian Hirsch (2024), explores an even harder problem, namely, assessing the value and cost of both providing and using data from research data centers. Research data centers (RDCs) are facilities, “often at the premises of the data owner, that provide accredited researchers with safe access to sensitive granular data” (Bender et al., 2024). They thereby offer a viable way to protect privacy without sacrificing data utility, a sacrifice that many other means of protecting data privacy, such as injecting noise, do require (see the HDSR special issue on differential privacy for the 2020 U.S. Census).
Whereas the benefits of RDCs are not difficult to list, the cost-benefit analysis for RDCs is rather challenging. Bender et al. (2024) remind us that in the context of RDCs, the counterfactual approach is not an established one, because the uniqueness of the confidential data often implies that no reasonable substitutes exist—substitutes that could serve as a control in a natural (quasi) experiment. Bender et al. therefore call for further research on assessing the value of data from RDCs as an indispensable part of the cost-benefit analysis of RDCs, which are undoubtedly viable data suppliers for delivering confidential data for public consumption.
I choose ‘data suppliers’ as the second key stakeholder because of the encompassing nature of the term ‘supplier.’ That is, data suppliers cover all entities and individuals involved in the actual process of making data available to users, from data creators to data curators, and to data discoverers, trackers, and disseminators. This holistic categorization is meant to encourage systems thinking and planning to ensure a healthy evolution of the data democratization ecosystem, learning from the painful lessons of supply chain breakdowns during the COVID pandemic.
I was therefore engrossed by Julia Lane, Alfred Spector, and Michael Stebbins’s (2024) “An Invisible Hand for Creating Public Value From Data,” for it embodies ecosystemic planning to create incentives for public data suppliers, a vital pillar of the data democratization infrastructure. For readers who may wonder why, if data are the fuel of the digital age, extra incentives would still be required, Lane, Spector, and Stebbins’s article provides compelling reasons and concrete examples for both the government and academic sectors.
Government agencies need to justify their contributions and priorities by assessing the value of the data products they produce, which is a daunting task, as we have already discussed. However, before assessing the value of a data set, we first need to know who has used it, for what purposes, how it was used, and so on. As multiple articles in this special issue make abundantly clear, government agencies (at least in the United States) currently have a poor understanding of the usage of their data products. Addressing such challenges is essential for building an incentive for the government sector, but, as I shall summarize shortly, it also provides an incentive for the academic research community, because these challenges stimulate methodological and theoretical advances, which are the ultimate golden apples for academic researchers.
I am also mindful, however, that the current reward systems in academia focus significantly more on theory, methods, and final results than on being a data supplier or, more generally, an intermediary, regardless of how critical such unsung roles might be. This leads to the syndrome of “Everyone wants to do the model work, not the data work,” as discussed in Lane, Spector, and Stebbins (2024). Our own practice and observations, especially among those of us with substantial publishing and editorial experience, suggest that we have a long way to go before most researchers produce data sets when they produce results. The estimated rate of less than 1% for this co-production among ORCID records, as cited in Iratxe Puebla and Daniella Lowenberg’s (2024) article, “Building Trust: Data Metrics as a Focal Point for Responsible Data Stewardship,” is therefore not surprising.
However, as multiple articles in this special issue—including Puebla and Lowenberg (2024)—emphasize, open accessibility to data is critical but only the first step toward data democratization. Ensuring that data can be used, and used correctly, takes significantly more than just making the data available. Data supply is a process that involves many human judgments, from what to collect (every data collection has its scope), to how to pre-process (e.g., how to impute missing values), to what to disseminate (e.g., due to computational constraints). Human judgments are fallible and context dependent. With a record of the judgments, who made them, and in what context, subsequent users will at least have a chance to be informed and to be aware of any negative consequences.
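As a sketch of what such a record of judgments might look like in practice, here is a hypothetical, minimal metadata structure in Python; the fields and the example entry are my own invention, not a standard proposed in this special issue.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class JudgmentRecord:
    """One documented human judgment made while preparing a data set for release."""
    stage: str                      # e.g., "collection scope", "imputation", "dissemination"
    decision: str                   # what was decided
    rationale: str                  # why, including constraints (budget, computation, privacy)
    made_by: str                    # person or team responsible
    made_on: date                   # when the decision was made
    known_limitations: list[str] = field(default_factory=list)

# A hypothetical entry illustrating how an imputation choice could be disclosed to users.
record = JudgmentRecord(
    stage="imputation",
    decision="Missing income values imputed via a log-normal regression model",
    rationale="Chosen for compatibility with the agency's existing production pipeline",
    made_by="Data curation team (hypothetical)",
    made_on=date(2024, 1, 15),
    known_limitations=["May be uncongenial to users' nonparametric analyses"],
)
print(record)
```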
It will take generations before we can reach this level of metadata, because creating such a culture will require the (data) scientific community to engage in “Self-Correction by Design,” as Marcia McNutt, the President of the National Academy of Sciences and former Editor-in-Chief of Science, wrote for the special theme in HDSR on scientific reproducibility and replicability (McNutt, 2020). By “by design,” McNutt (2020) envisioned a cultural change in our education to adopt an approach that includes “educating students to perform and document reproducible research, sharing information openly, validating work in new ways that matter, creating tools to make self-correction easy and natural, and fundamentally shifting the culture of science to honor rigor.”
Taking McNutt’s call as an aspiration for data democratization as well, I am very encouraged by the host of calls and strategies detailed in this special issue to motivate researchers to supply data settings in addition to data sets, to borrow this fitting contrast from the great book by Yanni Loukissas (2019), which emphasizes that “all data are local.” In particular, Lane, Spector, and Stebbins’s (2024) “invisible hand” incentivizes academic researchers by creating a more efficient “market” for the collection, curation, and dissemination of public data, as demonstrated through the Democratizing Data project. Lane, Spector, and Stebbins’s hope is to establish “data usage and value” as an academic currency in addition to publications and grants, currently the two dominant incentives for academic researchers.
The articles in the “Search and Discovery Methods” section and the “Practical Implementation” section of this special issue provide a fascinating introduction to the challenges and opportunities for methodological and theoretical researchers. Reading the titles alone inspired me, as an educator and researcher, to contemplate a seminar series or even a course, so that I can learn and make myself useful to the data democratization enterprise, instead of just preaching about it.
Ryan Hausen and Hosein Azarbonyad’s (2024) article, “Discovering Data Sets Through Machine Learning: An Ensemble Approach to Uncovering the Prevalence of Government-Funded Data Sets,” is an excellent starting point for understanding the scope of the problems in tracking the usage of government-funded data sets in scientific publications, the use of natural language processing to make headway, and the remaining challenges to be addressed. In “Turning Visions Into Reality: Lessons Learned From Building a Search and Discovery Platform,” Attila Emecz, Arik Mitschang, Christina Zdawczyk, Maytal Dahan, Jeroen Baas, and Gerard Lemson (2024) describe the process and a host of challenges of implementing a machine-learning-fueled platform for such tracking, from defining the search corpus for data assets and data sets to validation, and from producing APIs (application programming interfaces) to relational database management.
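To give a flavor of why even the seemingly simple ingredients of such tracking are nontrivial, here is a toy sketch of alias-based data set mention detection with fuzzy matching; it is emphatically not the ensemble method of Hausen and Azarbonyad (2024) nor the platform of Emecz et al. (2024), and the alias list and sample sentence are hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical aliases under which a single government data asset might be cited.
ALIASES = ["National Agricultural Statistics Service", "NASS Hog Report", "Census of Agriculture"]

def mentions_dataset(text, aliases=ALIASES, threshold=0.85):
    """Flag a passage if any alias appears exactly or as a close (fuzzy) match."""
    text_lower = text.lower()
    words = text_lower.split()
    for alias in aliases:
        alias_lower = alias.lower()
        if alias_lower in text_lower:          # exact substring match
            return True, alias
        k = len(alias_lower.split())           # compare against word windows of matching length
        for i in range(max(len(words) - k + 1, 0)):
            window = " ".join(words[i:i + k])
            if SequenceMatcher(None, window, alias_lower).ratio() >= threshold:
                return True, alias
    return False, None

sample = "Yields were benchmarked against the Census of Agricultre county tables."
print(mentions_dataset(sample))  # the fuzzy match catches the misspelled alias
```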
The article by Nick Pallotta, John M. Locklear, Xiangyu Ren, Victor Robila, and Adel Alaeddini (2024), “Discovering Data Sets in Unstructured Corpora: Discovering Use and Identifying New Opportunities,” addresses an even harder problem: tracking, or even just discovering, the use of government-funded data sets beyond scientific publications. The article provides a captivating case study of the many data assets and data sets (about 450 annually) provided by the U.S. Department of Agriculture’s National Agricultural Statistics Service (USDA’s NASS), such as Hog Reports or Crop Production Reports. These data assets directly impact U.S. agriculture and are used by a broad constituency, from staff to farmers, few of whom read, let alone publish in, what we academics consider scientific publications.
Yet that is still not the most challenging problem. The article by Katrina Sostek, Daniel M. Russell, Nitesh Goyal, Tarfah Alrashed, Stella Dugall, and Natasha Noy (2024) brings us to the jungle of the WWW—the wild west of the web: “Discovering Datasets on the Web Scale: Challenges and Recommendations for Google Dataset Search.” This editorial is already 4,000 words long, and I still have three more articles to cover. I will therefore not give any hints, to avoid curtailing readers’ wild imaginations in contemplating what it entails to navigate such a WWW.
I will, however, invite readers to a more philosophical contemplation: What constitutes a data set, and what do we mean by its usage? Prior to text mining, most of us would not have considered a collection of novels or textbooks to be a data set, but now such collections are readily labeled as data sets for machine learning. If I look at a data set without performing any formal analysis, and it confirms what I believed (or wanted to believe), does that count as a usage (or an abusage)? If not, how is it different from a policymaker looking at a data set and forming an opinion about the policy to be made? If it does, how do we discover and classify such usages?
In case some readers dismiss such contemplations as nugatory academic exercises, the article “Searching for How Data Have Been Used: Intuitive Labels for Data Search and Discovery,” by Christina Zdawczyk, Julia Lane, Emilda Rivers, and May Aydin (2024), may change their minds. It is also a great article for highlighting both the stance and specialty surroundings of the data democratization enterprise. Labeling and classification for discovery and tracking are surprisingly complex tasks because of the nuanced nature of forming and processing human knowledge, which is projected through rich yet ambiguous languages and represented by the necessarily coexisting revelatory and obfuscatory variations in data, capturing contextual similarities and idiosyncrasies, respectively. Philosophical and conceptual contemplation, contextual and analytical distillation, and practical trade-off optimization and implementation are all indispensable for carrying out these tasks successfully, especially at large scale and with lasting stability.
A healthy evolution of the data democratization ecosystem requires ongoing coordination among the data suppliers and, equally important, effective and sustainable collaboration between data suppliers and users. The Evidence Act’s mandate of user engagement is therefore particularly welcome. Lauren Chenarides’s (2024) “Creating Engagements: Bringing the User into Data Democratization” provides a general framework via the Theory of Change, as well as specific activities and examples for effective user engagement. It is truly a thought-provoking and action-promoting article, with many of its recommendations having been investigated and tested in a variety of fields where user engagement is essential, from marketing to healthcare to information systems.
The word “trust” appears many times in Chenarides’s (2024) article. Explaining the necessity of building trust for user engagement would be an insult to readers’ intelligence, but it is worth emphasizing that, for a healthy evolution of the data democratization ecosystem, “trust must go in both directions,” to quote the aforementioned article on RDCs by Bender et al. (2024). For RDCs, the risks for the data suppliers are generally well understood and studied, as evident from the “Five Safes” framework (safe project, safe people, safe data, safe setting, and safe outputs) reviewed in Bender et al. (2024). But as Bender et al. point out, there are also risks for the users, such as data suppliers’ censorship of undesirable topics, insufficient documentation of data quality, incorrect output checking, and misuse of researchers’ potential analysis ideas. Any such problems can deter users or at least reduce the value of RDCs as an integral part of data democratization. Establishing users’ trust in RDCs as data suppliers then becomes as important as the data suppliers’ trust in the users.
The issue of trust is also at the heart of the article by Iratxe Puebla and Daniella Lowenberg (2024), “Building Trust: Data Metrics as a Focal Point for Responsible Data Stewardship.” The article highlights the Make Data Count Summit held in September 2023, which is a great example of how data suppliers, users, and funders can convene to address critical issues in democratizing data. In this case, the focus is on building trustworthy metrics, because “We need standardized evaluations of data usage to assess whether and how open data advances policy, science, and society, and this can only happen through evidence-based data metrics” (Puebla & Lowenberg, 2024). Establishing evidence-based data metrics for data democratization ensures that we practice what we preach as a data science community. Most importantly, only through such proper evaluation can we reliably assess the health of the data democratization ecosystem and take actions as necessary and in time.
Whereas few of us would worry that democratizing data could evolve into tyrannizing data, whatever that means, Plato’s thought that democracy can deteriorate into something worse is not something we should completely dismiss. Indeed, the article by Ophir Frieder (2024), “On Democratizing Data: Diminishing Disparity and Increasing Scientific Productivity,” provides repeated warnings that data democratization “must be systematic and cautious to avoid potentially inflicting harm.”
A key concern discussed in Frieder (2024) is data quality—having more bad data can easily do more harm than having no data. As Frieder reminds us, “Not all data are good data; some data sources are flawed; some data inadvertently propagate bias.” Frieder cites a rather alarming recent study. Human rater data are essential for evaluating and validating various machine learning procedures, and crowdsourcing is widely used for such human labor, which is effectively a form of democratizing the data-gathering process. However, Veselovsky et al. (2023) reported that a sizable fraction of crowdsourced human raters used large language models (LLMs) to carry out their labeling tasks. This practice not only defeats the purpose of employing human raters but also subjects the data to problems such as hallucination.
In general, data science relies on the quality of data to deliver on its promise, especially with black-box learning algorithms. To truly empower individuals and communities through data access and utilization, it is imperative to prioritize the integrity and reliability of the data being disseminated. This requires a concerted effort across sectors and stakeholders to recognize the importance of data quality professionals and to establish mechanisms that ensure transparency and accountability in data supply. Scholarly journals, for example, can play a pivotal role in incentivizing researchers to prioritize data quality. By mandating a section on data quality in research publications, along with a “data confession” (Meng, 2021) component in which researchers are required to disclose known limitations and defects in their data sets, we can gradually build a culture in which data quality receives attention comparable to that given to methodological development or theoretical advances.
Another way to systematically improve data quality is to provide professional recognition and support for those working in this field. Just as there are professional societies for various specialties, there should be societies dedicated to data quality. This would provide a platform for professionals to network, share best practices, and advocate for the importance of their work.
Additionally, akin to neighborhood crime watches or consumer reports, there should be mechanisms in place for vigilant oversight of data quality, such as ‘data watchers’ who monitor and report on discrepancies or errors in public data sets. Similarly, drawing lessons from the business sector, where consumer feedback drives quality improvement, there should be a focus on gathering and addressing users' complaints about data quality. This could involve initiatives like data marketing research to better understand user needs and preferences, which may ultimately lead to more user-centric data products and services.
In essence, to achieve its goal for the public good, democratizing data needs to be about democratizing reliable data, which cannot be accomplished without a robust foundation of data quality assurance. Recognizing the importance of data quality professionals, instituting mechanisms for oversight and improvement, and integrating data quality considerations into research, publications, and organizational practices are all steps we can take to ensure that data democratization will truly serve the interests of all its stakeholders, especially the general public.
I am deeply grateful to the three co-editors of this special issue, Julia Lane, Nancy Potok, and Attila Emecz, for their heroic effort in putting together this special issue, and most importantly, for being a central force in data democratization. I also thank the entire HDSR staff team, Rebecca McLeod and Amara Deis, for their tireless work around the clock to ensure the prompt launch of this special issue.
I also thank ChatGPT, whether it is intelligent or not, for helping to correct my Chinglish and to improve my writing (at least I hope). Here is a typical example of how I solicited ChatGPT’s help: the initial title of this editorial was “Data Democratization: An Ecosystemic Contemplation and Planning.” Because of my love for alliteration, I asked ChatGPT to suggest some synonyms of ‘planning’ that start with the letter C. ChatGPT-4 came back with the word ‘coordination,’ which I at once adopted because it captures, better than ‘planning’ does, a key feature of an ecosystem, namely the co-evolution of its inhabitants. Nevertheless, all intelligences displayed in this editorial are human, including mine.
Note on the Updated Editorial as of April 12, 2024: The phrase "18 enlivening articles" is changed to "17 enlivening articles" for this update, because one article (posted with abstract only) requires more time to process than initially anticipated. To maintain the consistency of the online version and the printed version of this special issue, which has a strict production timeline, this article now will be processed separately.
Xiao-Li Meng has no financial or non-financial disclosures to share for this editorial.
Adamson, P. (2014). Classical philosophy: A history of philosophy without any gaps (Vol. 1). Oxford University Press.
Auerbach, J. (2023). Safeguarding facts in an era of disinformation: The case for independently monitoring the U.S. statistical system. Harvard Data Science Review, 5(3). https://doi.org/10.1162/99608f92.5cc8971c
Bender, S., Blaschke, J., & Hirsch, C. (2024). A practical use case: Lesson learned from social science research data centers. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.8a2f4507
Borgman, C. L., & Brand, A. (2024). The future of data in research publishing: From nice to have to need to have? Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.b73aae77
Chenarides, L. (2024). Creating engagements: Bringing the user into data democratization. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.24c26aa9
Emecz, A., Mitschang, A., Zdawczyk, C., Dahan, M., Baas, J., & Lemson, G. (2024). Turning visions into reality: Lessons learned from building a search and discovery platform. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.d8a3742f
Fay, R. E. (1992). When are inferences from multiple imputation valid? In Proceedings of the Survey Research Methods Section (1992) (pp. 227–232). American Statistical Association. http://www.asasrms.org/Proceedings/papers/1992_034.pdf
Foundations for Evidence-Based Policymaking Act of 2018, Pub. L. No. 115-435, 132 Stat. 5529 (2019). https://www.congress.gov/bill/115th-congress/house-bill/4174
Frieder, O. (2024). On democratizing data: Diminishing disparity and increasing scientific productivity. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.44746f24
Hausen, R., & Azarbonyad, H. (2024). Discovering data sets through machine learning: An ensemble approach to uncovering the prevalence of government-funded data sets. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.18df5545
Kott, P. S. (1992). A note on a counter-example to variance estimation using multiple imputation [Technical report]. United States National Agricultural Statistics Service.
Lane, J., & Potok, N. (2024). Democratizing data: Our vision. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.106473cf
Lane, J., Feldman, S., Greenberg, J., Sotsky, J., & Dhar, V. (2024). The importance of philanthropic foundations in democratizing data. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.3f34436c
Lane, J., Spector, A., & Stebbins, M. (2024). An invisible hand for creating public value from data. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.03719804
Loukissas, Y. A. (2019). All data are local: Thinking critically in a data-driven society. MIT Press.
McNutt, M. (2020). Self-correction by design. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.32432837
Meng, X.-L. (1994). Multiple-imputation inferences with uncongenial sources of input (with discussions). Statistical Science, 9(4), 538–573. https://doi.org/10.1214/ss/1177010269
Meng, X.-L. (2019a). Data science: An artificial ecosystem. Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.ba20f892
Meng, X.-L. (2019b). Five immersive 3D surroundings of data science. Harvard Data Science Review, 1(2). https://doi.org/10.1162/99608f92.ab81d0a9
Meng, X.-L. (2021). Enhancing (publications on) data quality: Deeper data minding and fuller data confession. Journal of the Royal Statistical Society: Series A (Statistics in Society), 184(4), 1161–1175. https://doi.org/10.1111/rssa.12762
Meng, X.-L. (2024). 2024: A year of crises, change, contemplation, and commemoration. Harvard Data Science Review, 6(1). https://doi.org/10.1162/99608f92.239082d0
Mulvany, I. (2024). Scale, publishing, and open data. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.9e9ae62b
Nagaraj, A. (2021). The private impact of public data: Landsat satellite maps increased gold discoveries and encouraged entry. Management Science, 68(1), 564–582. https://doi.org/10.1287/mnsc.2020.3878
Nagaraj, A. (2024). A mapping lens for estimating data value. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.82f0de5a
Pallotta, N., Locklear, J. M., Ren, X., Robila, V., & Alaeddini, A. (2024). Discovering data sets in unstructured corpora: Discovering use and identifying new opportunities. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.77bfa1c9
Parashar, M. (2024). Enabling responsible artificial intelligence research and development through the democratization of advanced cyberinfrastructure. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.9469c089
Potok, N. (2024). Data usage information and connecting with data users: U.S. mandates and guidance for government agency evidence building. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.652877ca
Potok, N., Carr, P., Hamer, H., Rivers, E., & Stefanou, S. (2024). The view from four statistical agencies. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.7aadd8ff
Puebla, I., & Lowenberg, D. (2024). Building trust: Data metrics as a focal point for responsible data stewardship. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.e1f349c2
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. Wiley. https://doi.org/10.1002/9780470316696
Sostek, K., Russell, D. M., Goyal, N., Alrashed, T., Dugall, S., & Noy, N. (2024). Discovering datasets on the web scale: Challenges and recommendations for Google Dataset Search. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.4c3e11ca
Veselovsky, V., Ribeiro, M. H., & West, R. (2023). Artificial artificial artificial intelligence: Crowd workers widely use large language models for text production tasks. arXiv. https://doi.org/10.48550/arXiv.2306.07899
Xie, X., & Meng, X.-L. (2017). Dissecting multiple imputation from a multi-phase inference perspective: What happens when God's, imputer's and analyst's models are uncongenial? Statistica Sinica, 27(4), 1485–1545. https://doi.org/10.5705/ss.2014.067
Zdawczyk, C., Lane, J., Rivers, E., & Aydin, M. (2024). Searching for how data have been used: Intuitive labels for data search and discovery. Harvard Data Science Review, (Special Issue 4). https://doi.org/10.1162/99608f92.f1cbbfbb
©2024 Xiao-Li Meng. This editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the editorial.