Issue 2.2 / Spring 2020
Most of us like information and dislike uncertainties. The arrival of COVID-19 has accentuated this contrasting preference at all levels, from isolated individuals to international organizations. As individuals, we need information and guidance on how we can protect ourselves, our loved ones, and our communities. As organizations, we also need information to deal with the outbreak and plan for the aftermath. The uncertainties clouding the pandemic add a great deal of stress for all of us, and they make most planning extremely daunting because of the grave consequences of any serious predictive error, from more lives lost to social or economic systems collapsing. Even businesses that owe their existence to uncertainties, such as insurance, need statistical predictability to operate and survive.
With so much at stake, it is hard to maintain a dispassionate perspective and to remember that, as far as data science is concerned, information and uncertainty—or equivalently, signal and noise—are merely the two sides of the same coin: variation. Even the most exquisite description of a person’s facial features would provide little information for distinguishing that person from an identical twin. Yet I would argue that as the stakes of prediction and inference rise, so too does the necessity for data scientists to maintain this fundamental perspective with respect to variation. Data science relies on the discriminating power provided by variations in the data to function, in contexts ranging from linear regressions to deep learning. But data themselves do not tell us which variation is signal and which one is noise. Indeed, what is noise for one study (e.g., population behavior) can be signal for another (e.g., for individualized treatments), and vice versa.
Therefore, for any given study at hand, a paramount task for data scientists is to determine judiciously which parts of the variation are the information we are seeking, and which parts represent uncertainties that we must take into account. This determination requires both domain knowledge and data acumen, which in turn requires a profound appreciation of this ‘two-sides’ perspective. Once this perspective has been internalized, it can constantly remind us of how easy it is to misclassify or overfit, literally and figuratively, especially when we are driven by passion, anxiety, or even fear. With so much at stake, who would prefer to delay effective treatments or vaccines for COVID-19? But all experienced and responsible (data) scientists are keenly aware that rushed studies are far more likely to mix noise and signal, or to grossly misrepresent (or irresponsibly promote) potentially fatal uncertainties. Dissecting and discerning variations is a game of chance played with God or Nature, neither of which is known to forgive those who don’t take the time necessary to learn the rules of engagement.
I am therefore very grateful to all the authors of this issue of HDSR, who collectively provide an extremely rich and diverse set of articles that help to deepen our thinking and broaden our minds with respect to the increasingly perplexing world of variation; this increase is due to the purposeful introduction of variations, either for noble causes, such as protecting data privacy, or villainous ones, such as misinformation. For example, for years, my go-to example for illustrating the difficulty of separating signal from noise was the proper quantification of climate change, because its signal-to-noise ratio is—thank God—so small. Temperatures at most locations can vary by double digits in just one day, and yet if the average global temperature were to increase by one degree Celsius per decade, HDSR would not see its centennial. However, the article in the Recreations in Randomness column by Jason Bailey on predicting the outcome of art auctions provides an even more extreme example. Which part is the signal when a banana duct-taped onto a wall can be auctioned for $120,000-150,000—the banana, the duct tape, and/or the wall? Of course, it’s the artist, you may say. But how would you predictively price this artist’s next work, keeping in mind that the level of outrage is simultaneously unknown and instrumental to the actual selling value?
Dan Bouk’s historical account of the U.S. Census, especially the story about counting birth nation in the 1920 census for those born in Alsace-Lorraine, is another article that enriched my collection of variations in uncertainty itself. Bouk reminds us that the concept of ‘uncertainty’ goes far beyond those uncertainties which can be captured meaningfully by various metrics for ‘errors.’ Uncertainties caused by territory disputes are, perhaps most ironically, the kind of problem that possesses no ‘ground truth,’ and hence there is no ‘error’ to speak of. The problematic designation of ‘Germany’ or ‘France’ as birth nation affected over one million people; such figures matter because (future) immigration quotas were set in proportion to the countries of origin counted by the 1920 census. “As a result,” Bouk points out, “the statisticians’ excision of uncertainty took on enormous political and practical significance.”
Exactly a century later, the US Census Bureau is tackling another issue of variation that will have enormous political and practical significance. As briefly summarized in my last editorial, the 2020 US Census will be protected by differential privacy, which is to say that a controlled amount of random noise will be introduced into the census counts before their public release. Although the 2020 US Census only kicks off officially this April, the leading article in this issue by Michael Hawes highlights many lessons already learned from building and testing the differential privatization mechanism employed in this context. A grand challenge here is how to introduce proper amounts of variation that help dilute private individual information while not unduly destroying the valuable aggregated information that is crucial for policy making and for research projects that rely on the census data. It is an extremely daunting task, exactly because of the interwoven nature of signal and noise as represented by variations. There are many challenging problems, ranging from ethical considerations to algorithmic implementation and to effective communication with diverse stakeholders. Hawes’ article calls for the broad engagement of the data science community in this matter, and I certainly encourage interested readers to be in touch with the US Census Bureau via the email address listed in the article.
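To make the trade-off between privacy and aggregate accuracy concrete, here is a minimal sketch of the textbook Laplace mechanism applied to a small table of counts. It is an illustration only and not the Census Bureau's actual procedure: the block names, the counts, the privacy-loss values `epsilon`, and the helper names `laplace_noise` and `privatize_counts` are all hypothetical, and the sketch assumes each person is counted in exactly one cell.

```python
import random

def laplace_noise(scale, rng):
    """Laplace(0, scale) draw: the difference of two independent exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def privatize_counts(counts, epsilon, rng):
    """Add Laplace noise with scale 1/epsilon to every cell of a histogram.

    Assumes each person falls in exactly one cell, so adding or removing one
    person changes the histogram by at most 1 (sensitivity 1).
    """
    scale = 1.0 / epsilon
    return {cell: n + laplace_noise(scale, rng) for cell, n in counts.items()}

rng = random.Random(2020)
# Hypothetical true counts for three census blocks of very different sizes.
true_counts = {"block_1": 3, "block_2": 40, "block_3": 5000}

for epsilon in (0.1, 1.0):
    noisy = privatize_counts(true_counts, epsilon, rng)
    print(epsilon, {cell: round(v, 1) for cell, v in noisy.items()})
```

The noise has the same scale in every cell, so under these assumptions a count of 3 can be swamped while a count of 5000 is barely affected in relative terms; a smaller `epsilon` buys more privacy at the cost of noisier counts everywhere.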
Algorithmic fairness is another consideration that potentially carries enormous political and practical significance. The difference between respecting and ignoring variations in this context can be the difference between doing good and doing harm. In popular media, ‘fairness’ is often portrayed as ‘equality.’ When I graduated from Fudan University, our entire graduating generation was given exactly the same monthly stipend regardless of what our jobs were, how well each of us did, what our needs were, etc. Whereas I doubt that many would consider that kind of forced ‘equality’ to be fair, when it comes to setting fairness metrics for computer algorithms or statistical procedures, more people fall into the trap of ‘fairness = equality,’ demanding various equalities that are either mathematically impossible, or can do real harm to those that are meant to be protected.
Romano, Barber, Sabatti, and Candès’ article, “With Malice Toward None: Assessing Uncertainty via Equalized Coverage” aptly reminds us that mathematical equality can be desirable as a fairness measure, but it can also be wishful thinking or even be harmful. When we present a procedure for constructing a 95% predictive interval within a population, then it should have at least approximately the claimed 95% coverage, regardless of the sub-populations to which we apply it. This is not just being fair. It is about quality control of our algorithms and procedures, and delivery of promised goods. By analogy, if a pharmaceutical company posits that a treatment for COVID-19 is 70% effective for the adult population without any further specification, then we expect that it be effective approximately 70% of the time for any adult age group. If it turns out to be effective only for 30% of patients over age 65, then the 70% claim is deceptive; and if such deception is done intentionally for profit, then it is an unethical, even criminal, act. That is, even without the fairness considerations relative to the task of protecting various sub-populations, we should aim for guaranteeing the declared coverage conditioning on their attributes. This is the well-understood conditioning principle in statistics, which makes our results more relevant for the study at hand. Romano, Barber, Sabatti, and Candès’ article shows how to achieve such guaranteed conditional coverage (approximately) via conformal prediction.
Romano, Barber, Sabatti, and Candès’ article further discusses the issue of lacking “equal length.” Any probabilistically principled procedure that equalizes the coverage will lead in general to varying lengths of the intervals for different sub-populations. Instead of demanding equalized length in addition to equalized coverage, the article suggests that a sensible goal is to understand and reveal the reasons for the varying lengths, rather than artificially masking them. Anyone with basic knowledge of predictive distributions would immediately realize that, generally, it is mathematically impossible to require different predictive distributions to produce predictive intervals that come with identical coverage and length. Only when the predictive distribution is invariant to sub-populations can we achieve simultaneous equalization in length and coverage at any level. But if the invariance holds, then the sub-population identities carry no information for prediction, and hence, they should not be used in the first place, regardless of the fairness considerations.
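The tension between equal coverage and equal length can be seen in a small simulation. The sketch below is a simplification and not the procedure in Romano, Barber, Sabatti, and Candès’ article: it applies plain split-conformal calibration to absolute residuals, once pooled and once within each group. The two groups, their noise levels, the sample sizes, and the function names are all assumptions made for illustration.

```python
import math
import random

def conformal_quantile(scores, alpha):
    """Split-conformal cutoff: the ceil((n + 1)(1 - alpha))-th smallest score."""
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this level
    return sorted(scores)[k - 1]

def abs_residuals(sd, n, rng):
    """|y - prediction| when the predictor is right on average with noise sd."""
    return [abs(rng.gauss(0.0, sd)) for _ in range(n)]

def coverage(scores, cutoff):
    """Share of held-out points whose interval (prediction +/- cutoff) covers y."""
    return sum(s <= cutoff for s in scores) / len(scores)

rng = random.Random(0)
alpha = 0.10                      # target 90% coverage
noise_sd = {"A": 1.0, "B": 3.0}   # hypothetical groups; B is harder to predict
calib = {g: abs_residuals(sd, 2000, rng) for g, sd in noise_sd.items()}
holdout = {g: abs_residuals(sd, 2000, rng) for g, sd in noise_sd.items()}

# One cutoff for everyone: equal length, unequal coverage.
pooled_cutoff = conformal_quantile(calib["A"] + calib["B"], alpha)
# One cutoff per group: (approximately) equal coverage, unequal length.
group_cutoff = {g: conformal_quantile(calib[g], alpha) for g in noise_sd}

for g in noise_sd:
    print(g, "pooled:", round(coverage(holdout[g], pooled_cutoff), 3),
          "half-width", round(pooled_cutoff, 2))
    print(g, "by group:", round(coverage(holdout[g], group_cutoff[g]), 3),
          "half-width", round(group_cutoff[g], 2))
```

Under these assumptions the pooled cutoff gives both groups intervals of identical length but over-covers group A and under-covers group B, whereas calibrating within each group restores roughly 90% coverage for both at the price of visibly wider intervals for B.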
That’s exactly how variation works. It does not self-censor or obey our wishes. It does not conveniently appear when it can provide beneficial information, and then disappear quietly when it may reveal ‘inconvenient truths.’ If we revisit the hypothetical treatment example above, it may indeed be the case that 70% effectiveness could be achieved for those who are above 65, as long as double dosages are given, and that said dosage is explicitly called for by the treatment, and that, if we use the single dosage, we would only achieve a 30% success rate for elderly patients. In such cases, I surmise that few would insist on single dosage for the elderly or double dosage for everyone in order to equalize the dosages for all age groups, as either equalizing act would be medically harmful because it does not respect the biological variations. Whereas we can always work harder to seek effective treatments with minimal dosages, blindly insisting on equalizing everything without understanding the sources of variations is counterproductive, and it does little or nothing for the betterment of science or society. The only purpose it may serve is to remind us time and again of the critical importance of general education aimed at understanding and appreciating variations, so that fewer people would knowingly equate efforts to ensure societal fairness with nonsensical demands for technical equality.
An effective, and arguably the only coherent, way to depict and contemplate variations is through probabilistic reasoning, which represents objective frequencies or subjective uncertainties in terms of distribution. Nathan Sanders’ timely article, “Can the Coronavirus Prompt a Global Outbreak of ‘Distributional Thinking’ in Organizations?”, suggests as much. Sanders defines “distributional thinking” as a major part (but not all) of statistical thinking, which involves understanding inferential and predictive results as distributions (e.g., posterior distribution, confidence distribution, and fiducial distribution) instead of as point or interval estimators. Broader thinking along these lines would encourage investigators and policy makers to take into account important issues that are best seen via the lens of distribution, such as dependence among factors, impact of assumptions, and uncertainty assessment and communications.
Sanders reflects on examples from COVID-19 to argue that distributional thinking is necessary for organizations; he points to social justice movements for lessons on how to propagate this mindset. This emphasis on framing collective thought and action is echoed by another article in the Stepping Stones section, Thomas Davenport’s call for building and educating data science teams instead of wasting resources on seeking the jack-of-all-trades “unicorn” data scientists. Training a data science team is a major undertaking, especially as the data science community is still at its big-bang stage. Davenport discusses the need for classification and certification, which he argues are the only scalable ways for business organizations to assemble skillful data science teams successfully.
To address the lack of society-wide classification and certification systems, Davenport recognizes some initial efforts that have already been made, including the Initiative for Analytics and Data Science Standards (IADSS). Fortunately for HDSR, the IADSS Initiative co-leaders, Usama Fayyad and Hamit Hamutcu, wrote the article that describes IADSS’s mission and reviews a classification system of knowledge and skills for analytics and data science professionals. This classification system informs a proposal for a Data Science Knowledge Framework, which would be employed in establishing industry standardization and building assessment methodologies for data professionals. The idea is to develop a more principled approach towards understanding what data science is about (in the business world) and avoid the increased confusion caused by many people and projects with the label of ‘doing data science.’ Not surprisingly, Fayyad and Hamutcu’s and Davenport’s articles put business understanding and knowledge high on their respective lists of essential skills and activities in data science for business applications. This reminds us once again that the judicious determination of the boundary between signal and noise in the perplexing world of variations typically requires deep domain knowledge.
The call for teamwork and domain knowledge is also the theme of Goldstein, LeVasseur, and McClure’s article on the convergence of epidemiology, biostatistics, and data science from a curriculum design and pedagogical training vantage point. They review and compare curricula of master’s-level epidemiology, biostatistics, and data science degree-granting programs at the top-ranked public health programs (according to the U.S. News and World Report in 2019), and conclude, “Collaboration and cross-training offer the opportunity to share and learn of the constructs, frameworks, theories, and methods of these fields with the goal of offering fresh and innovate perspectives for tackling challenging problems in health and health care.” The COVID-19 pandemic vividly demonstrates the need for and power of cross-discipline collaborations on the global scale, where epidemiologists, statisticians, and, more broadly, data scientists play indispensable yet collaborative roles.
While colleges and universities are essential in providing statistical and data science education, we need to introduce the concepts of variation, uncertainty, and distributional thinking into early childhood education in order to fully prepare citizens of the digital age at scale. This may sound completely unrealistic, and currently it is, not because children lack the cognitive ability to learn these concepts, but rather because teachers may lack familiarity with and mastery of these concepts. Teaching students how to draw histograms may be easier than teaching mathematics, because the former is visual and participatory. Activities such as collecting data from classmates (e.g., how many books one has) for the purpose of drawing histograms can be integrated into engaging social programs. A fundamental defect with the current curricula is that we consistently inculcate our children exclusively with deterministic thinking during their most formative years of brain development. Only after they become fluent with deterministic thinking do we tell them, “Oh, by the way, the world is full of uncertainty and you just need to deal with that.” The massive shortage of teachers qualified to teach distributional thinking testifies to our current failure to institute its practice, which is one more reason that we need to rethink our curricula from the ground up.
To encourage all of us to think about teaching distributional thinking from day one, in this issue we launch a new column, Minding the Future, which targets pre-college students and their teachers and parents, as well as any novice to data science. In the inaugural article, Column Editor Nicole Lazar explores the big scene of data science and its growing role in our daily lives. The column aims to inspire young minds and those who are new to data science to see data science as a necessary intellectual pursuit for educated citizens of the digital age, regardless of their particular interests or career aspirations. In future issues, this column will feature articles on how effectively to design and teach pre-college curricula in data science.
Given the ever-evolving nature of the global data science community, it should come as no surprise that there are very large variations in every possible aspect of the data science endeavor. As such, to determine which part is signal and which part is noise is an extremely daunting, if not impossible, task. The article by Gregory, Groth, Scharnhorst, and Wyatt, “Lost or found? Discovering Data Needed for Research” tells a vivid tale and marks a brave incursion into Dataland. Data are the fuel of data science, obviously. Far less obvious are the commonalities in the ways in which researchers and users find data across different disciplines, environments, and cultures, when they cannot collect data on their own. Such questions are important for designers of data discovery systems and repositories, as well as for open science in general. How would one even think about collecting data to answer such questions? What is the sampling frame for ‘data users’ on the global scale?
This article took on this challenge by painstakingly conducting the largest known survey of its kind, obtaining over 1650 responses globally. These responses provide rich data for qualitative analysis and descriptive statistics, on the basis of which the authors demonstrate very large variations in practice, or as the article emphasizes, “Diversity is normal, not abnormal.” The article provides readers with guidance and caution on digesting the nature and extent of the reported discoveries, because the responses represent about a 1% response rate to the global survey. Anyone familiar with why (and when) survey sampling works would immediately worry about the potentially extreme selection bias that arises from missing 99% of the responses. They should, and having worked on survey nonresponses and missing data in general since my Ph.D. thesis (may I say a century ago?), I’m among the first to jump on the issue of selection bias any time I see a survey, large or small.
But it is worthwhile to emphasize that the issue is not the 1% or any rate per se. After all, most surveys we rely on represent far less than 1% of the population that we target; a sample size of 1650 would be considered quite large for almost any kind of opinion survey, including surveys predicting how hundreds of millions of people would vote. The key question is whether the variations revealed in our sample capture the variations in our target population. However, we also need to be careful about what we mean by ‘capture.’ In seeking to be ‘representative,’ we typically seek to ensure that the empirical distributions (e.g., histograms) from our sample resemble the population distributions, which is the kind of representativeness likely to be destroyed first by nonresponses or other kinds of selection mechanisms (e.g., selective testing for COVID-19).
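A small simulation can illustrate why the selection mechanism, rather than the rate, is what matters. The sketch below is hypothetical throughout: it assumes a population of 200,000 people split evenly on a yes/no opinion, and it assumes that people holding the ‘yes’ opinion are three times as likely to respond to a self-selected survey. None of these numbers comes from the survey discussed above.

```python
import random

rng = random.Random(2020)

# Hypothetical population: about half hold opinion 1, half hold opinion 0.
N = 200_000
population = [1 if rng.random() < 0.5 else 0 for _ in range(N)]
true_share = sum(population) / N

# (a) A simple random sample of 1650, i.e. under 1% of the population.
srs = rng.sample(population, 1650)
srs_share = sum(srs) / len(srs)

# (b) A self-selected sample with roughly a 1% response rate overall,
#     where the chance of responding depends on the opinion itself.
respond_prob = {1: 0.015, 0: 0.005}  # assumed response probabilities
respondents = [x for x in population if rng.random() < respond_prob[x]]
self_selected_share = sum(respondents) / len(respondents)
response_rate = len(respondents) / N

print("true share of opinion 1:    ", round(true_share, 3))
print("simple random sample:       ", round(srs_share, 3))
print("self-selected respondents:  ", round(self_selected_share, 3))
print("self-selected response rate:", round(response_rate, 4))
```

Both samples cover around 1% of the population, yet under these assumptions only the random sample lands near the true share; the self-selected one drifts toward the opinion that makes people more likely to respond, and collecting more responses of the same kind would not correct it.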
There are, however, a variety of complicated situations in which we do not even know what the possible variations are, or, in technical terms, what the ‘state space’ is, let alone the distribution that sits on it. We may not know the range of the varieties of practice, the nature of diverse opinions, or the full range of considerations that we should anticipate. In such cases, as long as most possible states are captured in our sample, the mismatch between their frequencies in our sample and in the population (e.g., an extreme opinion may appear more often in a sample than in the population) might do much less harm than we fear, as long as we explicitly acknowledge that we are aware of the potential mismatch, so we don’t misrepresent an extreme view as a common opinion. In such cases, exploratory studies that employ mixed qualitative and quantitative approaches are essential, providing much food for thought and directions for further studies, as demonstrated in Gregory, Groth, Scharnhorst, and Wyatt’s article.
The challenges of studying data reuse are well reflected in another article. Lisa Federer summarizes the findings from an interactive workshop on data citation and metrics for data reuse, another form of exploratory study that aims at identifying the state of variations. Whereas the socially constructed system of standards and metrics for research (article) citations is well-established, the data science community is still at the stage of collecting recommendations from various stakeholders about how to set up commonly acceptable standards and metrics for data citation. The very fact that Federer’s article documents 68 recommendations indicates the scope and complexity of this task; there are so many recommendations precisely because of the large variations in the stakeholders’ interests and needs. It is also another telling reminder that data science is a complex ecosystem, since the stakeholders here include publishers, researchers, funders, repository administrators, librarians, industry and government users, etc.
Such variations and the associated challenges are by no means restricted to the issue of reusing data collected by others. They are ever present in endeavors to collect data on one’s own. Even those of us who work intensively on data collection and data quality in a variety of fields have our blind spots when surveying the vast landscape of data collection. In reading Horrell, Reynolds, and McElhinney’s article, “Data Science in Heavy Industry and the Internet of Things,” I encountered a kind of data collection challenge that I had never heard of before, despite my 30 years of experience in working with a large number of specialists, from astrophysicists to psychiatrists. A distinctive feature of heavy industry is that heavy equipment may contain hundreds or even thousands of sensors, and collecting data reliably from these sensors is not a trivial matter, especially for older non-mobile machines. It often involves retrofitting old hardware to enable the transmission of data to the cloud. This can create data gaps, cause wrong orders of data arrival (due to delay in transmitting), and worst of all, result in the so-called ‘non-ignorable missing-data mechanisms’ because data are missing precisely when the machines behave erratically. This is just one of the ten data and implementation challenges listed in the article. Given that the “stereotypical environment for a Data Scientist is decidedly not heavy industrial,” chances are that a deep dive into this article will enrich your appreciation of the large variations within the data science community itself.
Katie Malone and Rich Wolski’s article on the open-source software community is another eye-opener for me. It provides yet another vital look at the data science ecosystem while examining some issues parallel to those explored in Federer’s article. For example, software citation is even less a common practice than data citation, yet it is no less important or complex an issue, involving essentially the same set of stakeholders, such as researchers, funders, repository administrators, and librarians. Malone and Wolski repeatedly emphasize that while the open-source software community is an ecosystem in and of itself, with its own evolutionary incentives and equilibrators in force, it is a delicate one given that it is almost entirely based on volunteers. Studying and understanding the open-source ecosystem requires an approach as nuanced as that required to study the data reuse ecosystem, as Gregory, Groth, Scharnhorst, and Wyatt explore in their article. Again, such study and understanding requires a blend of qualitative and quantitative approaches. Most importantly, proper understanding of our roles in the ecosystem can help us to support and contribute to the community as a whole. Such support and contributions are essential to “keep our field healthy for years to come,” as Malone and Wolski emphasize.
The current COVID-19 pandemic provides a penetrating (and painful) reminder that there is nothing more important than health, whether for individuals, organizations, or entire societies. I therefore echo Malone and Wolski’s emphasis by thanking all of HDSR’s authors, editorial team, reviewers, and readers for your great support and contributions, which are vital to keep HDSR as healthy as the data science ecosystem itself, for generations to come.
This editorial is © 2020 by Xiao-Li Meng. The editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the authors identified above.