Issue 2.1 / Winter 2020
In the late morning of January 1st, 2000, I sat on the floor of my Chicago office, surrounded by piles of folders, books, and journals. The tediousness of housecleaning for the millennium ahead was just what I needed to distract me from my Y2K anxiety. It took four days to make the piles—and my anxiety—disappear. No massive or serious network failures occurred during that time.
You might be among those who believe that the avoidance of failure was the result of preventive effort due to the awareness (and fear) of Y2K or, alternatively, among those who think the risk of failure was grossly exaggerated in the first place. Regardless of your judgment, I welcome you to the first issue of HDSR in 2020. This is yet another anxiety-inducing year, and I am not even talking about predicting elections and other political outcomes around the globe.
Within the data science community, 2020 induces several Y2K-ish anxieties. One of them is discussed in this issue, while others will be the subjects of the next few issues. The United States Census Bureau has announced that it will endorse differential privacy as the disclosure avoidance mechanism for the 2020 Census. In layperson’s terms, this means that census counts will be infused with random noise prior to release. The aim is to reduce (substantially) the chances that sophisticated hackers can recover individual data from the reported aggregate counts. But as always, there is no free lunch. The injection of noise makes the data less useful and, if not analyzed and interpreted properly, misleading. As emphasized in the leading discussion article by Provost Teresa Sullivan, ensuring reliable Census data underpins “our democracy (and republic),” a point echoed and expanded upon by nine scholars and census experts from several countries. (The use of differential privacy in one country can obviously be tried in others.) Indeed, conducting reliable decennial censuses is mandated by the US Constitution. Why, then, is the Census Bureau allowed to add noise to essentially every data point it produces?
The answer is not hard to find. The law also requires that the Census Bureau protect data privacy, a requirement that I am sure a vast majority of us appreciate. Even if there were no such legal requirement, it would still be essential for any census bureau or data collector to ensure as much privacy as possible, for the very purpose of improving data quality and quantity. How many of us have ignored surveys and opinion polls in the past, even in the absence of privacy concerns? Humans are the ultimate dilemma creators: we like to know more about others, and we dislike others knowing more about us. But others’ others are us. The Census Bureau, therefore, has a mathematically impossible task: to produce accurate and privacy-protected data.
Neither zero privacy nor 100% privacy—that is, zero information—is acceptable, which means that we must seek a compromise between these two extremes. Differential privacy offers just such a framework, requiring that the difference in our statistical summaries between including and excluding a single individual remain probabilistically within a pre-specified threshold. This threshold quantifies the trade-off we are willing to make, that is, how much privacy we are willing to sacrifice in order to retain the utility of the data. But this quantification is a daunting—if not impossible—task, since a sensible quantification must depend on the purpose for which the data are used. Yet the Census Bureau must release data regardless of how they will be used (and abused). It is therefore really a question for our society: collectively, how should we trade privacy for information? A first step towards such a decision is a general discussion and debate among social scientists, perhaps the largest community of users of census data.
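To make the noise-injection idea concrete, here is a minimal sketch of the classic Laplace mechanism for a counting query—a textbook illustration only, not the Census Bureau’s actual (and far more elaborate) implementation. The privacy-loss parameter `epsilon` plays the role of the pre-specified threshold described above: smaller `epsilon` means stronger privacy but noisier, less useful counts.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-transform sampling from a Laplace(0, scale) distribution.
    u = random.random() - 0.5  # uniform on (-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(true_count: int, epsilon: float) -> float:
    # A counting query changes by at most 1 when a single individual is
    # added or removed (sensitivity = 1), so adding Laplace noise with
    # scale = 1/epsilon satisfies epsilon-differential privacy.
    return true_count + laplace_noise(1.0 / epsilon)

random.seed(2020)
# Smaller epsilon -> more privacy, but a noisier released count.
print(private_count(12345, epsilon=0.1))
print(private_count(12345, epsilon=10.0))
```

The released count is the true count plus zero-mean noise, so aggregates over many noisy counts remain approximately unbiased, while any single count reveals little about any single individual.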
The article by Daniel L. Oberski and Frauke Kreuter, “Differential Privacy and Social Science: An Urgent Puzzle,” is thus extremely timely. It provides an overview of what social science can bring to the differential privacy conundrum as well as insights into the question of how differential privacy may impact social science. The negative impact is, of course, much to be expected, since adding noise tends to make the analysis more complicated and less accurate. However, the authors argue that differential privacy can also make the analysis less sensitive to individual observations, which can add robustness and stability to results. This in turn can reduce their sensitivity to assumptions that are often made for simplicity (e.g., for mathematical tractability) rather than based on reliable prior information or theoretical understanding.
In a nutshell, the introduction of differential privacy to the 2020 Census is generating both opportunities and challenges, and even a bit of fear. Some of these opportunities and challenges were captured in the inaugural symposium of HDSR, and plans for a special issue are underway. Given the ubiquity of the use of census data in our research and in life more generally, from building economic indices to dealing with gerrymandering, it’s not a risky prediction to say that discussions and debates on the use of differential privacy will go on until the next generation of privacy-protection paradigms arrives. But this is by no means the only debate in data science that will keep us busy in 2020.
Perhaps no statistical concepts or methods have been used and abused more frequently than statistical significance and the p value—so much so that some journals are starting to recommend that authors move away from rigid p value thresholds by which results are classified as significant or insignificant. The American Statistical Association (ASA) also issued a statement on statistical significance and p values in 2016, an unprecedented step in its nearly 180-year history. However, the 2016 ASA statement did not settle the matter, but only ignited further debate, as evidenced by the 2019 special issue of The American Statistician. The fascinating account by the eminent philosopher of science Deborah Mayo of how the ASA’s 2016 statement was used in a legal trial should remind all data scientists that what we do or say can have completely unintended consequences, despite our best intentions.
The ASA is a leading professional society for the study of uncertainty and variability. Therefore, the tone and overall approach of its 2016 statement are understandably nuanced and replete with cautionary notes. However, in the case of Scott Harkonen (CEO of InterMune), who was found guilty of misleading the public by reporting a cherry-picked ‘significant p value’ to market the drug Actimmune for unapproved uses, the appeal lawyers cited the ASA statement’s cautionary note that “a p value without context or other evidence provides limited information” as “compelling new evidence that the scientific theory upon which petitioner’s conviction was based [that of statistical significance testing] is demonstrably false.” I doubt the authors of the ASA statement ever anticipated that their warning against the inappropriate use of p values could be turned into an argument for protecting exactly such uses.
To further clarify the ASA’s position, especially in view of some confusion generated by the aforementioned special issue, the ASA recently established a task force on statistical significance (and research replicability) to “develop thoughtful principles and practices that the ASA can endorse and share with scientists and journal editors” within 2020. As a member of the task force, I’m particularly mindful of the message of Mayo’s article, and of the essentially impossible task of summarizing scientific evidence by a single number. As consumers of information, we are all seduced by simplicity, and nothing is simpler than conveying everything through a single number, which offers simplicity on multiple fronts, from communication to decision making. But, again, there is no free lunch. Most problems are simply too complex to be summarized by a single number, and concision in this context can exact a considerable cost. The cost could be a great loss of information or of validity in our conclusions, which are the central concerns regarding the p value. The cost can also be registered in the tremendous amount of hard work it may take to produce a usable single-number summary.
Such is the case with the GDP (Gross Domestic Product), an index we have all heard of many times, typically without a good understanding of the intense labor and hard choices involved in producing it. The inaugural article for the Effective Policy Learning column by Brian Moyer and Abe Dunn, respectively the Director and the Assistant Chief Economist of the U.S. Bureau of Economic Analysis, fills this gap. The article outlines the means by which data science can help to ensure that the next generation of economic statistics will be even more relevant, timely, accurate, and detailed. This new column is one of three HDSR debuts in this issue. Its dual aim is to equip policy makers with a better understanding of the power and perils of data science, and to inspire data scientists to consider the public sector as a vital place to employ their skills and maximize their impact.
Few data scientists would not like to maximize their societal impact. But to do that well, we data scientists must gain and sustain public trust, regarding the quality and beneficence of what we deliver, especially if it is a black box. David Spiegelhalter, Chair of the Winton Centre for Risk and Evidence Communication at the University of Cambridge, launches into this discussion with a question that is increasingly being asked by both data scientists and the public: “Should we trust algorithms?” The correct answer, of course, is “It depends!” Spiegelhalter proposes a four-phase evaluation structure to guide our assessment. He also emphasizes the distinction between the trustworthiness of claims made about an algorithm, and those made by an algorithm.
Perhaps not coincidentally, both kinds of trustworthiness arise in the discussion article by Cynthia Rudin, Professor of Computer Science, Electrical and Computer Engineering, and Statistical Science at Duke University, and her co-authors, Caroline Wang and Beau Coker. This paper casts new light on the fairness issues surrounding the COMPAS algorithm, which is used widely throughout the US criminal justice system—specifically, the paper considers fairness through the lens of transparency. Whether or not a judge trusts the claims (in the form of model scores) made by COMPAS has serious and, in some cases, even life-and-death consequences. But the lack of transparency means that the judge’s trust—or lack thereof—can never be fully informed. Rudin et al. demonstrate the danger of this lack of transparency by examining claims made about COMPAS by both its creator and its critics, finding statistical evidence to suggest that neither claim is beyond reasonable doubt. The six discussions by the creator of COMPAS, legal scholars, and data scientists, together with the authors’ rejoinder, are a must-read for anyone who wishes to understand the complex and critical enterprise of using data science in the justice system.
Schuemie et al.’s benchmark study on the trustworthiness of a host of methods for observational studies in healthcare will heighten understanding of another vital matter, even as it prompts alarm. In the article’s “Media Summary” (an experimental HDSR feature designed for technical articles), the authors report that “most methods are not reliable,” and that “more often than not, the known ‘true’ answer lies outside the confidence interval, despite the fact that such confidence intervals are typically designed to include that true answer 95% of the time.” This finding echoes the studies that led to the outcry for the scientific community to practice serious introspection regarding scientific reproducibility and replicability, as documented in Reproducibility and Replicability in Science, a study report by the National Academies of Sciences, Engineering, and Medicine (NASEM) of the United States. A future issue of HDSR will focus on this theme, because data science plays an indispensable role in scientific reproducibility and replicability. (For a disambiguation of the two concepts, consult the NASEM report.)
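What it means for a confidence interval to “include that true answer 95% of the time” can be made concrete with a small, generic simulation (a textbook illustration, not the benchmark study itself): draw repeated samples from a known distribution, form a nominal 95% interval for the mean each time, and count how often the interval actually contains the truth.

```python
import math
import random

def mean_ci(sample, z=1.96):
    # Nominal 95% normal-theory confidence interval for the mean.
    n = len(sample)
    m = sum(sample) / n
    var = sum((x - m) ** 2 for x in sample) / (n - 1)
    half = z * math.sqrt(var / n)
    return m - half, m + half

def coverage(true_mean=0.0, n=50, reps=2000, seed=1):
    # Fraction of repeated experiments whose interval covers the truth.
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(n)]
        lo, hi = mean_ci(sample)
        if lo <= true_mean <= hi:
            hits += 1
    return hits / reps

# When the model assumptions hold, empirical coverage is close to 95%.
print(coverage())
```

The benchmark study’s alarming finding is precisely that, for many observational-study methods applied to real healthcare data, the analogous empirical coverage falls well below the nominal 95%.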
I must add that to ensure that these two studies themselves are trustworthy, both articles were seriously scrutinized by multiple experts, and both underwent unusually extensive revisions to ensure high quality in both the content and presentation. Although it is against my professional ideology to guarantee or state almost anything with 100% certainty, I am deeply grateful to the authors and the reviewers for their time and effort. I hope that all data scientists would be willing to make similar efforts to ensure maximal trustworthiness of their findings and products.
As summarized in my previous editorials, data science is an ecosystem consisting of many diverse, contributing communities. It is therefore essential to build alliances and communications among different communities to ensure a harmonious co-evolution, and critical to build interoperability into the infrastructure on which data-driven research depends, or at least to prevent unnecessary barriers, unproductive frictions, and redundant efforts. The Research Data Alliance (RDA) is one such international organization, started by eight founders in 2013 and now consisting of over 9,400 members from more than 130 countries and regions. It is a community-driven organization “dedicated to the development and use of technical, social, and community infrastructure promoting data sharing and data-driven exploration.” The interview of Francine Berman (one of the RDA’s founders and a former co-chair of its leadership council), conducted by Mercè Crosas (Harvard’s Chief Data Officer and an advisor to HDSR), provides a rich account of the history of the RDA and its future plans, while delineating the benefits and challenges of building a community-based alliance, especially on a global scale.
A key challenge to building multi-community alliances is the fact that different communities employ different languages and metrics. This phenomenon is most vividly documented in the article by Katie Malone for Active Industrial Learning, the second inaugural column devoted to translating business concerns into the language of data science, and vice versa, as well as to empowering data science leaders at companies of all sizes and data maturity levels. Katie Malone—Column co-editor, Director of Data Science at Tempus Labs, and co-host of Linear Digressions (a weekly podcast on data science)—urges both data scientists and business stakeholders to avoid the fate of things being ‘lost in translation’ by revising their respective metrics. Malone aptly remarks that “the metrics which quantify outcomes are generally very different for data scientists and business stakeholders, making it likely that each side struggles to understand and speak in terms that are familiar to the other side.” While she acknowledges that this task presents challenges, Malone suggests that the prospective benefits merit the investment of time and effort entailed.
An effective tool for facilitating communication and accomplishing this task of metric revision is data visualization. In the column Diving into Data, HDSR’s Data Visualization Editor Antony Unwin explains the great importance of data visualization. In showing what matters in data visualization, he provides an accessible and refreshing tutorial on using visualization in data science. For example, contrary to the common idiom that “a picture is worth a thousand words,” Unwin emphasizes that “A picture is not a substitute for a thousand words; it needs a thousand words (or more),” because informative and scientific communication via data visualization requires knowing the context and source of the data, how and why they were collected, and so on. Accurate and succinct communication is as difficult as it is essential and rare, with or without visualization. This is why there is an increasing emphasis on both data visualization and communication skills in data science curricula at all levels, an emphasis HDSR will reinforce in future issues.
Few would doubt that we need more research and development of coherent data science curricula at all levels. The perspective article on data science education by Rafael Irizarry, a leading biostatistician and data science educator (who has taught 20 courses and programs via HarvardX), is therefore very timely. Irizarry emphasizes that “data science is not a discipline but rather an umbrella term used to describe a complex process involving not one data scientist possessing all the necessary expertise, but a team of data scientists with non-overlapping complementary skills.” He then makes recommendations on how to take this understanding of data science as a process into account when designing data science curricula. Irizarry’s list of recommendations is by no means exhaustive, or even constitutive of a minimum set, which itself is growing with the evolution of the data science ecosystem. But his recommendations clearly address some fundamental needs (e.g., real-world experience, practical programming skills) of any data science curriculum.
In a similar vein, Keller et al., drawing on their experience at the Biocomplexity Institute & Initiative at the University of Virginia, propose a framework for the essential steps entailed in doing data science effectively, illustrating them in a case study. These steps include problem identification, data discovery, data governance and ingestion, and ethics. Again, this list is not exhaustive, as compared, for example, to the collection from the trio of perspectives—humanities, social sciences, and STEM—in the inaugural issue of HDSR (Borgman; Leonelli; and Wing). However, it provides essential building blocks for a comprehensive and effective approach to discovering, accessing, repurposing, and statistically integrating data of all sizes and types, which is a central aim of Keller et al.
With respect to the process of doing data science, the authors identify a central challenge faced by many consumers of data science research: a lack of data acumen. This observation should reinforce the importance of treating the acquisition of data acumen as a central focus of undergraduate education, as articulated in the interview of the co-chairs of NASEM’s report on undergraduate education in data science in the inaugural issue of HDSR. Keller et al. also conceptualize three levels of data acumen, corresponding to the three learning groups defined in Harvard Provost Alan Garber’s opening editorial in the inaugural issue of HDSR, as well as to the three skill sets discussed in KNIME CEO Michael R. Berthold’s article in the second issue of HDSR. Collectively, these articles provide further evidence of the ecosystemic nature of data science, especially pertaining to the co-evolution of data science research and education. To maximize the mutual benefit of this co-evolution, we need to prioritize ongoing efforts to build ever tighter connections between the research and education communities in data science. We emphasize the plural ‘education communities,’ as we need to consider education at all levels, from the playroom to the boardroom, in order to prepare future generations fully as educated citizens of the digital age.
Of course, our collective goal in advancing data science research and education is the betterment of our global human society. Among all indices of human happiness, nothing is more fundamental than human health. The recent world-wide fear of the spread of the Wuhan coronavirus is a vivid reminder of how health concerns can trump all else and affect everyone, literally. The quest for effective treatment and heightened longevity has always been a central pursuit of human endeavors. The arrival of big data has created a lot of hope (and hype) regarding the prospect of better medical treatments, especially personalized ones. Few of us would not like treatments specifically designed for us, but any educated digital citizen must wonder how anyone could conclude that a treatment would be effective for me without it ever having been tested on me? After all, each of us is unique (hopefully), so where on earth did whoever tested the treatment find enough guinea pigs exactly like me?
This, of course, is not a new problem. The Mining the Past column co-editor, Christopher Phillips, provides a succinct and informative account of the history of precision medicine (a more apt term than personalized medicine), with a refreshing perspective on how biostatisticians and data scientists played a critical role in its development. The roles of data science in the medical, life, and health sciences will undoubtedly increase, but it is essential to learn from the past with a sense of both triumph and trepidation.
To end this issue on a lighter note, the third inaugural column, Recreations in Randomness, features an article by Ben Zauzmer, a baseball analyst for the Los Angeles Dodgers and a noted ‘Oscarmetrician.’ Zauzmer explores what support can be provided by data science for the empirical observation that Oscar-winning movies tend to be released closer to the date of the awards ceremony. Is that just conspiratorial trivia or something much more profound? This is one of a series of column articles that are designed to engage the general public, via ‘happy topics’ such as the arts, pastimes, and hobbies, as well as games and sports, to better appreciate the value—and enjoyment—of proper reasoning under uncertainty. After all, the most effective and sustainable learning environment for most of us is the one in which education is inseparable from enjoyment.
Although it is only one month into 2020, I am already looking forward to Christmas, when I can sit on the floor again, this time surrounded by piles of HDSR articles and submissions (sorry, I love trees, but I am over 50). The excitement of surviving a very busy year will distract me from my anxiety: did HDSR do everything it could have done to capture a milestone (and millstone) year of data science? Of course, with the data you (will) provide, my anxiety can be allayed by evidence rather than by sitting on a floor for another four days.
Willing to provide feedback? The Bits and Bytes column is always looking for inspiring Letters to the Editor…
This editorial is © 2020 by Xiao-Li Meng. The editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the authors identified above.