Issue 2.3 / Summer 2020
If you do not have your list ready, asking, “What do you mean by ‘data science’?” might buy you some time, but it would also invite me to invite you to take a look at the 94 pieces published during the first (academic) year of HDSR. Perhaps “What kind of challenges do you have in mind: Theoretical, methodological, practical, or educational?” would be a safer question. Not really, because my invitation is invariant to the focus of your question: “All of the above! Take a look at the 94 pieces…”
HDSR of course is just a window—and a portal for all aspiring data scientists—to the expanding and evolving world of data science. My excitement over the 94 pieces is not inspired merely by HDSR’s vitality and vibrancy as it enters its sophomore year, but by its consequential reminder of the grand opportunities and challenges ahead of all of us. I am therefore deeply grateful to Jeannette Wing, the Avanessians Director of the Data Science Institute and Professor of Computer Science at Columbia University, for contributing her “Ten Research Challenge Areas in Data Science” to HDSR. As Jeannette emphasized, her list is not meant to be complete or top 10, but rather “a good ten to start the community discussing what a broad research agenda for data science might look like.”
For instance, what are the areas that require—or benefit substantially from—the guiding principles, theoretical framing, methodological insights, historical development, and domain emphases from multiple contributing disciplines of data science? Which ones would survive unscathed the metaphorical straitjacket imposed when coerced into a single discipline?
Data privacy is one such area. Data are born to reveal, but privacy requires us to conceal. Right there, we see how humans have created a permanent headache for ourselves. For those of us who cannot do away with headaches, figuratively or literally, we know too well the necessity of seeking help from multiple sources to mitigate and to move forward. A most succinct indication that data privacy research is an inherently multidisciplinary area was provided by my colleague Cynthia Dwork, a leading authority on differential privacy (DP). I once asked Cynthia why DP was defined by treating data as deterministic, a thought that would disqualify any statistician. Cynthia smiled, “You statisticians worry about averages. We computer scientists worry about the worst-case scenarios.” The profoundness of this contrast outmatches its wittiness.
Computer scientists need to guard against the worst-case scenarios, not only because it only takes a single ‘weakest link’ to breach privacy, but also because they need to foresee the unforeseeable: the risk of breaching the privacy of any future data by releasing the current data, and vice versa. Since we have little idea about all the data that will be released in the future, this thought experiment alone should convince us that we need our privacy guarantee to work for the actual data set we release—not just in some hypothetical average sense—with respect to either the private data or the specific privatization implementation we employ.
In contrast, many common statistical procedures, such as confidence intervals, are useful precisely because they trade the deterministic guarantee—that is, a property that must hold for the actual data at hand—for probabilistic guarantees. A 100% confidence interval has a deterministic guarantee; it will cover the truth for every single data set at hand. But its universality comes with the price of uselessness: it simply includes every possible scenario and hence tells us nothing about where the truth is more likely to lie. In contrast, a 95% confidence interval eliminates infinitely many unlikely values to help us make a much more precise guess. But a guess is not a guarantee. We can never be sure, for any particular data set, if the interval covers the truth or not.
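The trade-off between a deterministic and a probabilistic guarantee can be checked with a small simulation. The sketch below is illustrative only: it assumes normally distributed data with a known standard deviation, and the function names, sample size, and number of repetitions are hypothetical choices, not taken from the editorial.

```python
import math
import random

def z_interval(sample, sigma, z):
    """Two-sided interval for the mean, assuming a known standard deviation."""
    mean = sum(sample) / len(sample)
    half_width = z * sigma / math.sqrt(len(sample))
    return mean - half_width, mean + half_width

def coverage(z, true_mean=0.0, sigma=1.0, n=30, reps=2000, seed=0):
    """Fraction of simulated data sets whose interval covers the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        lo, hi = z_interval(sample, sigma, z)
        hits += lo <= true_mean <= hi
    return hits / reps

# 95% interval: covers the truth in roughly 95% of data sets, never all of them.
cov_95 = coverage(z=1.96)

# "100%" interval: an infinite half-width covers every time and says nothing.
cov_100 = coverage(z=math.inf)

print(f"95% interval coverage:  {cov_95:.3f}")
print(f"100% interval coverage: {cov_100:.3f}")
```

The 95% interval is informative on average, but no single run of the loop reveals whether its particular interval was a hit or a miss.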
Intriguingly, however, DP’s deterministic guarantee is made possible by a probabilistic approach. The worst-case scenario guarantee is conceptualized and executed by injecting a controlled level of noise into the private data to dilute its information, and hence to control the increase in the probability of identifying private information that results from releasing the noise-infused data. Furthermore, the research area for data privacy does not stop at creating privatized data, which cannot be analyzed as if they were real data (although the concept of ‘real data’ itself is a tricky one, but that’s for another editorial). A host of methods in statistics can be applied here, such as the EM algorithm (treating injected noise as missing data), Bayesian methods (by integrating out the unknown noise level), de-convolution (when injected noise is additive, which usually is the case), etc.
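One standard way to inject "a controlled level of noise" is the Laplace mechanism applied to a count query. The editorial does not name a specific mechanism, so the sketch below is an assumption chosen for illustration; the function names, the privacy-loss value `epsilon = 0.5`, and the counts 100 and 101 are all hypothetical.

```python
import math
import random

def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as a difference of two exponentials."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with additive Laplace noise of scale sensitivity/epsilon."""
    return true_count + laplace_noise(sensitivity / epsilon, rng)

def laplace_density(x, scale):
    """Density of Laplace(0, scale) at x."""
    return math.exp(-abs(x) / scale) / (2 * scale)

epsilon = 0.5
rng = random.Random(1)
released = private_count(100, epsilon, rng)

# Worst case over possible outputs y: how much more likely is y when the true
# count is 100 than when it is 101 (two data sets differing in one person)?
scale = 1.0 / epsilon
worst_ratio = max(
    laplace_density(y - 100, scale) / laplace_density(y - 101, scale)
    for y in [90 + 0.25 * k for k in range(81)]
)

print(f"released count: {released:.2f}")
print(f"worst-case density ratio: {worst_ratio:.4f}, bound exp(epsilon): {math.exp(epsilon):.4f}")
```

The bound on the density ratio holds for every possible output, which is the worst-case flavor of the guarantee, even though any single released number is random.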
More importantly, probabilistic thinking and statistical acumen are essential for the analysts of privatized data. Having worked on missing data problems for three decades (but only recently on DP), I have accumulated sufficient diagnostic insights to tell you whether your method is likely going to suffer from an inflated variance (a measure of statistical noise that would diminish with more data) or from a positive or negative bias (a measure of systematic error that only becomes more pronounced with more data), once you tell me what unadjusted method you plan to use on the differentially privatized data (and what the DP process was). This kind of training and the associated toolkits are currently lacking outside of statistics, posing “an urgent puzzle” for the broad social science community in the United States and beyond, because soon the community will need to deal with large amounts of differentially privatized data from the 2020 US Census. (In addition to these two cited articles, HDSR has a forthcoming special issue on differential privacy for the 2020 US Census.)
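The distinction between inflated variance and bias can be illustrated with a toy regression. This is a sketch under stated assumptions, not the diagnostic reasoning described above: the injected noise is Gaussian rather than a specific DP mechanism, the true slope of 2 is made up, and `ols_slope` is a hypothetical helper.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

rng = random.Random(0)
n = 20000
xs = [rng.gauss(0, 1) for _ in range(n)]
ys = [2.0 * x + rng.gauss(0, 1) for x in xs]   # true slope is 2

# Additive noise standing in for a privatization step (assumed Gaussian).
noisy_xs = [x + rng.gauss(0, 1) for x in xs]
noisy_ys = [y + rng.gauss(0, 1) for y in ys]

slope_clean = ols_slope(xs, ys)          # unprivatized benchmark
slope_noisy_y = ols_slope(xs, noisy_ys)  # noisy outcome: more variance, no bias
slope_noisy_x = ols_slope(noisy_xs, ys)  # noisy covariate: shrunk toward zero

print(f"clean: {slope_clean:.3f}, noisy outcome: {slope_noisy_y:.3f}, "
      f"noisy covariate: {slope_noisy_x:.3f}")
```

With noise of the same variance as the covariate, the unadjusted slope is attenuated by a factor of about one half, and collecting more data does not remove that shrinkage; noise in the outcome only widens the sampling variability, which does shrink as n grows.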
A historical anecdote would further highlight the importance of having multiple disciplines and perspectives in developing DP. The idea of protecting privacy by injecting randomness dates back in the statistics literature to at least 1965, when Stanley Warner published "Randomized response: a survey technique for eliminating evasive answer bias" in the Journal of the American Statistical Association. As the title suggests, it was invented as a survey tool to reduce non-response bias because people tend to shy away from sensitive questions. The method is well-known in the statistics literature, and I have even included it in one of my courses as a way to get my students to respond if they had ever cheated on exams (but don’t ask me for the percentage, because it has been completely privatized by my memory). But statisticians did not develop it into a general method for protecting privacy in a database, because as a branch of statistical research it is isolated from the long statistical literature on disclosure limitation in public data files, such as census data.
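The logic of randomized response can be sketched in a few lines. The version below follows the spirit of Warner's design in simplified form: each respondent reports the truth with a known probability p and the opposite otherwise, and the analyst inverts that known distortion. The true proportion of 0.2, the value p = 0.75, and the function names are illustrative assumptions.

```python
import random

def randomized_response(truth, p, rng):
    """Report the true answer with probability p, its opposite otherwise."""
    return truth if rng.random() < p else not truth

def estimate_proportion(reports, p):
    """Invert P(yes) = p*pi + (1 - p)*(1 - pi) to recover pi (requires p != 0.5)."""
    yes_rate = sum(reports) / len(reports)
    return (yes_rate - (1 - p)) / (2 * p - 1)

rng = random.Random(42)
true_proportion = 0.2   # hypothetical share holding the sensitive attribute
p = 0.75                # probability of answering truthfully, known to the analyst
n = 20000

truths = [rng.random() < true_proportion for _ in range(n)]
reports = [randomized_response(t, p, rng) for t in truths]

raw_yes_rate = sum(reports) / n
estimate = estimate_proportion(reports, p)

print(f"raw 'yes' rate: {raw_yes_rate:.3f}")
print(f"estimated proportion: {estimate:.3f}")
```

No individual report reveals the respondent's true status with certainty, yet the population proportion remains estimable, at the cost of a larger variance than a direct survey of the same size.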
With all these historical developments and emerging needs, it comes as no surprise that data privacy is on Wing’s list of 10, and it is also on the accompanying list by Xuming He, Professor and outgoing Chair of the Department of Statistics at the University of Michigan, and Xihong Lin, Professor and former Chair of the Department of Biostatistics at Harvard University (hereafter, the He-Lin list). I am also pleased to see that Wing’s list includes inference from noisy and/or incomplete data as a major challenge area from a computer scientist’s perspective. While inference is traditionally a central part of statistics, the high-3V (volume, velocity, and variety) reality calls for a significant overhaul as well as the creation of entirely new methods, especially regarding computational scalability and the trade-off between computational efficiency and statistical efficiency. Indeed, these two areas are on He-Lin’s list, going beyond the need for analyzing privatized data and thus demonstrating an increased awareness, from computer science and statistics/biostatistics communities, of the need to work together to advance data science.
Whereas computer science and statistics play vital roles in data privacy (and more broadly in data science), it would be a fatal mistake to think that only computer scientists and statisticians are needed for advancing research on data privacy. The thorniest issue in implementing differential privacy is a problem that neither field has much guidance to offer: what is the right amount of data utility we should give up for protecting data privacy, and vice versa—and right for whom? These are policy questions for our society, or whatever relevant communities, to answer. We will need at least experts and deep thinkers in ethical studies, economics, sociology, political science, and policy studies, as well as leaders and builders in governments and industries to be a part of the conversation. Data privacy is therefore a shining example of the ecosystemic nature of data science, in terms of both the problem scope and expertise required.
As another example in the domain of data privacy, Wing’s discussion of privacy includes a number of critical topics in addition to DP, such as secure multi-party computation, homomorphic encryption, and zero-knowledge proofs. This issue of HDSR also features an article by a team of economists, engineers, and computer scientists from MIT on “Secure Multi-Party Cyber Risk Computation” to address the need for data by corporations, regulators, and policymakers to measure broad cybersecurity risk exposures, yet “private-sector entities are unwilling to share their most-sensitive data due to legal and competitive concerns.” The article is another great demonstration of the multidisciplinary necessity for making headway, with the team delivering a platform that can securely gather data and compute aggregated risk measures without requiring individual firms to disclose their sensitive data, hence greatly incentivizing their participation.
In addition to privacy, the rest of Wing and He-Lin’s lists also share a number of common areas, such as interpretable learning, integrating multiple sources of data, and causal inference. Some of them will be discussed briefly below in connection with other articles in this issue. These areas are by no means mutually exclusive, however. As a matter of fact, the article by Zhiqi Bu, Jinshuo Dong, Qi Long, and Weijie Su in this issue on “Deep Learning with Gaussian Differential Privacy” nicely links the area of machine learning with differential privacy, providing both theory and methods for ensuring and enhancing privacy when reporting results from deep learning applied to private data.
“You must be kidding me, Xiao-Li!” Cooking as a major challenge in data science? Well, thank you for continuing to read my editorial, after a heavy dosage of data privacy and all the CS and Stats jargon (as any fan of Star Wars can testify, a vast ecosystem does mean that there are a variety of languages). In returning the favor (flavor?), let me “appetize” you (and reward you with a surprising recipe—after all, HDSR also stands for Harvard Data Science Recipes).
Imagine that after surviving another long day of Zooming, you are in the mood to treat yourself to a delicious and healthy dinner, and you happen to have some okra in your refrigerator. You never liked the way you cooked okra, because it always gets too slimy. You decide to search for a recipe on your phone. A bunch of things show up, but what catches your eye is a photo of an okra dish where the okra looks crispy and crunchy. Very intrigued, you click on the photo, but it doesn’t go anywhere. “Oh geez—how could I get a recipe for that?” (Hint: by fully consuming this editorial.)
Currently there are over 13 million photos on Instagram with the hashtag #homecooking, which might give you a sense as to why photo-to-recipe retrieval is not an academic gimmick for boosting course enrollments. And this is only one of the delicious challenges discussed in this issue’s Recreations in Randomness column by Shuyang Li and Julian McAuley. Would you also like personalized recipes, especially if you have unusual dietary restrictions? Is that possible, or just another example of personal(ized) wishful thinking?
These two delicious challenges are reflected on He-Lin’s list: Emerging Data Challenges and Quantitative Precision-X. By “emerging data,” they mean data that are not in the traditional numerical forms, but in forms such as pictures, audio, text, etc. (also see Wing’s “artisanal data” in her discussion of the challenge area of precious data). They also mean data that are created for purposes that traditionally have not received much (academic) attention, such as those created to misinform and mislead, e.g., fake news. Precision-X includes many varieties as well, from health care to education, and from custom services to consumer products.
Another rapidly-growing branch of data science that uses a great deal of image and video data (e.g., Google street view) is urban planning. But, as Fábio Duarte and Priyanka deSouza correctly emphasize in their essay on “Data science and cities: a critical approach,” this is another area of data science that goes significantly beyond computer science and statistics. Since a key aspiration of using data science for urban planning is to transform our societies through evidence-based planning, Duarte and deSouza emphasize that one “must examine the ontological and epistemological boundaries of the big data paradigm,” and its impact on short- and long-term consequences for cities. Such examination will need the expertise and direct input from at least policy makers and social scientists, among others.
Unsurprisingly, social scientists have also been taking on many data science challenges, including applying and understanding machine learning methods for seeking patterns and relationships in sociological data. The Fragile Families Challenge is a collaboration of 160 teams of data and social scientists, involving 457 researchers around the world to use data and machine learning to study one and only one question: How predictable are life trajectories? The findings “really, really surprised” its lead author, Matthew Salganik, Director of Center for Information Technology Policy and Professor of Sociology at Princeton University, as he reported in an interview conducted by Lauren Maffeo and Cynthia Rudin. The surprise came in two forms, as I now quote directly from the interview.
First, “none of the teams could make very accurate predictions, despite using advanced techniques and having access to a rich dataset,” and indeed for the six outcomes of the study, “The most accurate models had R² in the holdout data of about 0.2 for material hardship and GPA, and close to 0 for the other four outcomes.” Second, “there was very little difference between the predictions made by the different approaches,” and “that the most accurate models were only slightly better than a simple, four-variable regression model, where the four variables were chosen by a domain expert.”
There is so much to process and learn from this timely study, but I will focus on two points. First, there has been an increased amount of attention (and tension, I may say) devoted to the question “Why are we using black box models when we don’t need to?”, to almost paraphrase the title of an article by Rudin and Radin, which, incidentally, has had the highest number of page views among all articles published in HDSR so far. It is therefore not surprising that understanding and interpreting learning made it onto both Wing’s list and He-Lin’s list. A profound question here is whether humankind’s inability to understand a learning algorithm is a sign of hope (for the algorithm’s super-human capacity and intelligence), or hype (for mistaking ignorance as artificial intelligence).
Secondly, when asked by Rudin why such a mass collaboration is necessary, Salganik responded that it is because it is “great for producing credible estimates of predictability. If predictability is higher than you expect, it can’t be explained away by over-fitting, or by researchers selecting a metric that makes their performance look good. Alternatively, if predictability is lower than you expect, it can’t be explained away by the failures of any particular researcher or technique.” In this one answer, Salganik offered a solution to two related challenge areas from He-Lin’s list, Post-selection inference and Study design and statistical methods for reproducibility and replicability. The most dangerous post-selection inference is not the mechanical over-fitting—there are many guidelines and diagnostic and remedial tools. The most dangerous post-selection inference is purposeful over-fitting, for non-scientific and non-statistical reasons, such as for commercial profit or personal agenda. That is the one grand challenge in ensuring scientific reproducibility and replicability, when there are other significant forces in place to undermine science itself. The kind of “mass collaboration” led by Salganik is a powerful means to ensure scientific validity, as it has a much better chance to uncover problems, and guard against intentional or unintentional over-fitting—and more generally selection—biases.
Selection biases, aka cherry-picking, underscore the greatest challenge in data science or more broadly human reasoning: causal inference. It is almost a professional hazard for statisticians to suspect every statistical relationship one finds is associational instead of causal. We do our best to educate ourselves and everyone else to be extremely careful when reporting statistical relationships, and we rarely (never?) would declare a relationship causal without some caveats. For societal benefit, it is necessary to have at least one profession on constant look-out against being fooled by data, or those who are incentivized to fool others with data. This is especially pronounced in the area of causal inference, because we all are inclined to think causally—it is much more satisfying to understand why two things are related than merely knowing they are. It is yet another manifestation of our ancestral desire to simplify life in order to survive it. Any time we voluntarily take an action (e.g., take a medication or implement a policy), we must have had sufficient belief that the action will lead to our desired outcome, or at least not to its opposite. Therefore, as long as we remain human (e.g., before being taken over by AI), we are innately and eternally vulnerable to the seduction of causality.
It is therefore unsurprising that causal reasoning and inference made it onto both Wing’s list and He-Lin’s. It is also one of the two main challenges in “Statistical Sciences: Some Current Challenges” by David Cox, Christiana Kartsonaki, and Ruth Keogh; the other being generalizability versus specificity, which is closely related to causality since some may argue that generalizability is guaranteed only by causality. Cox et al. trace the history of causal inference back to R. A. Fisher’s work on experimental design about a century ago, but they recognize that “recent years have seen much explicit discussion of causality and of associated statistical methods and general reasoning for investigating relevant issues.” This sentiment is reflected in yet another article in this issue on causal inference, but this time it is from industry—LinkedIn to be exact—and its title says it all: “The Importance of Being Causal.” The authors—Iavor Bojinov, Albert Chen, and Min Liu—echo Cox et al.’s emphasis that the grand challenge of causal inference is dealing with confounding factors in observational studies, which are increasingly becoming the dominant type of data, especially those from social media and consumer behavior. They provided several case studies to showcase “how data scientists used observational studies to deliver valuable insights at LinkedIn”, and suggested that “firms can develop an organizational culture that embraces causal inference by investing in three key components: education, automation, and certification.”
Bojinov et al.’s call for firms to invest more in education provides a timely reminder of the importance of education for data science. The great challenges we face in data science education and communication are no less than in data science research and application. The grand challenge, as I discussed in my editorial for issue 2.2, is that our current education system teaches deterministic mathematical manipulation as students’ native language for quantitative reasoning, with probabilistic and statistical thinking as a second or even third language. As a non-native speaker of English, I am still getting the spelling and grammar wrong in 100% of my articles (and I am writing 100% as a statistician) after living in the US for over three decades, imposing so much work on proofreaders and copyeditors. Trying to undo something deeply ingrained in my brain during its formative age is a life-long struggle. While our world is becoming increasingly uncertain and unsettling, as the COVID-19 pandemic painfully and yet effectively reminds us, most of us do not have proofreaders or copyeditors for our mistakes in reasoning under uncertainty. Therefore, the need for effective education and communication about reasoning under uncertainty has never been this pressing and wide-ranging.
I am therefore especially delighted that this issue of HDSR features content that covers the entire education and communication spectrum. The collection starts in the playroom, literally, with the article “Data Science for Everyone Starts in Kindergarten: Strategies and Initiatives from the American Statistical Association” by 2020 ASA President Wendy Martinez and ASA Director of Strategic Initiatives and Outreach, Donna LaLonde. If you wonder how on Earth we can teach data science in kindergarten, well, think about games. Equally worth reading, especially if you are curious about ‘promposals’ (or want to be reminded of the adolescent years you have tried hard to forget), is this issue’s Minding the Future column, which features articles for and by pre-college students. The column’s first student author, Angelina Chen of Princeton High School, humorously reminds her fellow students of the importance of distinguishing association from causality (!), and why being able to think probabilistically might help them excel in both the classroom and the ballroom.
Moving beyond K-12 (kindergarten to 12th grade), in this issue’s Conversations with Leaders, a new series of interviews with HDSR, MIT President L. Rafael Reif stresses the critical importance of teaching “computational thinking, algorithmic thinking, data science thinking” throughout undergraduate studies and graduate disciplinary training. “It's basically, to me, like a new kind of math,” he declared. In another new feature, Emerging Voices, HDSR’s Early Career Board answers the calls from both Wing’s and He-Lin’s articles to ensure earlier and broader participation of young talents in data science research and education.
Of course, the healthy evolution of the data science ecosystem also depends on how effectively we can enhance the public’s understanding of and sophistication about reasoning with data. Timandra Harkness’ call, “Stop flaunting those curves! Time for stats to get down and dirty with the public,” is therefore extremely timely. The author of “Big Data: Does Size Matter?”, Harkness is a presenter, writer, and comedian. I hope you have had the opportunity to enjoy her bit-definition of data (“just information in a transferable form”) or wit-deterrence of skipping her YouTube video (“if you stop watching now, it would also produce data, probably pushing your profile’s impulsivity score up by 0.7 points, meaning in five years of time, some recruiting algorithm will reject you for that job of…”). But if you haven’t, this is your front-seat treat, cost- and Zoom-free!
To complete the metaphor ‘from playroom to boardroom’ (and this lengthy editorial), the article by Ulla Kruhse-Lehtonen and Dirk Hofmann on “How to Define and Execute Your Data and AI Strategy” is a must-read (and I rarely use this phrase, even for my own articles) for any business leader. It makes a host of concrete and practical recommendations, “from setting the ambition level to hiring the right talent and defining the AI organization and operating model.” Not surprisingly, effective and timely communications within a business organization are among the most crucial priorities. Indeed, this is the first time I have heard of an ‘AI strategist’ position, and I am not alone. As Kruhse-Lehtonen and Hofmann wrote, “Most companies do not have this role, but we see it as one of the most critical roles in the successful execution of Data and AI projects. Without an AI strategist, the communication distance between people with a business background and the data scientists is often too wide and can take some time to align.” This stresses further the point made by Katie Malone in her “Active Industrial Learning” column in issue 2.1 on how to deal with the great challenge of effective communication for successful business.
All right, I better stop this editorial before it becomes a textbook example of ineffective communication. But I do hope it conveys both a sense of excitement and urgency about addressing many challenges in data science. I am looking forward to discussions and debates from multiple perspectives on Wing and He-Lin’s lists of challenges, as enhanced and demonstrated by other articles in this issue. I am also looking forward to hearing from other constituting disciplines of data science to provide their list of X challenges, whatever the X might be. Interested authors should send their inquiries to editorinchief_datasciencereview [at] harvard [dot] edu.
Happy reading, pondering, and cooking!
As usual, this editorial is in a readable form thanks to numerous comments and edits from many authors and board members. I am especially grateful to Ms. Martha Eddison Sieniewicz, Special Assistant & Senior Communication Strategist to MIT President Rafael Reif, for going beyond the call of duty to provide detailed edits; and to Radu Craiu and Robin Gong, for always making sure that my fingers do not outpace my brain. I am also grateful to Mark Glickman for making his Recreations in Randomness column ever more delicious, and for encouraging me to provide the okra recipe, which I learned from a most memorable lunch in Shanghai when I was presented with okra for the first time. I immediately knew that it would not be the last time, even though I had no idea what these crunchy things were, in Chinese or English.
Xiao-Li Meng has no financial or non-financial disclosures to share for this editorial.
Input: Fresh, well-washed whole okras.
Step 1: Place okras in boiling water for X minutes, and then transfer them to a serving plate.
Step 2: Put the plate in a refrigerator for Y minutes.
Step 3: While waiting for the okras to be cooled and slightly chilled, prepare a dipping sauce.
My personal favorite is sushi-grade soy sauce and wasabi. Second favorite is soy sauce with a few hot peppers, and a few drops of sesame oil (optional).
Output: Delicious crunchy okras (level of crunchiness is adjustable via X).
1<X<2 minutes [X<1: too-raw region; X>2: entering the slimy region]
15<Y<30 minutes [depending on your refrigerator; over-chilling diminishes the crunchiness]
Your personal favorite (dry or exotic) white, and any article from HDSR.
©2020 Xiao-Li Meng. This editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the editorial.