Building a More Robust Data Science, Toward a More Robust World

Issue 4.3 / Summer 2022

Published on Jul 28, 2022

As I sit in London reading the final version of these articles, I am reflecting on the fact that a hypothetical weather forecast for 2050 is coming true in the United Kingdom over the next two days. Highs could approach 40 degrees Celsius (around 104 degrees Fahrenheit) for the first time—topping the United Kingdom’s hottest temperature ever recorded, which was 38.7 degrees Celsius at the Cambridge Botanic Garden in 2019. Meanwhile, fires are burning out of control across Europe. I can feel a shift in the seriousness with which climate change is being discussed, but we will need to turn these words into action and get to net zero targets around the world.

As always, reading HDSR fills me with hope. I am especially happy to read our special theme on the topic of Changing the Culture on Data Management and Data Sharing in Biomedicine, guest edited by Christine Borgman, Distinguished Research Professor and Director of the Center for Knowledge Infrastructures at University of California, Los Angeles; Maryanne Martone, Professor Emeritus of Neurosciences at University of California, San Diego; and Richard Nakamura, former Director of the Center for Scientific Review at the National Institutes of Health (NIH). The promise of data sharing and open science was dramatically illustrated by DeepMind’s breakthrough AlphaFold algorithm for predicting protein structure, which was trained on the protein structures that structural biologists had placed in public repositories and which DeepMind has in turn open sourced.1 This special theme arose from a workshop titled “Changing the Culture of Data Management and Sharing” that was held virtually by the National Academies of Sciences, Engineering, and Medicine on April 28 and 29, 2021. Guest editors Martone and Nakamura summarize the workshop and discuss the key takeaways in their special theme editorial, “Changing the Culture on Data Management and Sharing: Overview and Highlights from a Workshop Held by the National Academies of Sciences, Engineering, and Medicine” (2022b). Beyond a collection of reports, Martone and Nakamura also share a fascinating conversation with Dr. Lawrence Tabak, current Acting Director of NIH, and Dr. Lyric Jorgenson, Acting Associate Director for Science Policy and Acting Director of the Office of Science Policy at NIH (Tabak et al., 2022).
Data sharing and open science are necessary components of building trust in science and improving the efficiency and effectiveness of science, not only in allowing for reproducibility, but in changing the norms and thus the equilibrium under which we all operate, knowing that important results will be checked and verified by independent labs. I found the following remark by Martone and Nakamura especially striking: “Honoring data sharing means data is valued as a first-class research output by researchers, societies, institutions, libraries, journals, and funders, and people can advance their career through data sharing in the same way they can through publication” (2022a).

Turning to the rest of this issue, our friends at the Michigan Institute for Data Science (MIDAS), Jing Liu, Jacob Carlson, Joshua Pasek, Brian Puchala, Arvind Rao, and H. V. Jagadish, share their own work, “Promoting and Enabling Reproducible Data Science Through a Reproducibility Challenge” (Liu et al., 2022). (I can’t resist also pointing to our own efforts at the HDSI to study Trust in Science through a data science lens, under the leadership of Sheila Jasanoff.) Liu and her colleagues describe the University of Michigan’s Reproducibility Challenge, which seeks to understand existing efforts in promoting reproducibility and make recommendations for the role of universities (there is also an online collection of tools and processes). The authors group the responses to a survey of researchers around the university into themes, advocate for the availability of consultation for reproducible research and increased investment in resources for reproducibility, and make recommendations for universities and data science units in promoting data science reproducibility.

The article “From Unicorn Data Scientist to Key Roles in Data Science: Standardizing Roles” by Usama Fayyad and Hamit Hamutcu (2022) shares the ongoing work of the Initiative for Analytics and Data Science Standards, which has developed a framework for professional standards in data science. An earlier article by Fayyad and Hamutcu (2020), also published in HDSR, gives an overview of their knowledge framework. Although there’s plenty to like about unicorns, at least in general terms, it seems very sensible to emphasize, as Fayyad and Hamutcu do, the myth of the ‘unicorn data scientist’ who can master everything required to fulfill a data science role. They stress instead the role of data science teams and the use of three ‘post-unicorn’ titles, those of data scientist, data analyst, and data engineer, and the article maps each one of these to associated technical knowledge. I’m always a fan of simplification when it brings clarity, and I think these authors have done just this.

My colleague, Iavor Bojinov at Harvard Business School, and Somit Gupta, a Principal Data Scientist on Microsoft’s Experimentation Platform (ExP) team, have contributed a thoughtful article, “Online Experimentation: Benefits, Operational and Methodological Challenges, and Scaling Guide” (Bojinov & Gupta, 2022), which digs into the importance of business experimentation and the challenges, both organizational and methodological, in getting experimentation right. Their discussion builds on the work of Ronny Kohavi, who headed ExP for over a decade and popularized another animal (the HiPPO, not the river-dwelling kind but the “highest paid person’s opinion”). Some of the operational challenges that Bojinov and Gupta highlight include the need to develop the culture and capacity for experimentation, including a suitable data and experimentation platform. Methodological challenges relate to the complexity of experimentation, arising for example from the need to understand long-term impact from shorter-term measures (it may take 12 months or longer for the full effect of a change to be realized, but be too costly to run a treatment for that long) and to handle interference amongst different units of an experiment (this is one of the reasons why some social media companies have been known to use New Zealand as an experimental unit, given its relative isolation from the rest of the world!). Bojinov and Gupta also note, with some irony, that to really understand the value of an experimentation culture, we’d need to run an experiment!

Vaishali Jain, Ted Enamorado, and Cynthia Rudin (2022) contribute a compelling study, “The Importance of Being Ernest, Ekundayo, or Eswari: An Interpretable Machine Learning Approach to Name-Based Ethnicity Classification,” delving into name-based ethnicity classification. This is an important tool with which to estimate racial bias across different settings ranging from lending, to health care provision, to resume screening, to policing. Whereas others have developed deep learning pipelines, this article reports better performance (in terms of balanced accuracy, and accuracy on minority groups) from a carefully crafted, multinomial logistic regression classifier. The authors also emphasize the importance of knowing what is not known: the “indistinguishable names,” where there is high uncertainty and where it would introduce bias to make a hard assignment. The method is showcased by exploring campaign donation trends for the 2020 presidential race (Trump vs. Biden) and the 2020 Georgia Senate race.

Lastly, we have an interesting piece in this issue’s Minding the Future column, which is always expertly edited for us by Nicole Lazar. The article, “Bringing Complex Data Into the Classroom” by Pip Arnold, Anna Bargagliotti, Christine Franklin, and Robert Gould (2022), is written by four authors of the Pre-K–12 Guidelines for Assessment and Instruction in Statistics Education II (GAISE II) (see also the HDSR article “Introducing GAISE II: A Guideline for Precollege Statistics and Data Science Education” by Franklin and Bargagliotti [2020]). This is the last of a series of articles that we have featured on GAISE II (see Perez et al. [2021] for “Level A” students and Arnold et al. [2021] for “Level B” students), this one turning to “Level C” students (roughly high school age in the United States, 14–18-year-olds). Arnold and colleagues showcase the use of a data visualization from the news as a way to engage learners with complex data, through a collaboration between the American Statistical Association and The New York Times. This makes use of a graphical visualization based on the American Time Use Survey (ATUS), which measures the amount of time people spend doing various activities, such as paid work, childcare, volunteering, and socializing. The authors have developed a curated version of the ATUS dataset and also share a potential lesson plan in the Supplementary Files of their article. It is inspiring stuff, encouraging students to interrogate data and grapple with its quality and provenance, including asking questions about data limitations, potential biases, and the promise for statistical analysis.

Wishing you all happy data science reading!


Disclosure Statement

Other than the aforementioned current sabbatical at DeepMind, David Parkes has no financial or non-financial disclosures to share for this editorial.


References

Arnold, P., Perez, L., & Johnson, S. (2021). Using photographs as data sources to tell stories. Harvard Data Science Review, 3(4). https://doi.org/10.1162/99608f92.f0a7df71

Arnold, P., Bargagliotti, A., Franklin, C., & Gould, R. (2022). Bringing complex data into the classroom. Harvard Data Science Review, 4(3). https://doi.org/10.1162/99608f92.4ec90534

Bojinov, I., & Gupta, S. (2022). Online experimentation: Benefits, operational and methodological challenges, and scaling guide. Harvard Data Science Review, 4(3). https://doi.org/10.1162/99608f92.a579756e

Fayyad, U., & Hamutcu, H. (2020). Toward foundations for data science and analytics: A knowledge framework for professional standards. Harvard Data Science Review, 2(2). https://doi.org/10.1162/99608f92.1a99e67a

Fayyad, U., & Hamutcu, H. (2022). From unicorn data scientist to key roles in data science: Standardizing roles. Harvard Data Science Review, 4(3). https://doi.org/10.1162/99608f92.008b5006

Franklin, C., & Bargagliotti, A. (2020). Introducing GAISE II: A guideline for precollege statistics and data science education. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.246107bb

Jain, V., Enamorado, T., & Rudin, C. (2022). The importance of being Ernest, Ekundayo, or Eswari: An interpretable machine learning approach to name-based ethnicity classification. Harvard Data Science Review, 4(3). https://doi.org/10.1162/99608f92.db1aba8b

Liu, J., Carlson, J., Pasek, J., Puchala, B., Rao, A., & Jagadish, H. V. (2022). Promoting and enabling reproducible data science through a reproducibility challenge. Harvard Data Science Review, 4(3). https://doi.org/10.1162/99608f92.9624ea51

Martone, M., & Nakamura, R. (2022a). Changing the culture on data management and sharing: Getting ready for the new NIH data sharing policy. Harvard Data Science Review, 4(3). https://doi.org/10.1162/99608f92.6650ce2b

Martone, M., & Nakamura, R. (2022b). Changing the culture on data management and sharing: Overview and highlights from a workshop held by the National Academies of Sciences, Engineering, and Medicine. Harvard Data Science Review, 4(3). https://doi.org/10.1162/99608f92.44975b62

National Academies of Sciences, Engineering, and Medicine. (2021, April 28–29). Changing the culture of data management and sharing: A workshop. https://www.nationalacademies.org/event/04-29-2021/changing-the-culture-of-data-management-and-sharing-a-workshop

Perez, L., Spangler, D. A., & Franklin, C. (2021). Engaging young learners with data: Highlights from GAISE II, Level A. Harvard Data Science Review, 3(2). https://doi.org/10.1162/99608f92.be3c2ec8

Tabak, L., Jorgenson, L., Martone, M., & Nakamura, R. (2022). Conversation with Dr. Lawrence Tabak and Dr. Lyric Jorgenson on the NIH perspective on data sharing and management. Harvard Data Science Review, 4(3). https://doi.org/10.1162/99608f92.b9e4ceec


©2022 David Parkes. This editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the editorial.
