Harvard Data Science Review’s interim Co-Editor-in-Chief Francesca Dominici recently met with Dr. Sylvia Richardson, Emeritus Director of the Medical Research Council Biostatistics Unit at the University of Cambridge, Immediate Past President of the Royal Statistical Society, and HDSR Associate Editor. The pair discussed the definition of data science, Dr. Richardson’s data science journey, her areas of research, and what it means to be an effective data science leader.
This interview is part of HDSR’s Conversations with Leaders series.
HDSR includes both an audio recording and written transcript of the interview below. The transcript that appears below has been edited for purposes of grammar and clarity with approval from all contributors.
Francesca Dominici [00:00:02]: Welcome to Conversations with Leaders. My name is Francesca Dominici, and I am a professor of biostatistics at the Harvard T.H. Chan School of Public Health and a faculty co-director of the Harvard Data Science Initiative. It is really my honor today to be speaking with Professor Sylvia Richardson. I've had the pleasure of knowing Sylvia for a very long time and looking up to her as a really good role model for women in statistics. She's a professor at the University of Cambridge and past president of the Royal Statistical Society (RSS). Welcome, Sylvia.
Sylvia Richardson [00:00:50]: Thank you for inviting me, Francesca.
Francesca Dominici [00:00:52]: I'm honored to host Sylvia today on behalf of Harvard Data Science Review. And as you might know, our Xiao-Li Meng formally launched Harvard Data Science Review a few years ago with the idea to make data science truly, truly accessible to everyone and to build a community of data scientists to try to understand what data science is about. And this is part of a series of podcasts that we have with Conversations with Leaders, where we learn from them about their careers and their ideas about data science. Sylvia, so tell us: What's data science for you?
Sylvia Richardson [00:01:44]: It's hard to be original, but I was racking my brain for a good metaphor, and came up with the metaphor of a rainbow of interconnected disciplines, sharing the common aim of making the best use of the data-rich environments we live in to solve problems in society. So, like in a rainbow, data scientists have to work together to draw out information from data. And the colors must match, though they are different. Similarly, there are different but intersecting data science tasks, taking different shapes and forms. As data scientists, we recognize and enjoy diversity; we're not all doing the same tasks. Nevertheless, there is a backbone, a shape to the rainbow. And for us, this backbone is probability theory, study design, and quantifying uncertainty using statistical thinking. We also know that rainbows change all the time. They don't last, but they keep reappearing. Data science is also evolving constantly because new questions and new types of data keep arising. In a similar way to the rainbow, which is strongly influenced by the atmosphere, one key aspect of data science is that we have a strong link to practice. So, we work together to solve problems from different perspectives, we evolve, we try to be relevant to science and society, and make the best use of the data.
Francesca Dominici [00:03:33]: Such a wonderful metaphor, Sylvia, thank you. And rainbows, colorful rainbows, happy rainbows, integrate data and mimic the rainbow's flexibility. It has a beautiful shape that really communicates the foundational principle of what data science is as a discipline. And you correctly pointed out the importance of flexibility because this is such a quickly moving discipline. So, looking back at your incredible career with so many achievements: you have moved between institutions, you have worked in different fields. I'm wondering if you could tell us which fields of data science you have been most interested in and which have really excited you? And what has had the biggest impact in the field and in science overall?
Sylvia Richardson [00:04:42]: Thinking back, there are three main areas of statistics which have excited me: spatial statistics, mixture models, and high-dimensional analysis of genomics data. I realize they appear quite diverse, but the common thread for me was to try to best exploit heterogeneity in epidemiological, biomedical, or molecular data to learn about health determinants or biological mechanisms. The way I've approached this has been to build Bayesian graphical models where the structure is informed by subject matter knowledge. I worked first in spatial statistics, as you know. And my first contribution was to develop a test of association between two spatial processes where you decrease the degrees of freedom to account for the spatial correlation of the data points. This test has been widely used in geography, ecology, and especially epidemiology. But my main focus turned to demystifying the origin of what is called ecological bias, a bias which arises because when you analyze aggregated data, you lose individual-level information. So, studies based on area-level data are hard to interpret. But then I showed how it was possible to control ecological bias by combining aggregate and individual-level information from a survey, if you happen to have survey data on the same risk factors. This idea of controlling bias through evidence synthesis comes up in many different contexts and actually was central in some of our recent COVID work to estimate prevalence.
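[Editor's illustration, not part of the interview: the ecological bias described above can be sketched with a small, fully hypothetical example. The five areas, their exposure shares, their baseline risks, and the individual-level risk difference of 0.10 are all invented for illustration. The sketch shows only one possible source of ecological bias, namely baseline risk that varies across areas together with exposure; it is not Dr. Richardson's method for controlling the bias.]

```python
# Hypothetical areas: (share of residents exposed, baseline risk of the unexposed).
# All numbers are made up; baseline risk rises with the exposure share,
# which is the area-level confounding that aggregation hides.
areas = [(0.1, 0.08), (0.3, 0.14), (0.5, 0.20), (0.7, 0.26), (0.9, 0.32)]

RISK_DIFFERENCE = 0.10  # assumed individual-level effect, identical in every area


def ols_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


exposure_share = [x for x, _ in areas]

# What an area-level (aggregated) data set records: the average risk per area,
# mixing exposed and unexposed residents.
area_risk = [baseline + RISK_DIFFERENCE * x for x, baseline in areas]

# Slope seen by an analyst who only has the aggregated data.
ecological_slope = ols_slope(exposure_share, area_risk)

print(f"individual-level risk difference: {RISK_DIFFERENCE:.2f}")
print(f"slope from area-level regression: {ecological_slope:.2f}")
```

[In this constructed example the area-level slope is four times the individual-level effect, because the aggregated data cannot separate the exposure effect from the between-area differences in baseline risk. Individual-level information, such as a survey within areas, is what would allow the two to be told apart.]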
[00:06:32] Now, a natural way to fully explore heterogeneity in biomedical data is to turn to mixture and partition models. And this was my next focus. Mixtures can be used at the observational level or in Bayesian hierarchical models, for specifying flexible prior distributions of latent quantities. In either case, if you want to be flexible (this was my purpose), you don't want to assume you know in advance the number of components, but you want to learn it. Using Peter Green's reversible jump paradigm, together we worked to develop fully Bayesian mixture analysis with an unknown number of components, and this led to our 1997 RSS discussion paper. But while I was developing that work, I was always very conscious that clustering and mixture modeling are fundamentally unidentifiable problems. What is the ‘right’ mixture? It has to be defined with respect to an objective. For answering health questions, I thought that a useful way to make the clustering output more interpretable was to look for mixtures or partitions which are predictive of a health outcome of interest. For example, to cluster patients sharing similar biomolecular profiles and levels of risk. And so we developed what we called “outcome-guided clustering,” in which cluster membership is associated with an outcome. This approach has found many applications in epidemiology and beyond. Currently, we're using it to characterize patterns of multimorbidity in electronic health records. So, it really is quite versatile.
[00:08:36] Finally, like many of us, I was excited by the explosion in the collection of genetics and genomics data in the 2000s, and I got interested, in particular, in integrative genomics and understanding gene regulation. Luckily, I had many biological colleagues to work with and to learn biology from. And so, I developed an ambitious analysis framework of hierarchical regressions adapted to this question. It was ambitious because it was high dimensional in two directions. My aim was to analyze jointly many responses, typically whole sets of gene expressions, and many predictors, typically a block of correlated genetic variants. The key idea was that by borrowing information between the responses, we would have more power to find genetic variants, so-called hotspots, which are associated with the expression of multiple genes. Such variants are of interest as they may initiate key biological mechanisms. Recently, we proposed a variational Bayes version of this model, which scales up to 20,000 gene expressions and thousands of genetic variants. In the end, our hierarchical regressions framework and its scalable implementation give biologists a powerful tool for understanding gene regulation. Lately, both for mixtures and genomics, I have focused on making Bayesian computations effectively scalable, and I have turned to using, if necessary, approximate inference in order to increase the impact and uptake of the models and tools that we are developing.
Francesca Dominici [00:10:22]: Thank you. Thank you, Sylvia. I think what has been fascinating to me, listening to your career path, is that all of your interests really exemplify what data science is about, which is extracting new information from data. Your interest is in extracting information from data that can be spatially varying, that are collected at different levels of aggregation, and that are complex, and in developing models that need to be flexible to capture heterogeneity as well as to allow for scalability. These are all really important statistical concepts. And I know that you have also made enormous contributions in the field of genetics and genomics, which is also one of the really high-importance areas in data science. So, going back to you, there are really more contributions than we have time to talk about, but tell us, what's your favorite one? What was the one that got you most excited? That kept you up at night? Is there one, or did you embrace everything with the same amount of enthusiasm?
Sylvia Richardson [00:11:48]: I embrace everything with the same amount of enthusiasm, but I would say that a key strand of my research has been mixture models, flexibility, and understanding sources of heterogeneity, from the macro to the micro levels. For example, understanding the drivers behind heterogeneity of risks at the aggregated geographical level or investigating heterogeneity of molecular phenotypes at the biological level. And I'm carrying on working on that topic at the moment, in particular on scalability of clustering methods for large data. I would say that this is one area which has been a constant strand of my research.
Francesca Dominici [00:12:24]: And it is, I think, one of those foundational ways of thinking in data science that underlies most data applications, right? Whether you are thinking about environmental data or genomics data or biomedical data, we are all dealing with the issue of heterogeneity. So, what do you think are the biggest challenges that the field of data science is facing?
Sylvia Richardson [00:12:49]: I believe one of the biggest challenges and difficulties is fragmentation. We embrace diversity in data science, but this diversity could lead to fragmentation or, on the contrary, be very beneficial. We have to make sure that the distinct disciplines involved in data science, with their different application fields and different ways of thinking, are not made into silos, and that we manage to make this ecosystem something very rich for all of us and for data science. The project of Harvard Data Science Review saw the potential threat of fragmentation and took the lead in trying to counter it, so that data scientists talk to each other. And I would really encourage more initiatives to translate concepts across the disciplines that make up data science. I would also encourage that we endorse some key ideas across the disciplines. For example, that all data scientists have to be educated about selection bias! Because if you think of all the algorithms which use training data without really scrutinizing the provenance of the training data and whether it could have potential bias, there is a lack of awareness of the potential consequences of selection bias on the performance of the predictive algorithms. This is just one example of a key challenge and focus of attention that should be endorsed by all of us. More broadly, it would be useful to gather a list of key concepts that we should all understand in a fairly similar and deep way. Then we could build interactions more effectively.
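[Editor's illustration, not part of the interview: a minimal simulation of the selection bias mentioned above. Everything here is a stated assumption: a true prevalence of 5%, and a data set in which infected people are five times more likely to appear than non-infected people. The correction shown, inverse-probability weighting, assumes the selection probabilities are known exactly, which is rarely true in practice, where they have to be estimated from additional data.]

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

TRUE_PREVALENCE = 0.05  # assumed share of infected people in the population
# Assumed probability of ending up in the data set, by infection status.
P_SELECTED = {True: 0.5, False: 0.1}

# Simulate infection status for a hypothetical population.
population = [rng.random() < TRUE_PREVALENCE for _ in range(200_000)]

# Only some people are observed, and selection depends on the outcome itself.
sample = [infected for infected in population
          if rng.random() < P_SELECTED[infected]]

# Naive estimate: treat the selected sample as if it were representative.
naive = sum(sample) / len(sample)

# Inverse-probability weighting: each observed person stands in for
# 1 / P(selected) people like them in the full population.
weights = [1.0 / P_SELECTED[infected] for infected in sample]
weighted = (sum(w for w, infected in zip(weights, sample) if infected)
            / sum(weights))

print(f"true prevalence:   {TRUE_PREVALENCE:.3f}")
print(f"naive estimate:    {naive:.3f}")
print(f"weighted estimate: {weighted:.3f}")
```

[Under these assumptions the naive estimate is roughly four times too high, while the weighted estimate lands close to the true value. A predictive algorithm trained on the unweighted sample would inherit the same distortion.]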
Francesca Dominici [00:14:37]: So what do you think are the biggest opportunities?
Sylvia Richardson [00:14:41]: Broadly speaking, I believe that bringing subject matter knowledge into the big world of algorithms and machine learning is one big opportunity. And you wouldn't be surprised if I highlight, in particular, data fusion as a huge opportunity to deliver progress for society and science. It requires good computational methods combined with deep understanding of what the problem is and what are the characteristics of the data types. Then you have to be creative and translate your knowledge into the formulation of a model and overarching integrative structures which encode your assumptions. So this is the opposite of a black box approach; you cannot do data fusion, in my opinion, automatically as a black box procedure. But implementing data fusion on real big problems is hard, and there's definitely an opportunity to learn from each other, for example, by bringing major challenging cases of data fusion out in the open, getting different teams to think about solving these, and drawing lessons on how we can make progress in implementing informed scalable data fusion.
[00:16:06] When we were working during COVID in the Turing-RSS Health Data Lab supporting the UK Health Security Agency, we tried to operationalize scalable data integration and we focused on modularity as one of the key ingredients. There is great interest in the topic of modularity at the moment in data science. My view is that trying to understand how to build large-scale models in a modular fashion and how to transfer information between the modular components, possibly in an approximate way, will make data science have a key impact on the big challenges that the world is facing; for example, climate change. Such an approach to modelling was very useful during the pandemic. So, implementing knowledge-informed data fusion and making it scalable and modular seems to me an important opportunity.
Francesca Dominici [00:17:02]: Thank you. Thank you, Sylvia. And it's incredible how you're talking about challenges and opportunities, again, that are really illustrations of the central role of data science in tackling the most important societal problems, including COVID and climate change. So, we're getting to the end of the conversation, but I wanted to take the opportunity of talking to a leader like yourself to ask for some advice about leadership. You have been the president of the Royal Statistical Society. You have so many accomplishments and have held many leadership positions. So how is it to be a leader in data science? What is your leadership style, and what are your recommendations for the future generation on how to be an effective leader in this new world of data science?
Sylvia Richardson [00:18:09]: Well, it will be quite challenging because, to be an effective leader in data science, I think you need depth and breadth. Usually, it was enough to have depth. The standard recognized qualities for a leader in academia were very good publications, having trained a lot of people, having made an impact, and having thought very deeply into one domain and being a recognized expert in that domain. Data science leaders, indeed, have to be deeply anchored and expert in one domain. But, on top of that, they also need to be able to reach across the unique breadth of the data science fields. Crucially, that means having the ability to work effectively in an interdisciplinary manner and being convinced of the benefit of bringing different disciplines to bear on problems. So, you need to be open and able to make connections between your own set of experiences and questions in other domains. In effect, to be constantly looking for connections between different topics and suggesting these to others, as well as being very generous in sharing your ideas and your intuition, so that you encourage people to take up and try these ideas. That has been my style of leadership.
[00:19:28] Another quality which I think is important is to keep a critical mind and be prepared to be openly critical. That comes with experience. As part of our training, each of us develops a critical mind, but the next step is to be able to ask a critical question nicely; that is, dare to criticize. Discussion meetings at the RSS are an excellent training school to see how people can be highly critical but in a nice and constructive way. This is something to encourage in the future.
[00:20:04] Also, to be a good leader you need to encourage equality and diversity and put that high on the agenda and think of the consequences of whatever decision you take with respect to diversity. In the RSS, we appointed last year an Honorary Officer for Equality and Diversity and we're pushing this agenda to the forefront. I am highly supportive of that, as every leader should be.
[00:20:33] On the data science front, while I was president, I felt a sense of urgency to encourage the RSS to revisit its engagement with data science, and I created a data science task force right at the beginning of my presidency. It didn't get going earlier because there was COVID to keep us busy! Nevertheless, the Data Science Task Force got underway in 2021 and came up with two major recommendations. One was to give more resources to the practitioners’ community, which led the RSS to create a Real World Data Science online platform. A second direction was to brainstorm on what is still needed for the discipline to thrive. [HDSR Founding Editor-in-Chief] Xiao-Li was part of this task force and helped the task force to brainstorm. The idea that it would be good for the RSS and for the whole of the community to create a new data science journal emerged from these discussions, and we are now in the process of developing a pan-data science journal! So, if I look back at my presidency, I am happy to have been able to contribute to formulating a renewed strategy of engagement with data science at a time when it was absolutely topical. It was urgent that the RSS increase its visibility and its direct impact on the field.
[00:22:08] We have a lot of responsibility nowadays as data scientists, because one consequence of COVID has been that people have realized how important it is to analyze data in a correct way. The pandemic has shown that analyzing data is not just an abstract exercise; it can directly impact people's lives. There are huge societal implications to our work. The future leaders will have to build on that increased recognition, take this responsibility in hand, be conscious of the great interest that society has for data science, and take a view on how to best progress our endeavors.
Francesca Dominici [00:22:50]: So how is it to be a woman leader in data science?
Sylvia Richardson [00:22:55]: As you know, you do become a role model as you get older, whether you want it or not. I've tried to be very open about the struggles I have faced and the different choices one has to make. We all have to consider work-life balance, but as a woman, there are particular types of ‘balances’ you have to deal with, and particular issues you have to contend with. The most helpful is to be open about it and make sure people don't think there's only one way to achieve a successful career, but that there are many paths to becoming a good data scientist and having a voice as a woman leader. And what you yourself need to do is encourage and support women colleagues and women students to find their own fulfilling path. Mentoring and finding mentors to help younger generations of women make crucial choices at some key time points is how I like best to share my experience. I believe very strongly in mentoring.
Francesca Dominici [00:23:52]: Thank you, Sylvia. This has been so illuminating and fun. Thank you so much for your time and for sharing your experience and your wisdom.
Sylvia Richardson and Francesca Dominici have no financial or non-financial disclosures to share for this interview.
©2023 Sylvia Richardson and Francesca Dominici. This interview is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the interview.