Xiao-Li Meng (XLM): Thank you, Bin, for joining us today for this great conversation. I want to thank you again for contributing a 50-page article on COVID-19 to the Harvard Data Science Review. It's really an incredible and important project. I want to first ask you, what motivated you to put together a team to work on this really big and timely project?
Bin Yu (BY): Thank you, Xiao-Li, for having me. It's really my pleasure to tell the stories behind the paper. So, in late March, I saw a call for data science expertise by a very new nonprofit organization called Responsible Life. Immediately, I responded. I remember it was a Saturday, like March 21st or 22nd [2020], because I always have been advocating for making a positive impact with our work—statistics, data science, or machine learning. I thought that was an opportunity to actually walk the talk. I reached out to some of my students, and they all jumped in with no hesitation. I tried to protect some people. I didn't reach out to everyone, just because I thought they had other things to do. But in the end, they jumped in, too, and we had 12 people, including a former student.
XLM: That's actually a lot of people to work with, particularly with such an important topic.
BY: In the end, it turned out like that, but I didn't really have time to think, I just felt that it was the right thing to do. I had the expertise, my group had the expertise, and what we did to help was really when we had this crisis of PPEs [personal protective equipment]. I hope we don't see that in this third or second surge of the COVID right now. The number is actually higher right now than in March.
XLM: You have done many projects in your life and in your career. What made this one particularly stand out?
BY: Well, in the sense that there was not much planning. Usually I plan, but I didn't have time for this one. We had to do something, as other people's lives were on the line. So, it's a very condensed timescale, the whole organization and us. The organization was only one week old. It was just volunteers getting together. It was like a data science ER project, an emergency room project. We needed so many skills that are usually not needed for a traditional research project.
XLM: That's actually a really important point. Let me pick up on that. COVID-19 has generated this awareness that the data science itself, including statistics, does not really have a kind of a rapid response team like it is in an ER environment. So, for you, it sounds like you have to do everything from scratch. I remember you mentioned it's warlike. Is the environment really like that?
BY: Yes, because we had to do something—that was clear in everyone's mind—but what to do hadn't been thought out. So, the first thing I asked, responsibly, was, ‘Where's the data? We're data scientists.’ And they said, ‘We have no data. You have to find the data yourself.’ It was like I had to take a step back. And then I started asking, ‘Suppose we distribute the PPEs’—I was just using my usual research-planning mind, thinking of all the steps—and then I would ask, ‘Who will be receiving the PPEs at the hospitals?’ They said, ‘Oh, that's a good question. Maybe you could look into that.’ Everything I asked was usually answered with ‘Maybe you could look into that,’ because I was just volunteering and I ended up on the leadership team very quickly.
Every day we had an 8:30 a.m. call to plan for the next day. The personnel was also shifting because people came and went. Some people could only take two weeks off from work. They had to go back to their jobs. There were two people we worked with very closely, the founder, Rick Brennan, and also the logistics lead, Don Landwirth. Other people would just come and go, very dynamic. My team, my students, all stayed and they were really amazing. We worked well together. And I had a deputy, one of my students in CS, Chandan Singh, who was really, really a super organizer. I interfaced with a lot of other connections, getting the background information, and he would be organizing. I would engage, but he would also do a lot of detailed organization of my team and also interfacing with the logistics team and with the Salesforce database system. So, really, without him, it would have been very difficult, even with the team I had, to really make it work.
XLM: I want to follow up on something you said, which I think is really a key part of how we will build this kind of fast response team in data science. As you said, you asked a lot of questions, then ended up with people coming back to you, saying ‘You're the one who should think about that.’ Now, as researchers, we are always thinking broadly, deeply, but we usually take time to do that thinking. But that's in academic research where we have time to do those things. Now you're pretty much put on the spot to make a decision, and maybe even need to prioritize things. I want to ask you, though, what were your guiding principles? How did you make these fast decisions? I know you work on principles like stability, all of these issues. Would that thinking help you in this kind of environment?
BY: Not the first few weeks, because we didn't even have data and the pipeline was not ready. Later, I think, it was useful. In the beginning, it was really not traditional data science or statistics work. Most of the things were things that, as a person, I didn't mind doing or even liked doing, but they were not part of our training, for me or for the people in our program. There were a lot of human connections. I always like people and all my connections helped. So, everybody I reached out to was super helpful. I always rely a lot on my personal social network, and everybody was just amazing. It was just a lot of people I was reaching out to, and everybody just responded in such a timely manner. We basically didn't have any break for two months. Just sleep and get up and do this and organize the teams into, like, data team, modeling team, and also an interfacing PR team, volunteer organization team. And later we had the writing. I brought more people in to do the writing of the paper. So, it's very much divide and conquer. But one thing I have to say is that I think we already had a very good collaborative and coherent group culture where nobody felt like they were not treated fairly. Everyone was just doing the right thing and we worked very seamlessly together without any planning. So, I was very pleased with how everybody worked together. It was a positive experience for all of us.
XLM: As I'm listening to you, I started thinking that you actually laid out a few guiding principles of building this rapid response team. One is, obviously, this kind of coordination. The other, as you said, is to divide and conquer. Each team has its responsibility. The other thing I thought was very interesting when you said you just reached out to all kinds of people. I think a technical term for that is that you’re literally collecting data in real time and even without any particular protocol. But you are trying to find data, whatever you can, right? That sounds like what you were doing at that time.
BY: Yes, it's also expertise. First, I thought I had to understand how the logistics supply chain works, so I lined up an OR team from our IOR exchange team. In the end, we didn't work with them, but they were ready to work with us, if I had to do some supply chain optimization with all these costs. In the end, we took a holistic approach because there were donors engaged and they wanted to ship to certain places. So, it was not really completely left to us.
But the data team was Tiffany Tang and Hugh (Yu) Wong. Amazing students, because, usually as statisticians, we don't do that; we get handed the data. But they didn't hesitate. Just jumped in. I was reading news all the time. My team was too. And we were all sending them links to scrape data. Tiffany said that she actually fell in love with data cleaning and curation. I thought that was wonderful, because that's the spirit I think we need to have if we want to cover everything instead of waiting for somebody to clean the data for us before we can do anything.
XLM: Right, that definitely has been a problem for us as a profession. As much as we emphasize data collection, experiment design, sample samples—we know all this stuff—but in terms of who actually cleans the data, usually that's not done by statisticians. This is a kind of ‘dirty work’ to us, done by someone else, right? But anyone who has done real data science knows that 70–80% of the time is spent on trying to preprocess, get the data clean, and make them useful.
BY: I think we should have some award or some competition on this front to encourage people and make it a positive thing to do this good, important work.
XLM: That's a good point.
BY: And for me, it was not clear to us what we signed up for, or what turned out to be doable, because the hospital didn't want to receive the PPEs and we couldn't find the makers to make them. Whatever we do, without the positive impact, we can still say we tried. But one thing that was clear in my mind was that the database we curated would be very useful. So I said very early on that the minimum contribution we can make is to curate this data repository. That's why, if you look at the title, data repository was in the title of the paper. And I felt like that's a contribution already. With all the previous work I have done, data is always a problem and I feel a lot of people might be doing similar things. So if we could help all the other teams to do their job, even if we couldn't push very far—in the end, that we did too—I felt that there was enough contribution.
XLM: Yes, well, thank you again for sending the paper. I want to follow up on that, as well. You may remember that initially I reached out to you about potentially writing a paper. At the time, you were way too busy. You were not really thinking about putting together an article. So, what really helped you to actually put together this article? Writing an article is another task, and for this kind of a project it’s not easy at all, with so many moving pieces and so many people involved. How did you organize this article itself? That’s a useful lesson for the data science community as well.
BY: So what happened was that, after we had data, we had a discussion of what to try to predict. We went for the death count instead of the case number, because the case number especially in the beginning really relied on how many tests were done. It was very rare. Now it may be more stabilized, and we can kind of adjust, whatever. For the death count, the highest granularity we can get is at the county level, and we had to impute to the hospital level with the number of employees. So that’s what we felt was the best we could do for the goal we had in mind.
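As a rough illustration of the imputation step described here, the sketch below splits a county-level count across that county's hospitals in proportion to their employee counts. The function name, the hospital names, the numbers, and the even-split fallback are all hypothetical; this is one plausible reading of "impute to the hospital level with the number of employees," not necessarily the procedure used in the paper.

```python
def allocate_to_hospitals(county_deaths, employees_by_hospital):
    """Split a county-level count across hospitals in proportion to employee counts.

    Hypothetical helper: the proportional rule is an assumption made for illustration.
    """
    if not employees_by_hospital:
        return {}
    total_employees = sum(employees_by_hospital.values())
    if total_employees == 0:
        # No staffing information at all: fall back to an even split (an assumption).
        share = county_deaths / len(employees_by_hospital)
        return {name: share for name in employees_by_hospital}
    return {
        name: county_deaths * employees / total_employees
        for name, employees in employees_by_hospital.items()
    }


# Made-up example: one county with 30 recorded deaths and two hospitals.
shares = allocate_to_hospitals(30, {"Hospital A": 500, "Hospital B": 1000})
print(shares)
```

Under this rule, the larger hospital receives a proportionally larger share of the county total, and the hospital-level values always sum back to the county count.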
Then the modeling team—Nick Altieri and Xiao Li—would just try out so many different ways to do the prediction and develop multiple methods. So, this is—if you talk about stability—not just trying to find one generative model. We were very engineering, machine learning, and signal processing oriented, and built on some earlier work I did 10 years ago at Bell Labs trying to combine different audio signal predictors. So, that was handy. We know how to combine. I would say I know how to combine a bunch of predictors. And then Chandan, my deputy, just started on his own to build a website. He’s really, really handy. Old friends of CS. He could just build a website.
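To make the idea of combining a bunch of predictors concrete, here is a minimal sketch of one common scheme: a weighted average in which each predictor's weight shrinks exponentially with its recent error. The function name, the parameter `c`, and the numbers are hypothetical, and this generic weighting is an assumption made for illustration rather than the team's actual combination method.

```python
import math


def combine_predictions(predictions, recent_errors, c=1.0):
    """Weighted average of predictions; a larger recent error means a smaller weight.

    Hypothetical helper: the weight for each predictor is exp(-c * recent_error).
    """
    if len(predictions) != len(recent_errors) or not predictions:
        raise ValueError("need one recent error per prediction, and at least one prediction")
    weights = [math.exp(-c * err) for err in recent_errors]
    total_weight = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total_weight


# Two made-up predictors forecasting tomorrow's count for one county.
equal_track_record = combine_predictions([100.0, 120.0], [0.5, 0.5])
first_more_accurate = combine_predictions([100.0, 120.0], [0.1, 2.0])
print(equal_track_record, first_more_accurate)
```

When both predictors have performed equally well, the result is their plain average; when the first has been more accurate recently, the combined forecast moves toward it.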
And then we started putting our predictions on the website. Now I felt compelled that we had to explain what we did. Even with shared codes, it’s not that accessible. You cannot ask everyone to read the code to know. And there were a lot of other teams putting out predictions that frustrated us because we didn't know how they did their predictions. I thought, we have to make our algorithm transparent and reproducible and a paper has to go with the code, because code itself is not enough. And then I remembered your invitation. So I went back to you saying that "Now I'm actually writing a paper. Would you be interested?"
XLM: Well, of course, I was extremely happy to hear that, but you also know that the reviewers did give you guys quite a bit of work, asking this and that, which is obviously expected. That's what the reviewers do. And I have to say that editing this special issue itself gave me a lot of respect for both the authors and the reviewers. Everybody was working so hard and under very stringent time constraints, and putting in a lot of effort. I want you to talk a little bit about that, because, as I said, the reviewers asked a lot of questions and we actually went through multiple rounds. How did your students, your co-authors, react to this kind of criticism, because some of it was pretty strong? I won't say they were harsh, but they were pretty demanding. How did you organize a team to respond to all those things, making this process of revising the article also a learning experience for your students?
BY: So basically, the project had kind of two phases. The first was to get the data, get something done, put up the website, write something to document it, so we submitted the paper. Before that, I’d already brought in my book co-author, Rebecca Barter, who was not engaged very much in the modeling and data stage, but I brought her in as the lead writer. Many people contributed, but she was the one who helped organize things while people contributed different parts, and that was very crucial.
We continued to work from May to June. Five of my students, who were the driving forces on the team, started their summer internships, so I basically lost all the people who were really in the trenches doing the most work. Raaz stepped up for the revision. Then the revision came back, and I organized a smaller summer team to continue extending it for hospitalization predictions. In early June, when I came back from medical leave, the reviews came back. For me, they were just fair comments. If we didn't have this time constraint, we probably would have addressed some of the comments before they were asked. So I thought they were just fair comments. My students, I think everyone, felt that way. They were good comments and we had to address them.
XLM: It sounds like your approach to organizing the article is very similar to organizing the work itself: divide and conquer, but have someone to coordinate it and, in the end, put everything together for coherence.
BY: Yes, I mean, I'm kind of in the background. I usually work with the organizer very closely and the organizer works with the students and then I jump in if I feel like I’m needed there. So, I tried to do the last round of checking, and usually I find things. I read all the things, but I let Raaz do the correspondence because it saves me time. I'm not really needed. The last round of comments was on visualization.
XLM: Yes, that was my fault. I delayed it and imposed it on you at the very last minute, and I knew the dataviz editor was very critical as well.
BY: It did improve the visualization a lot. As a good consequence, I ended up running a group meeting to compare the not-so-good visualization with the final visualization. I used that opportunity to really get other people engaged, too, to learn something from our own experience, and I plan to use it for my own class.
XLM: That's good.
BY: Usually I critique other people's visualization, but now I have my own paper’s visualization to critique. And I also know my own project much more inside out, right, instead of other people’s projects: “Oh, they tried to do this, maybe it's not good, but I don't really know why they did what they did.” So I turned this into a lot of positives.
XLM: For people who are not used to writing these articles, the great lesson here is that when you really try to produce a high-quality article itself, you really know the project much better yourself, because you have to dig out all the details, have to put things on the paper, communicate, tie up all the loose ends. There is just a lot more work than most people realize. So, I hope that this is also a very, very educational experience for your students. Did you get any feedback from them? Which part did they like most? Which part did they not like, and which part do they feel they could have done differently?
BY: I haven't had the chance to talk to them in that detail, but we did, two weeks ago, on the visualization. So, my group, we are making a Google Doc, documenting a lot of good culture, in a more transparent manner. So, we have now a mission statement, code of collaboration, how to write a paper, how to do visualization, how to work with each other, and how to share credit. So that is all documented. And we just keep adding to it. I find myself repeating the same thing to different students because they are going through different stages, and the group is also getting pretty big, so I felt like I need some formality. Actually, the idea of making a manual for the new group members came from a visiting professor last year, Yanjun, who came from the University of Virginia. She asked me right before she came whether I have such a thing for her to know what my group is like. And I said, ‘I don't have anything like that.’ So that's the first time I thought maybe engineering groups actually have such documents. She is from CS. So, I learned a lot actually from the engineering professors who are just much better at organizing teams. This was really triggered by Yanjun’s question. After she left, I started working out a document.
XLM: But this naturally comes to my next question. It sounds like this project itself is continuing. What’s the next big step in terms of this project? Are you working on similar ones? Expanding this one’s scope? Where are you going from here?
BY: We decided to do a hospitalization prediction. We had UC hospital data, but we still have not been able to access other hospitals’ data, though we do have county-level hospitalization data. We were improving the prediction scheme in the earlier paper, and we also added a lot more to the website. So a lot of work was done by Danqing and Maya, adding to the website that Chandan and James, another student in the first team, worked on. We also have our map with Professor Kolak from the University of Chicago. She heads a center called Spatial Data Science, and they have a different atlas. And they have a layer coming from us. So we have been working together, and it was introduced by my old colleague, Yandell, from Wisconsin. We connected. We’re still supplying our predictions to one of the layers of their atlas on COVID. That organization really tries to help rural areas with rural health. We felt like we had put a lot of effort into our website, but we didn't really get enough traffic. So, we recently wrote a five-page release to a medical journal just to advertise the website.
XLM: PR is very important.
BY: We started moving a little towards causal inference. We have some clustering work on the website, and Danqing also cleaned up some data on policy changes at the county level, for each county. So, we have the raw data available. We just haven't had the human resources to really have a concrete plan, but it's really motivating. I have always been interested in dynamic systems and dynamic data. And this COVID work really got me into it. I've been thinking a lot, and I'm joining a group of people from Berkeley and MIT for some kind of dynamic AI institute. It's really helped push me to get into dynamic modeling.
XLM: If you had to do all this all over again, which I hope you don't need to, since the situation will be improving, what would be the things that you would have done differently?
BY: I have been thinking about that myself. I mean, for this type of situation, in most real problems, almost all the real problems I worked on, the philosophy is really not optimality. It’s really “Do the best you can, with the resources you have, with the time you have.” If I use that to measure our effort, I’d say that we wouldn’t have done anything differently. We really felt like we put all in for two months, all of us, including my friends and people who supported us. That’s why in the end, with such a rewarding experience, the data team got into something statisticians don't usually go into. I would not have put them on that, to be honest, if the project came our way in a peaceful time.
XLM: Yes, you would not take that.
BY: We didn't have that luxury. That was good, actually, that we didn't. And the modeling team also had to be just—my team usually doesn't work with time-series prediction—starting from very much basic principles. I think all the culture we put in about applied statistics, do the best you can with the skills you have, really worked out. Instead of being afraid, take the risk and we also take the opportunity.
XLM: Well, thank you, Bin. It sounds like you actually have achieved optimality. It's just a constrained one, constrained given the time, given the resources, given the demand, given that we all did not know what to do. I think that's exactly what rapid response teams should do. Do the best you can, save as many lives as you can. But along the way, it sounds like you're not only helping the situation, but you're actually providing a huge pedagogical training opportunity to all your students. And I think that impact itself can last much longer because they will be learning so much from this experience and lead as they move on. So, thank you for this conversation with authors.
And you're listening to the Conversation with Authors from HDSR, where we publish everything data science and data science for everyone.
BY: Thank you, Xiao-Li. It has been a pleasure.
This interview is © 2021 by the author(s). The editorial is licensed under a Creative Commons Attribution (CC BY 4.0) International license (https://creativecommons.org/licenses/by/4.0/legalcode), except where otherwise indicated with respect to particular material included in the article. The article should be attributed to the authors identified above.