
In the Academy, Data Science Is Lonely: Barriers to Adopting Data Science Methods for Scientific Research

Published on Apr 30, 2024

Abstract

Data science has been heralded as a transformative family of methods for scientific discovery. Despite this excitement, putting these methods into practice in scientific research has proved challenging. We conducted a qualitative interview study of 25 researchers at the University of Michigan, all scientists who currently work outside of data science (in fields such as astronomy, education, chemistry, and political science) and wish to adopt data science methods as part of their research program. Semistructured interviews explored the barriers they faced and strategies scientists used to persevere. These scientists quickly identified that they lacked the expertise to confidently implement and interpret new methods. For most, independent study was unsuccessful, owing to limited time, missing foundational skills, and difficulty navigating the marketplace of educational data science resources. Overwhelmingly, participants reported isolation in their endeavors and a desire for a greater community. Many sought to bootstrap a community on their own, with mixed results. Based on their narratives, we provide preliminary recommendations for academic departments, training programs, campus-wide data science initiatives, and universities to build supportive communities of practice that cultivate expertise. These community relationships may be key to growing the research capacity of scientific institutions.

Keywords: scientific research, computing education, collaboration, mentorship, career paths, academic


Media Summary

Data science may offer new tools and methods to accelerate scientific discovery. Yet for many scientists, the road to adopting new techniques is bumpy. Scientists are typically experts in their research domain (such as astronomy, education, chemistry, or political science) and may have limited training in foundational data science concepts and techniques. Data science can also require materials and tools that labs are unfamiliar with, like systems for managing large data sets and training complex predictive models.

To better understand how scientists cope with these challenges, we interviewed 25 scientists at the University of Michigan who hoped to make scientific advancements with techniques from data science. We find that many scientists try to learn independently using freely available material (like videos, online classes, and blogs), but are usually unable to translate those lessons into applications in their own research. Many experienced ‘hitting a wall’ in debugging their code or understanding how a technique works.

Scientists coped by trying to assemble an informal community of helpful experts. Some sought mentors who could help them choose which data science methods were most relevant to their research questions. Others recruited students and research staff to the lab to assist with programming and data management. Finding and nurturing these connections was time-consuming and laborious, particularly because different scientific fields can have very different cultures, languages, and practices.

This pilot investigation of the barriers facing scientists at a major research university suggests several strategies for scientific institutions to support researchers: investing in mentored training opportunities for scientists, developing more pathways for long-term research support staff, and organizing community-building events and activities for researchers.


1. Dreams and Promises of Data Science in the Academy

The promise of modern data science to advance research has generated palpable excitement in the scientific community. Proponents emphasize possibilities to create data sets on previously unimagined scales (Grossi et al., 2021; Yang et al., 2017), improve statistical models of complex and high-dimensional phenomena (Frank et al., 2020), generate new hypotheses from old data with pattern-seeking algorithms (Anderson, 2008), and translate vast biomedical data sets into clinically useful decision-making tools via predictive modeling (Topol, 2019).

While every day seems to bring new open source software libraries and learning materials that lower the barrier to entry to data science—as well as renewed hype about its applications—excitement about data science is not new to the academy. More than a decade has passed since Anderson (2008) forecasted a paradigm shift from inference-driven investigations of small, curated data sets to prediction-driven, machine-learning-centric ingestion of massive data sets. And many modeling approaches garnering renewed interest among data science practitioners are decades old (for review, see James et al., 2013).

Setting aside the question of whether the data-centric paradigm can supercharge scientific discovery, groups who wish to enter the fray face several practical barriers. Scroggins et al. (2020) presciently identified several ‘thorny’ problems that impede researchers as they scale up the complexity of data and modeling pipelines, including low funding for long-term technical infrastructure, high research staff turnover, and unmet needs for continuing education. Importantly, these barriers are institutional in nature—they are not located solely at the level of an individual research group but are baked into the larger structures and mechanisms of research organizations.

Industry has also grappled with hurdles in implementing data science methods, both internally and as part of products (Sculley et al., 2014, 2015). At Microsoft, Amershi et al. (2019) identified problems making data sets discoverable and manageable, cultivating new data science skills in existing teams, and understanding the behavior of increasingly complex and interdependent data pipelines. If these issues can arise at some of the most financially resourced tech companies, we should not expect the typical academic research group to skirt them.

Of course, science researchers have different priorities and incentives than a tech company; scientists are expected to contribute to their community’s cumulative knowledge through ‘products’ like papers (and to an increasing but still limited extent, reusable data sets and code). But many scientific fields are experiencing shaken trust in the value of papers, which may suffer from low replicability and reproducibility (Bush et al., 2020; National Academies of Sciences, Engineering, and Medicine, 2019; Willis & Stodden, 2020). It is easy to imagine data science methods exacerbating this problem: because increasingly complex data sets, code, and models require substantial technical infrastructure, expertise, and investment in management and documentation, they are likely even less shareable as knowledge products than a data set that can be zipped up in an email or a statistical model that can be written in one line of algebra (Tatman & Vanderplas, 2018). It remains to be seen how exactly these developments will affect the daily and long-term work of science.

We conducted the present study to gain a closer understanding of how science researchers confront the many barriers associated with adopting modern data science methods as tools for discovery. Through ethnographic interviews, we investigated how researchers in fields like astronomy, psychiatry, biology, economics, and education view the promises of new methods, while coping with inadequate time and support to learn them. Crucially, our interviews revealed profound isolation among participants who find themselves without guidance from their home field, while unable to navigate data science as a newcomer. This was particularly surprising considering that we studied scientists at a major research university, the University of Michigan, with a well-funded campuswide data science institute (the Michigan Institute for Data Science) that supports a network of more than 450 affiliated faculty members and offers year-round community events. It is notable that substantial barriers can arise even in this exceptionally supported academic context. Despite the formidable challenges our interviews revealed, they also suggest promising opportunities to better support scientists.

2. Our Investigation

2.1. High-Level Breakdown of Our Strategy

Our study targeted scientists applying, or trying to apply, data science methods to their existing domain of research. For example, a target could be a social scientist learning to use social media data sets to test theories or a psychologist adopting convolutional neural networks for processing neuroimaging data. We sought participants from many colleges and departments at the University of Michigan, excluding researchers who primarily work in mathematics, applied statistics, or computer science.

Using campus email lists and physical advertisements, we recruited participants who were interested in using, currently exploring, or already using techniques they would describe as machine learning, data science, or artificial intelligence.1 Participants were asked to provide a few sentences describing their work with (or interest in) data science. From these vignettes, and with consideration for recruiting across several research domains and career stages, we invited participants for interviews. A detailed description of our recruitment and participant selection procedure is available in our Supplemental Methods.

It is important to note some limitations inherent in this study design. First, we sought participants based on self-reports that they were doing (or exploring) data science. Although we asked questions both in the screening survey and in the interview about how participants were using data science—which gave us a window into their understanding of what specific activities they defined as ‘data science’—we did not begin recruitment with a strict definition based on usage of a particular statistical method, software library, or technology.

Broadly, interviewees shared a desire to increase the complexity of their data collection, processing, or analysis techniques and were using (or trying to learn) scientific computing for this purpose. Yet because of our sampling strategy, it may be challenging to map our findings onto specific research activities (for example, using the scikit-learn library, or managing terabyte-scale data). There are surely many subcommunities of researchers under the broad banner of ‘academic data science,’ and our study cannot distinguish between them.

Additionally, our sample is not random, as we recruited participants through a mailing list and offered $40 in compensation for interviews. This procedure likely selects for researchers with more flexible schedules, availability for an hour-long interview, personal interest in data science, and/or monetary interest. We might therefore expect overrepresentation of relatively early career researchers and data science enthusiasts, and underrepresentation of senior faculty and researchers in areas where the term ‘data science’ is not widely used. Findings should therefore not be interpreted as a survey of scientists, but as a preliminary investigation into the experiences of a handful of researchers with pronounced personal interest in data science.

2.2. High-Level Breakdown of Our Participant Demographics

We identified and interviewed 25 researchers, encompassing a range of career stages and tracks—including pre- and postdoctoral researchers, tenure- and nontenure-track faculty, and permanent research staff (Table 1). Similarly, our researchers’ experience with data science ranged from tentative exploration to mature research pipelines published several times over.

Table 1. Researcher roles.

Position                    Frequency
Postdoctoral researcher     3
Graduate student            5
Faculty                     8
Staff                       9


Table 2. Research areas.

Broad Research Area              Frequency
Physical sciences/engineering    8
Social sciences                  7
Life sciences                    10

A detailed table of participant demographics, including staff job titles, gender, and years of research experience, is available in the Supplement (Table S1).

Researchers in life sciences—particularly those affiliated with the medical school—were especially strongly represented in our sample (Table 2). Fewer than half (11) of the participants reported that they had successfully published scientific research involving data science methods at the time of the interview. Most participants (14) described themselves as actively exploring or attempting to adopt data science methods.

There were 14 men and 11 women; no participants listed themselves as nonbinary or another gender. Participants were also given an option to self-describe any traits or identities that may be relevant to their experience of scientific work. One participant shared that they were transgender, six shared being the first generation of their family to attend college, and three shared that they were immigrants.

2.3. Transcript Analysis

Following the grounded theory approach, we analyzed interview transcripts for major themes that were identified and refined in several rounds of coding (Saldaña, 2016). Both authors participated in the coding process, reviewing one another’s codes and discussing any uncertainties about code assignments to reach a mutual decision. More details of our transcript coding method are in the Supplemental Methods.

3. The Bottleneck to Adoption

Researchers were broadly enthusiastic about the promises of data science methods.2 Some suspected their existing data sets contain unidentified and unforeseen patterns, which might be revealed through unsupervised learning techniques—for these researchers, data was a resource that had been incompletely mined for scientific value. Others anticipated that novel sources of data, such as unstructured text from public-facing websites and satellite images from projects like Google Earth, could provide a new testing ground for social scientific hypotheses. Several were attracted to the promise of new funding opportunities at the university and through federal agencies for proposals involving data science techniques.

But researchers’ enthusiasm for these methods was quickly tempered by the realities of acquiring a new skill set. Because our study featured researchers who were relative pioneers in their fields, they were unlikely to have formal training routes available to them. For most, the pathway was necessarily self-guided and informally bootstrapped from the resources at hand.

Researchers frequently turned to free and open source content online, such as YouTube video tutorials, blogs like Towards Data Science, and interactive coding sandboxes. They enrolled in free online courses offered through Coursera, Udemy, and LinkedIn. They picked up practical guidebooks about Python and R data science libraries and worked through exercises independently on weekends. A few participated in short-term workshops such as Software Carpentry (Teal et al., 2015; Wilson, 2006). Some predoctoral researchers enrolled in relevant University of Michigan courses outside their home department, though they did not always count toward the degree requirements of their PhD program.

3.1. The Perils of Learning Alone

Independent study was rarely satisfying for researchers, though. Perhaps reassuringly, many expressed they would be unwilling to publish a result using methods they do not personally understand. This high standard was not achievable for many participants working on their own for a variety of reasons.

First, undertaking this independent study was simply too time-intensive. Because modern data science requires a baseline competency in at least some statistics, algebra, programming, computing, and data management skills, it demands more than a casual commitment (Conway, 2010). But for every researcher at every career stage, time spent exploring a new method—which may or may not yield returns in the form of publications or grants—is time away from more immediate duties with more certain rewards. One faculty member reported, “You have faculty who haven't done this, and they want to start. But faculty have very limited schedules.”3 For predoctoral researchers, there may even be discouragement or lack of interest on the part of their supervisor, meaning self-study can detract from progress toward graduation.

Second, the apparent wealth of data science lessons available freely online is disorienting for new learners. While it was trivial to find and practice toy problems in browser-based coding sandboxes, researchers were often unsure which methods, software libraries, and skills were relevant to their scientific work. This may reflect a tendency for online learning materials and communities to center the training of industry data scientists, whose backgrounds and goals differ from those of academic researchers.

Furthermore, upon hitting software bugs, learners were often unable to get themselves unstuck. One coping strategy that arose several times in interviews involved enlisting a domestic partner with some programming experience as a ‘debugging buddy.’ For those without an in-home bug specialist, though, broken code was a major strain on time and morale. Even for researchers who sailed through online coding sandbox problems (hosted in a browser-accessible Jupyter environment, for example), translating the code recipes to their real data on their own lab computers elicited new and stickier errors. In the words of one participant, “Very basic algorithms, I can do those. But there’s a gap between the examples and how I apply them in my study.”

For all these reasons, independent study rarely culminated in applications of new techniques to research questions. While each researcher’s commitment to independent study differed, several had persisted for months without materially advancing their research agenda. Based on these personal reflections, it seems unlikely that researcher dedication or tenacity is a bottleneck.

Again and again, researchers asked not for better textbooks, coding examples, classes, or software tools, but for input and expertise from a broader community. They tried to bootstrap this expert community through several relationships, to various degrees of success.

3.2. Mentors and Collaborators

Many researchers in our study specifically sought a collaborator or mentor (particularly if they were relatively junior in their career). Regardless of the exact title our informants used, they frequently spoke of searching for an intellectual partnership with an expert advisor. The ideal mentor or collaborator, though, has rare qualifications. Researchers seek guidance not solely on the correctness of their analyses—there are several programming and statistical consulting help desks around campus that already offer this sort of advising—but also on their scientific merits. They want help navigating the marketplace of data science methods to choose approaches and complementary data sets that will yield new ‘fruit’ for their research domain. It is not interesting or useful to the average researcher to correctly implement a machine learning method that reveals only long-understood patterns, or, alternatively, a model that predicts well but through an inscrutable mechanism. The expert researchers most desired would bring a dual perspective, spanning both data science methods and promising research questions in their home domain. But few such ‘connectors’ exist. One participant, a scientist in educational research, said:

I can talk to education people about the ideas, and I can talk to data science folks about the methods, but you're not going to get critical feedback on how those two things intersect a lot because people only know one side.

Although in theory, such a perspective might be found among the many applied research experts in computer science or statistics, participants reported major difficulties establishing cross-disciplinary relationships. One reason may be that academic computer scientists and statisticians are frequently incentivized to push the boundaries of existing methods in their publications, whereas the scientists in our study aim to apply relatively well-established methods from data science to new contexts. While there are surely problems that are mutually rewarding for both parties, they are probably rarer than the scientists in our study would like (and may require an investment in many conversations to surface them).

When researchers did manage to make a cross-disciplinary connection (often after a tedious search involving many cold-email attempts), practical barriers to collaboration quickly arose. At the most basic level, researchers in different disciplines rely on different tool stacks—before data science even enters the conversation, collaborators must agree on GitHub or Dropbox, Google Docs or LaTeX. The overhead to merely establish the terms and strategy of a collaboration led many to fizzle in early planning stages.

Partnerships in which a domain expert effectively outsourced all the programming and data science methods to a collaborator were rarely satisfying to the domain expert. Researchers lamented a lack of transparency into what, exactly, their counterpart was doing, or whether it was done correctly. Participants reported being unable to interpret both their collaborator’s code and the conceptual framework behind it. In one example, a doctoral trainee described dissatisfaction with her role in a project where data was supplied to her via a machine learning model run by another teammate: “It just feels like this black box mystery that I get data from.” For researchers who wished to audit or check their partner’s analysis (perhaps by creating test cases), the testing loop could be frustratingly inefficient, requiring several meetings just to coordinate and share results. In one case, a researcher simply refused to pursue a publication with a collaborator after the analysis was complete, as the researcher could not sufficiently understand what the analysis meant.

The most successful examples of data science advising and collaboration in this study all involved an experienced guide in the same field. We saw no examples of lasting partnerships with collaborators entirely outside the informant’s domain, although our study is too small to make any reliable estimates about the broader prevalence of any collaboration patterns.

3.3. Students

For many labs, an appealing option to add data science skills to a project is through student research assistants and volunteers. At a research university such as the University of Michigan, there are thousands of students with advanced data science skills and personal interest in gaining research and internship experience (computer science has been the most popular undergraduate major for several years now; Bimer & Katterman, 2022).

Unfortunately, although researchers found it easy to recruit students as volunteers and short-term employees to analyze data sets and build software tools for the lab, it was far more challenging to get valuable research results from these relationships. Students frequently turned over notebooks of code using languages, libraries, and techniques that mentors were unable to interpret. It may require several meetings with a student to understand the mathematical methods they used in an analysis, the appropriateness of the method for the research problem, and whether their code correctly performs the intended analysis. Researchers also reported that students are prone to producing results with little scientific validity or usefulness. Because they lack sufficient domain knowledge to interrogate the plausibility of results, students cannot perform these checks independently. Curiously, two competing beliefs arose in our interviews: some felt it was easier to train expert programmers in the lab’s scientific domain, while others preferred to teach students from their home field how to code.

From the student’s perspective, the challenges can be just as onerous. One student informant preferred to develop analysis pipelines in R, which offered several advantages for implementing data science methods over the proprietary point-and-click software favored by the lab’s principal investigator (PI). Yet the PI required the student to translate his analyses into the point-and-click workflow for review, which roughly doubled the work for the student.

In general, students were often with labs for too little time to alter the software workflows of a lab or create knowledge that persists after their employment. As soon as the student left the lab, the value of their work rapidly degraded, with no one available to explain their results or guide new users of their code. Among our participants, the toolkit of a lab appears largely set by the PI, and software and methods introduced by fixed-term student researchers were rarely permanent additions.

Notably, two of the most data science–sophisticated research groups represented in the study (including the only lab regularly using Docker images and cloud computing) had an especially high ratio of permanent research staff to students. This pattern is consistent with a perspective recently voiced by Scroggins et al. (2020) on the essential role of skilled, long-term data professionals in maintaining the technical infrastructure required for data-centric research. Ethnographic research has established that there can be considerable invisible labor—labor in which the work, worker, or both are somehow obscured (Star & Strauss, 1999)—involved in ‘cleaning’ and ‘curating’ data sets for scientific use (Plantin, 2018; Thomer et al., 2022). While less attention has been paid to the invisible labor of statistical programming for data analysis or maintaining computational infrastructure in research groups, these endeavors surely also involve skilled work that remains underspecified and undervalued.

3.4. Research Staff

Permanent staff positions, such as research software developers, engineers, and technical system administrators, provide another avenue to build in-house expertise. But besides a few exceptionally well-funded research centers represented in our study, these positions tended to have many of the same limitations as student assistantships in practice.

While these jobs are designed to be long-term, our interviews turned up numerous examples of early career scientists taking the positions as a steppingstone to graduate school or better paying jobs in industry. In labs where the position was filled by a more senior data professional, this arrangement was often the product of circumstantial constraints—for example, an individual seeking a local job after their spouse landed work at the university.

Researchers described challenges in hiring research staff with strong data science skills, particularly due to the lower salary offerings at a university compared with industry. Indeed, three of the informants in this study in research staff roles have left their jobs since they were interviewed. These research staff were all exceptionally skilled in modern data science methods. Two were developing reusable software libraries that would standardize data handling practices for their labs.

Unfortunately, there is yet again evidence that these researchers’ creations—in the form of code, scientific results, or practical knowledge—do not tend to persist long in the collective knowledge of a research group after the creator leaves. One of the authors of a software library for his lab recounted unsuccessful attempts to onboard lab mates: though he had taken care to write documentation, other scientists in the group did not know how to read documentation or use it to debug their own implementations of his code. Without his firsthand guidance, the software posed too high a barrier to entry. When pressed about why he built such a library in the first place, the author described the endeavor as a self-motivated project intended to improve the quality of data and code management in the lab.4

At present, research staff may be a force for stability in their labs—but turnover appears common, particularly as data-centric skills command significantly higher salaries elsewhere. Furthermore, without buy-in from lab leaders, the infrastructure created by research staff may be ephemeral.

3.5. Campus Community

At the University of Michigan, there are numerous help desks, consulting groups, workshops, and seminar series for researchers who incorporate programming, statistics, and other aspects of data science into their work. There is also a world-class data science institute with a mission to support researchers (MIDAS, the Michigan Institute for Data Science), which hosts community events for networking, skill development, and presenting research. According to many of our informants, though, the variety of campus programs can itself be a barrier—a common theme in interviews was difficulty navigating which service or entity to approach and for what.

Though nearly all participants recognized MIDAS in our interviews, few had attended a MIDAS event. Researchers tended to doubt that they were doing data science at the level appropriate for a data science center, and often self-selected out of participating. The matter of belonging goes even deeper than questioning one’s technical skills: our informants generally did not see themselves as data scientists at all. This lack of identification appears unrelated to the researcher’s actual skills and accomplishments.

A more practical challenge was identifying which events were relevant to a researcher’s work, as advertisements for data science events frequently contained unfamiliar terminology. Many researchers were simply unwilling to gamble their limited time on events that might prove insufficiently related to their research.

3.6. A Note on Culture and Gender

In our interviews, experiences differed along gendered lines. Women (both cis- and transgender) noted substantially more negative experiences interacting with instructors, classmates, colleagues, lab mates, and help desks about data science topics. Women reported being spoken down to and underestimated when they sought feedback, guidance, and training. Some described heightened opposition and challenges from male peers when they participated in advanced programming and statistical training. As one participant described,

If a woman makes a mistake—if a minoritized individual makes a mistake—they have to prove they’re not dumb. When if someone else makes the mistake, the assumption is, ‘Oh, that’s a one time thing, it’s okay.’ That has happened in almost every interaction about coding I’ve had.

This is perhaps unsurprising considering the large literature on gendered and racialized discrimination in the history of computing (Margolis & Fisher, 2003; McGee et al., 2022; Yamaguchi & Burge, 2019). Notably, though, none of our participants were computer scientists—and most belonged to departments with roughly even (or favorable to women) gender balances, according to statistics kept by the University of Michigan. When pioneering scientists seek to add data science methods to their home research culture, they may have to confront the entrenched cultures of fields like computer science and mathematics. These experiences were often discouraging for the researchers in our study and slowed—or even halted—their exploration.

Our study design and sample are inadequate to speak as clearly to the experiences of racially minoritized scientists. That said, we anticipate that well-studied patterns of bias and discrimination in the sciences would continue to occur in this context, too.

4. What to Do About It

The barriers our participants reported are complex, and most lack simple fixes. But our conversations indicated several promising directions for organizations to enhance their support for researchers over time. Again, we wish to emphasize the provisional nature of these suggestions—further research will be needed to establish their effectiveness and generalizability to other universities.

4.1. Teach Different Computing Skills

Nearly all researchers in our study had taken a university coding class at some point in their education, but many still reported insurmountable barriers to adopting new software libraries and computing methods. Current scientific computing classes may be overly focused on producing clearly defined results in a narrow suite of tools, such as recipes for common statistical tests. Rather than emphasizing ‘correct’ procedures in a language or tool, classes might emphasize foundational processes such as debugging, thinking algorithmically, and testing data and code for expected patterns and behaviors (for another perspective, see Connolly et al., 2023). These techniques may be reinforced throughout several courses in a scientific training sequence, even those that do not explicitly involve writing code (similar to recommendations by Haas et al., 2019). We expect this emphasis will better support researchers as they independently explore new programmatic tools during their careers.
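To make this concrete, the short sketch below (in Python with pandas) illustrates what ‘testing data and code for expected patterns’ can look like in practice; the data set, column names, and thresholds are hypothetical examples rather than material drawn from our interviews.

    # A minimal sketch of encoding expectations about a hypothetical data set
    # as explicit checks; all names and thresholds are invented for illustration.
    import pandas as pd

    def check_trial_data(df: pd.DataFrame) -> pd.DataFrame:
        """Run basic sanity checks on a (hypothetical) behavioral data set."""
        expected_cols = {"subject_id", "condition", "reaction_time_ms"}
        missing = expected_cols - set(df.columns)
        assert not missing, f"Missing expected columns: {missing}"

        # Reaction times should be positive and within a plausible range.
        rt = df["reaction_time_ms"]
        assert (rt > 0).all(), "Found non-positive reaction times"
        assert rt.max() < 10_000, "Found implausibly slow reaction times (>10 s)"

        # Every subject should appear in every condition.
        per_subject = df.groupby("subject_id")["condition"].nunique()
        assert (per_subject == df["condition"].nunique()).all(), \
            "Some subjects are missing one or more conditions"
        return df

    # Toy usage: the checks pass silently, or fail loudly with a message.
    toy = pd.DataFrame({
        "subject_id": [1, 1, 2, 2],
        "condition": ["A", "B", "A", "B"],
        "reaction_time_ms": [350.0, 420.5, 298.2, 610.7],
    })
    check_trial_data(toy)

The value of such an exercise lies less in the particular assertions than in the habit of writing down one’s expectations about data and code, so that violations surface as explicit errors rather than silent analysis problems.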

4.2. Emphasize Processes, Not Identities

Initiatives to grow research capacity—such as campus data science centers and federal funding opportunities—will benefit from including scientists from a broad array of disciplines. Yet many scientists do not view themselves as doing ‘data science’ at the appropriate level and will count themselves out of such initiatives, regardless of how relevant their work may truly be. Similarly, many who are transitioning to larger data sets will argue that their data is not ‘big’ enough to be considered ‘big data,’ as if there were a universally agreed upon threshold (Borgman, 2020).

To recruit and retain scientists working on topics like ecology, education, urban planning, and political science (to name just a few), data science initiatives need to take care to cast a broad net. Advertisements and invitations should go beyond terms like ‘data science,’ ‘big data,’ ‘machine learning,’ and ‘eScience.’ Rather than emphasize scientific identity, messaging can emphasize processes (e.g., learning about new analysis methods for a given data type, managing research data sets, or building code pipelines) that will appeal across disciplines. As much as possible, messaging should explicitly welcome a broad range of disciplines and experience levels.

A practical strategy for positioning events and workshops may be to focus on common problems. For example, the matter of merging two or more imperfect data sets (sometimes called ‘data harmonization’) is common across disciplines. A campus data science institute could hold workshops or seminars where scientists share their strategies for overcoming data harmonization problems. This paradigm differs from the typical research talk, where scientists present their research narratives in domain-specific terms.
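As one concrete illustration of the kind of shared problem such a workshop could center on, the sketch below harmonizes two small, hypothetical survey tables (all column names, units, and values are invented) by mapping them onto a common schema, converting units, and flagging disagreements before merging.

    # A minimal, hypothetical sketch of harmonizing two imperfect survey tables
    # before merging; all column names, units, and values are invented.
    import pandas as pd

    site_a = pd.DataFrame({
        "participant": ["p01", "p02", "p03"],
        "age_years": [34, 51, 29],
        "income_usd": [52000, 61000, 45000],
    })
    site_b = pd.DataFrame({
        "id": ["p02", "p03", "p04"],
        "age": [51, 29, 47],
        "income_k_usd": [61, 44.5, 58],  # recorded in thousands of dollars
    })

    # Step 1: map both tables onto a shared schema (names and units).
    site_a = site_a.rename(columns={"participant": "participant_id"})
    site_b = site_b.rename(columns={"id": "participant_id", "age": "age_years"})
    site_b["income_usd"] = site_b["income_k_usd"] * 1000
    site_b = site_b.drop(columns=["income_k_usd"])

    # Step 2: outer-join so that records present at only one site stay visible
    # rather than being silently dropped.
    merged = site_a.merge(site_b, on="participant_id", how="outer",
                          suffixes=("_site_a", "_site_b"), indicator=True)

    # Step 3: flag disagreements between the two sources for manual review.
    overlap = merged[merged["_merge"] == "both"]
    conflicts = overlap[
        (overlap["age_years_site_a"] != overlap["age_years_site_b"])
        | (overlap["income_usd_site_a"] != overlap["income_usd_site_b"])
    ]
    print(conflicts[["participant_id", "income_usd_site_a", "income_usd_site_b"]])

The details differ across disciplines, but the underlying steps of agreeing on a schema, reconciling units and codings, and auditing conflicts recur widely, which is what makes harmonization a promising anchor for cross-disciplinary events.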

4.3. Invest in Within-Community Expertise

In our study, the most successful collaborations between data science–curious researchers and data science experts were within the same (or closely related) scientific field. This is likely due to a combination of factors, including lower overhead to collaboration, better-aligned research incentives, and higher quality mentorship about the scientific validity and usefulness of new methods, compared to experts from more distant fields. Notably, within-discipline data science experts provided meaningful support even through relatively infrequent meetings, where they could act in an advisory capacity. This suggests research fields may need only a handful of ‘connectors’ with the data science community to yield major dividends. Professional societies may play a role in encouraging and cultivating this expertise. Funding initiatives could also enable training opportunities for a few scientists with strong domain expertise to develop a data science skill set.

4.4. Enrich Career Paths for Research Data Scientists

Research staff such as engineers, data managers, and developers provide critical knowledge, stability, and support for labs with complex data sets and analysis pipelines (Borgman et al., 2014; Scroggins et al., 2020). Yet our interviews suggest that these jobs do not offer sufficient pay or advancement opportunities to keep highly skilled data professionals in the academy, even at a major research university. Creating career tracks for long-term research data scientists at universities may help prevent the brain drain. These positions should target professionals with a mixture of data science skills and domain knowledge. To be attractive to individuals like our study participants, they should have pathways to promotion, reasonably competitive salary and benefits, and opportunities for intellectual independence and fulfillment.

5. Limitations of This Research

Although our interviews were quite informative, our findings should be interpreted with a few considerations in mind. First, our sample is limited to one university—specifically, a large, public, R1 (very high research activity) school in the United States with an active campus data science initiative. Our findings may not generalize straightforwardly to different academic settings. Our sample is also relatively small and nonrandom. As such, it should not be taken as a survey of researchers’ attitudes broadly, but as a collection of narratives from scientists at an exceptionally well-supported research university.

Second, our interviews were all conducted before a wave of excitement about large language models, which may offer researchers new tools to support both independent learning and programming (though this remains to be assessed).

Finally, our research does not address whether data science methods are a worthwhile investment for most research groups. It is not obvious that the hoped-for scientific value of these methods will be realized. Indeed, it is easy to imagine that for some research projects in the study, techniques like unsupervised learning may turn up only well-known patterns, or predictive modeling with deep learning may not exceed what can be accomplished with longstanding (and far less complex) models. The scientific merit of pursuing new approaches will likely have to be evaluated within each research domain by practitioners.

Although it remains possible that some data science methods will offer limited scientific value to researchers, the very act of exploring them may bring researchers into increased contact with strong foundational software practices that proliferate in the open source world (such as coding for readability, writing documentation, and version control). Similarly, entertaining larger data sets may push researchers to develop stronger data management practices that benefit even ‘small’ data management. We are optimistic that these ventures will reward researchers—though perhaps not exactly in the ways they expected.


Acknowledgments

We thank each of our participants for their time and interest in this project. Thank you to the Alfred P. Sloan Foundation’s Digital Technology team and grant reviewers for valuable feedback in developing this study.

Disclosure Statement

This research was funded by a grant from the Alfred P. Sloan Foundation (G-2021-17107).


References

Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B., & Zimmermann, T. (2019). Software engineering for machine learning: A case study. In Proceedings - 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice, ICSE-SEIP 2019 (pp. 291–300). IEEE. https://doi.org/10.1109/ICSE-SEIP.2019.00042

Anderson, C. (2008, June 23). The end of theory: The data deluge makes the scientific method obsolete. Wired Magazine.

Bimer, T., & Katterman, L. (2022). The Michigan Almanac. Regents of the University of Michigan.

Borgman, C. L. (2020). Big data, little data, or no data? Why human interaction with data is a hard problem. In H. O’Brien & L. Freund (Eds.), CHIIR '20: Proceedings of the 2020 Conference on Human Information Interaction and Retrieval (p. 1). ACM. https://doi.org/10.1145/3343413.3377979

Borgman, C. L., Darch, P. T., Sands, A. E., Wallis, J. C., & Traweek, S. (2014). The ups and downs of knowledge infrastructures in science: Implications for data management. In Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (pp. 257–266). IEEE. https://doi.org/10.1109/JCDL.2014.6970177

Bush, R., Dutton, A., Evans, M., Loft, R., & Schmidt, G. A. (2020). Perspectives on data reproducibility and replicability in paleoclimate and climate science. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.00cd8f85

Connolly, A., Hellerstein, J., Alterman, N., Beck, D., Fatland, R., Lazowska, E., Mandava, V., & Stone, S. (2023). Software engineering practices in academia: Promoting the 3Rs—Readability, resilience, and reuse. Harvard Data Science Review, 5(2). https://doi.org/10.1162/99608f92.018bf012

Conway, D. (2010, September 30). The data science Venn diagram. drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Frank, M., Drikakis, D., & Charissis, V. (2020). Machine-learning methods for computational science and engineering. Computation, 8(1), Article 15. https://doi.org/10.3390/computation8010015

Grossi, V., Giannotti, F., Pedreschi, D., Manghi, P., Pagano, P., & Assante, M. (2021). Data science: A game changer for science and innovation. International Journal of Data Science and Analytics, 11(4), 263–278. https://doi.org/10.1007/s41060-020-00240-2

Haas, L., Hero, A., & Lue, R. (2019). Highlights of the National Academies report on “Undergraduate data science: Opportunities and options.” Harvard Data Science Review, 1(1). https://doi.org/10.1162/99608f92.38f16b68

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning: With applications in R. Springer.

Margolis, J., & Fisher, A. (2003). Unlocking the clubhouse: Women in computing. MIT Press.

McGee, E. O., Botchway, P. K., Naphan-Kingery, D. E., Brockman, A. J., Houston, S., & White, D. T. (2022). Racism camouflaged as impostorism and the impact on Black STEM doctoral students. Race Ethnicity and Education, 25(4), 487–507. https://doi.org/10.1080/13613324.2021.1924137

National Academies of Sciences, Engineering, and Medicine. (2019). Reproducibility and replicability in science. National Academies Press.

Plantin, J. C. (2018). Data cleaners for pristine datasets: Visibility and invisibility of data processors in social science. Science, Technology, & Human Values, 44(1), 52–73. https://doi.org/10.1177/0162243918781268

Saldaña, J. (2016). The coding manual for qualitative researchers. Sage.

Scroggins, M. J., Pasquetto, I. V., Geiger, R. S., Boscoe, B. M., Darch, P. T., Cabasse-Mazel, C., Thompson, C., Golshan, M. S., & Borgman, C. L. (2020). Thorny problems in data (-intensive) science. Communications of the ACM, 63(8), 30–32. https://doi.org/10.1145/3408047

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., & Young, M. (2014). Machine learning: The high-interest credit card of technical debt. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & R. Garnett (Eds.), NIPS 2014 Workshop on Software Engineering for Machine Learning (SE4ML) (pp. 2494–2502). IEEE, NeurIPS.

Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J. F., & Dennison, D. (2015). Hidden technical debt in machine learning systems. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 28 (NIPS 2015) (pp. 2503–2511). IEEE, NeurIPS.

Star, S. L., & Strauss, A. (1999). Layers of silence, arenas of voice: The ecology of visible and invisible work. Computer Supported Cooperative Work (CSCW), 8, 9–30.

Tatman, R., & Vanderplas, J. (2018). A practical taxonomy of reproducibility for machine learning research. Paper presented at the 2nd Reproducibility in Machine Learning Workshop at ICML 2018, Stockholm, Sweden.

Teal, T. K., Cranston, K. A., Lapp, H., White, E., Wilson, G., Ram, K., & Pawlik, A. (2015). Data carpentry: Workshops to increase data literacy for researchers. International Journal of Digital Curation, 10(1). https://doi.org/10.2218/ijdc.v10i1.351

Thomer, A. K., York, J. J., Tyler, A. R. B., Yakel, E., Akmon, D., Polasek, F., Lafia, S., & Hemphill, L. (2022). The craft and coordination of data curation: Complicating workflow views of data science. Proceedings of the ACM on Human-Computer Interaction, 6(CSCW2), Article 414. https://doi.org/10.1145/3555139

Topol, E. J. (2019). High-performance medicine: The convergence of human and artificial intelligence. Nature Medicine, 25(1), 44–56. https://doi.org/10.1038/s41591-018-0300-7

Willis, C., & Stodden, V. (2020). Trust but verify: How to leverage policies, workflows, and infrastructure to ensure computational reproducibility in publication. Harvard Data Science Review, 2(4). https://doi.org/10.1162/99608f92.25982dcf

Wilson, G. (2006). Software carpentry: Getting scientists to write better code by making them more productive. Computing in Science and Engineering, 8(6), 66–69. https://doi.org/10.1109/MCSE.2006.122

Yamaguchi, R., & Burge, J. D. (2019). Intersectionality in the narratives of black women in computing through the education and workforce pipeline. Journal for Multicultural Education, 13(3), 215–235. https://doi.org/10.1108/JME-07-2018-0042

Yang, C., Huang, Q., Li, Z., Liu, K., & Hu, F. (2017). Big data and cloud computing: Innovation opportunities and challenges. International Journal of Digital Earth, 10(1), 13–53. https://doi.org/10.1080/17538947.2016.1239771


Data Repository/Code

A GitHub repository with a data file to reproduce participant demographic tables, plus the semistructured interview guide, is available at https://github.com/elleobrien/data_science_is_lonely.


Supplementary Files

See Supplement for detailed methods and participant demographics.


©2024 Elle O’Brien and Jordan Mick. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
