Date: 2019-11-01

The supplementary materials section contains a detailed explanation of the qualitative research methods used in CENS and DataFace, a summary of the findings for each research question, and a full bibliography that includes references not cited in the main text of the article.

1. Research Methods Section

1.1. Ethnographic Observation and Document Analyses

Ethnographic work reported in this article includes observations of activities in laboratories and in field deployments, laboratory and community meetings, and other events. Data practices researchers also observed CENS (Center for Embedded Networked Sensing) and DataFace community members during formal gatherings such as research reviews and retreats, weekly research seminars, and informal gatherings such as discussions within the lab and offices of CENS and DataFace. Throughout our engagement with these communities, we gathered public and private documentation of their work, ranging from publications and websites to equipment specifications, lab notes, and working documentation provided by our research subjects. In both consortia, we conducted about two years of ethnographic observation to understand their data practices before designing and conducting interviews. We returned to our ethnographic notes, memos, and documentation on data reuse issues, using data reported in our publications as a guide. We collected documents at every opportunity, analyzing them in concert with other forms of evidence.

1.2. Semi-structured Interviews

Over the course of our studies of CENS and DataFace, we conducted a total of 127 interviews that touched on topics of data reuse. These qualitative, open-ended interviews ranged from 45 minutes to 2 hours, with an average of 60 minutes per interview. Members of the UCLA Center for Knowledge Infrastructures and its predecessor labs conducted the interviews. All interviews were based on a shared protocol, which enables us to integrate data across these studies. Among the interview questions relevant to this comparative analysis are these: Do you use data you did not generate yourself? To what end do you use data you did not generate yourself? Where do you find these data? How do you find the data? How do you make sense of these data? What kind of analysis do you conduct on others’ data? Please describe how you use others’ data in your daily research or in your research workflows. Taken together, answers to these and similar interview questions provided self-reports that we compared to our observations of practices and reports of research activity in publications, reports, and other documents.

Systematic samples are difficult to achieve in long-term qualitative research. In these studies, we endeavored to interview participants from a broad range of disciplinary affiliations and career stages. For the CENS collaboration, participants were selected from technology research or scientific applications. Researchers in the areas of ecology, biology, marine sciences, seismology, and environmental engineering were classified in the sciences (Wallis, Rolando, & Borgman, 2013). Technology participants include those in computer science, electrical engineering, robotics, and related areas (Borgman, Wallis, & Mayernik, 2012; Mayernik, Wallis, & Borgman, 2013). For the first round of CENS interviews in 2005–2006, 22 participants were selected using stratified random sampling based on whether their research fell within the realm of science or technology (Borgman, Wallis, & Enyedy, 2007). For interviews conducted in 2009–2010, 21 participants were selected using stratified random sampling based on degrees of centrality in a coauthorship network constructed using CENS publications (Borgman, Bowker, Finholt, & Wallis, 2009; Borgman et al., 2012; Pepe, 2010, 2011). For interviews conducted in 2012, seven research projects were selected using stratified random sampling of CENS research groups based on whether they were classified as technical or scientific (Wallis, 2012). For the latter study, 34 authors were interviewed, drawn from one representative publication of each research project.
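The stratified random sampling described above can be illustrated with a minimal sketch. It is not the authors' actual procedure: the roster, the `stratified_sample` helper, and the even 11/11 split between strata are all hypothetical, chosen only so that the total matches the 22 participants reported for the first round.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, per_stratum, seed=0):
    """Draw a simple random sample of a fixed size from each stratum.

    population  -- iterable of members (any hashable records)
    stratum_of  -- function mapping a member to its stratum label
    per_stratum -- dict of stratum label -> number of members to draw
    seed        -- fixed seed so the draw is reproducible
    """
    rng = random.Random(seed)

    # Group the population by stratum.
    strata = defaultdict(list)
    for member in population:
        strata[stratum_of(member)].append(member)

    # Sample without replacement within each stratum, in a stable order.
    sample = {}
    for label in sorted(strata):
        members = strata[label]
        k = min(per_stratum.get(label, 0), len(members))
        sample[label] = rng.sample(members, k)
    return sample


# Hypothetical roster: 40 researchers, alternating between the two strata.
roster = [
    (f"researcher_{i:02d}", "science" if i % 2 == 0 else "technology")
    for i in range(40)
]

# Assumed allocation: 11 per stratum (the source reports only the total of 22).
selected = stratified_sample(
    roster,
    stratum_of=lambda r: r[1],
    per_stratum={"science": 11, "technology": 11},
)
```

The same helper could stratify on any other label, for example a band of coauthorship-network centrality, by passing a different `stratum_of` function.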

For the DataFace collaboration, ethnographic fieldwork was conducted at nine of the 11 DataFace sites: one engineering hub, two technology centers, and six laboratories. The engineering hub was responsible for creating a centralized open repository for DataFace data. This hub developed data models, metadata schemas, and a data search engine. One of the technology centers developed the ontological schemas used to describe DataFace data. The other technology center developed the data analysis software. The six laboratories selected for the study varied by disciplinary affiliations, data types produced and consumed, and model organisms used in the experiments. Participants’ disciplinary backgrounds spanned clinical genetics, computer science, engineering, dentistry, plastic surgery, and developmental biology. They produced and reused a variety of datasets, such as 3D facial images (microCT, TIFF), facial measurements, gene expression data and drawings, annotation data on gene functions, RNA-seq, and ChIP-seq data. Interviewees collected data from four animal models: zebrafish, mice, chimpanzees, and humans.

For each of the nine teams selected, Pasquetto (2018) interviewed the lead scientist, a lab manager, and one or more doctoral students or postdoctoral fellows. Most of the DataFace teams included no more than five individuals. Ethnographic visits to each lab, averaging 10 days, included observing team meetings and spending recreational time with participants. Ethnography of the DataFace consortium also included participating in four annual all-hands meetings, giving presentations, and informal interactions.

1.3. Qualitative Data Analysis

Throughout these studies, interviews were audio-recorded, transcribed, and complemented by the interviewers’ memos on noteworthy topics and themes. Transcripts of the CENS interviews totaled 312 pages for Round 1 (2005–2006), 406 pages for Round 2 (2009–2010), and 686 pages for Round 3 (2012). Transcripts of the DataFace interviews (2016–2018) totaled 726 pages. Ethnographic notes and documentation fill several file cabinets and considerable disk space.
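As a quick check on the scale of the interview corpus, the page counts reported above can be tallied. This is only an arithmetic sketch; the dictionary keys are labels chosen here for illustration, and the totals are derived from the four figures given in the text rather than reported by the authors.

```python
# Transcript page counts as reported in the text (labels are illustrative).
transcript_pages = {
    "CENS Round 1 (2005-2006)": 312,
    "CENS Round 2 (2009-2010)": 406,
    "CENS Round 3 (2012)": 686,
    "DataFace (2016-2018)": 726,
}

# Subtotal for the three CENS rounds, then the total across both consortia.
cens_pages = sum(
    pages for label, pages in transcript_pages.items() if label.startswith("CENS")
)
total_pages = sum(transcript_pages.values())

print(f"CENS transcripts: {cens_pages} pages")    # 1404
print(f"All transcripts:  {total_pages} pages")   # 2130
```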

Overall, we conducted analytical coding of notes, memos, interviews, and other texts with NVivo software for qualitative research (QSR International, 2011). As the CENS research on data practices evolved into a longitudinal study, we developed a set of analytical categories for observation, interview protocols, and a codebook. These analytical categories from CENS were the initial basis for studies of DataFace and sites in other sciences. Full details of our data analysis processes are reported in the publications cited throughout the article.

The CENS study used the methods of grounded theory (Glaser & Strauss, 1967) to identify themes and to test them in the full corpus of interview transcripts and notes. Prior to our first formal round of interviews discussed in the article, we had already been members and observers of the CENS community for four years (Borgman, Wallis, & Enyedy, 2007). We examined these initial notes, informal interviews, and texts to identify emergent themes, then tested and refined these themes iteratively in the coding of subsequent interviews (Borgman, Wallis, & Enyedy, 2007; Wallis et al., 2007). For each round of interviews, we worked with the existing codebook to test and refine themes in the coding of subsequent interviews (Mayernik et al., 2013). With each refinement, the remaining interviews were searched for confirming or contradictory evidence. A similar process was employed to analyze DataFace’s interviews and field notes. Full details of methods and analysis for the DataFace study are reported in Pasquetto (2018).

All of these analytical tools and protocols evolved over the 16 years of research reported here. Each new grant had its own focus, and each dissertation addressed specific research questions within the scope of current funding. NVivo software was upgraded several times, which required data migration. Similarly, computing platforms were upgraded on a regular basis. The advantages of large, long-term, distributed qualitative studies, which are rare, are opportunities for multiple comparisons and for theoretical development. The disadvantages are the changing circumstances of the sites, turnover in research personnel conducting the studies (graduate students and postdoctoral fellows), differing expectations of funding agencies supporting the research program, and changes in technology. We have accommodated these disadvantages by paying close attention to variances within and between individual studies, respecting the power of our sample size, and acknowledging the limitations of the study in our conclusions.

2. Summary of Findings

For the first research question, “Where do scientists find reusable data?” we found that sources varied widely by purpose, project, and individual researcher. At the time of CENS research, from 2002 to 2012, data deposit was required only in genomics and seismology. Few of the other domain areas in CENS had archives of research data on which to draw. However, they did make regular use of databases that contained observations of the natural world, such as those of the U.S. Geological Survey (USGS) and the National Oceanic and Atmospheric Administration (NOAA), and of domain-specific databases such as collections of bird sounds. DataFace, a project that began seven years later (2009), and in a domain with a long history of data archiving, made extensive use of open archives in biomedicine. Scientists in both CENS and DataFace also asked other researchers for access to their data. Sometimes these contacts were identified through publications or presentations; other times through personal knowledge of others’ research.

Our second research question, “How do scientists reuse others’ data?” revealed a continuum of purposes for data reuse, ranging from comparative to integrative. In both CENS and DataFace, uses of external data sources for comparative purposes were by far the most common. Some of their data sources were observations collected and curated expressly for comparative purposes, such as the USGS, NOAA, ClinVar, GenBank, and GWAS (Genome-Wide Association Study) databases. These were essential sources of data for ground-truthing, calibration, and comparison. Archives of data deposited by individual researchers, such as OMIM (Online Mendelian Inheritance in Man), also were useful for comparisons. Data and literature searches were often conducted together as background for new studies. At the integrative end of the continuum of data uses, researchers reused data for new analyses, alone or in combination with other datasets. This continuum emerged in both CENS and DataFace, a decade apart, in different research domains.

Our third research question, “How do scientists interpret others’ data?” yielded the most complex findings, as expected. In both CENS and DataFace, researchers generally were able to reuse data from archives for comparative purposes, provided the documentation was sufficient for the particular application. When they wished to integrate data created by others, however, published documentation of data was less likely to suffice for interpretation. In these cases, reusers typically collaborated with data creators to conduct new analyses, test new hypotheses, or combine data from multiple studies.

3. Full Bibliography

The supplemental bibliography includes all references cited in the main text of the article and other background materials.

Disclosure Statement

Irene V. Pasquetto, Christine L. Borgman, and Morgan F. Wofford have no financial or non-financial disclosures to share for this article.

©2019 Irene V. Pasquetto, Christine L. Borgman, and Morgan F. Wofford. This supplement is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the supplement.