Lost or found? Discovering data needed for research

Finding or discovering data is a necessary precursor to being able to reuse data, although relatively little large-scale empirical evidence exists about how researchers discover, make sense of and (re)use data for research. This study presents evidence from the largest known survey investigating how researchers discover and use data that they do not create themselves. We examine the data needs and discovery strategies of respondents, propose a typology for data (re)use and probe the role of social interactions and other research practices in data discovery, with the aim of informing the design of community-centric solutions and policies.


Introduction
Stakeholders from funders to researchers are increasingly concerned with the sharing and reuse of research data (e.g. Digital Curation Centre, n.d.; Tenopir et al., 2015). Policy makers draft guidelines, systems designers create repositories and tools, and librarians develop training materials to encourage opening and sharing data, often without empirical evidence about community-specific practices (Noorman, Wessel, Sveinsdottir, & Wyatt, 2018). It is assumed that data can and will be reused if they are shared (Borgman, 2015a). Another assumption predicating reuse is that data will actually be discovered by researchers, although relatively little empirical work exists to support this assumption.
In this article, we present the results of the largest known survey examining how researchers discover and (re)use research data that they do not create themselves, so-called secondary data (Allen, 2017). We consider commonalities in practices but also examine differences, looking at how data needs and search practices vary not only by disciplinary domain but also by types of data uses. Past work has explored data search practices via in-depth interviews (Koesten, Kacprzak, Tennison, & Simperl, 2017; Borgman, Scharnhorst, & Golshan, 2019; Gregory, Cousijn, Groth, Scharnhorst, & Wyatt, 2019a). This study employs a broader approach, using a globally distributed multidisciplinary survey, with nearly 1700 respondents, to explore these practices at a larger scale.
We draw on the qualitative as well as quantitative portions of the survey to paint a detailed picture of data discovery and propose a typology for data (re)use which we use to explore the data needs and practices of participants. We also probe the role of social interactions in searching for data and explore how data search is related to other practices, such as searching for academic literature. We conclude by considering how communities of data seekers can be conceptualized, and discuss our findings in light of recent efforts to increase the discoverability of research data.

Background
Although information seeking is an extensive research field, work investigating data-seeking practices is nascent. Practices of data seeking, also referred to as data search or data discovery practices, are commonly examined through user studies of particular data platforms and repositories (e.g. Borgman et al., 2019; Murillo, 2014; Wu, Psomopoulos, Jodha Khalsa, & de Waard, 2019).
Zimmerman (2003, 2007, 2008) investigates data search practices directly, looking at the needs, discovery strategies, and criteria for evaluating data for reuse for a small group of environmental scientists. Recent work characterizes data search and evaluation practices across disciplinary domains (Gregory et al., 2019a; Gregory, Cousijn, Groth, Scharnhorst, & Wyatt, 2019b) and by data professionals within and outside of academia (Koesten et al., 2017), relying primarily on in-depth interviews with data seekers or log analyses (Kacprzak, Koesten, Ibáñez, Simperl, & Tennison, 2017).
Publishers and data repositories conduct annual surveys tracing data sharing and management practices over time (e.g. Digital Science et al., 2018; Berghmans et al., 2017). Information about data search strategies, criteria important for reuse, and the role of social communications is found within surveys designed to develop data metrics (Kratz & Strasser, 2015) and to determine factors affecting data reuse (Kim & Yoon, 2017; Yoon, 2017).
Interest in designing tools specifically for data search is growing (Chapman et al., 2019), evidenced by the development of search engines for research data (Noy, Burgess, & Brickley, 2019; Scerri et al., 2017). Despite this trend, the limited amount of user interaction data restricts how these search tools are developed (Noy et al., 2019). There is also a growing number of policies regulating open data and data sharing (European Commission, 2019), which are seen as precursors to creating the ecosystem necessary for data discovery (Borgman, 2015b). These policies often do not accurately reflect the way that opening data and data sharing are enacted within communities (Noorman et al., 2018).
The sustainability and adoption of both search systems and data policies rely on understanding and building on extant practices (Schatzki, Knorr Cetina, & van Savigny, 2001). Our work aims to provide evidence of practices of data seeking and to inform the design of community-centric solutions and policies.

Survey design
We drew heavily on the findings of our earlier interviews with data seekers (Gregory et al., 2019a) and our analytical literature review (Gregory et al., 2019b) to design a survey addressing our principal research questions (see Table 1). Our research questions were informed by user-centered models of interactive information retrieval (e.g. Ingwersen, 1992; 1996; Belkin, 1993; 1996), particularly the synthesized model of an information journey (Blandford & Attfield, 2010; Adams & Blandford, 2005), which generally posits an actor/user with an (at times unrecognized) information need who engages in an iterative process of discovery, evaluation and use.
Our survey employed a branching design consisting of a maximum of 28 individual items; nine of these items were constructed to allow for multiple responses. In addition to write-in responses for expanding on "other" answers, the survey included two open-response questions. Respondents working as librarians or in research/data support also answered a slightly modified version of the survey.1 Four multiple response questions and their associated variables are of particular importance in our analysis; we include an overview of these variables to aid in navigating our results (Table 2).

1 The survey instrument is available in Appendix A of this preprint.

Research questions    Items
RQ 1: Who are the people seeking data?    Q1, L3, L4, L5

The survey was scripted with the Confirmit survey design software (https://www.confirmit.com). We piloted the survey instrument in two phases. In the first phase, we scheduled appointments with four researchers, recruited via convenience sampling, and observed them as they completed the online survey. During these observations, we encouraged participants to think aloud and ask questions. We used these comments to modify our questions before the next pilot phase. In the second phase, we recruited an initial sample of 10,000 participants using the recruitment methodology detailed below. Once one hundred participants had begun the survey, we measured the overall completion rate (41%), taking note of the points in the survey where people stopped completing questions. We used this information to further streamline the question order and to clarify the wording of some questions before recruiting our sample.

Sampling and recruitment
We sent recruitment emails to a random sample of an additional 150,000 authors who are indexed in Elsevier's Scopus database and who have published in the past three years. The recruitment sample was constructed to mirror the distribution of published authors by country within Scopus. Recruitment emails were sent in two batches: the first to 100,000 authors and the second, two weeks later, to 50,000. One reminder email was sent to encourage participation. A member of the Elsevier Research and Academic Relations team (Ricardo Moreira) created the sample and sent the recruitment letter, as access to the authors' email addresses was restricted. Potential participants were informed that the purpose of the survey was to investigate data discovery practices. We therefore assume that participants who completed the survey are in fact data seekers. We received 1637 complete responses during a four-week survey period in September-October 2018. We recruited an additional 40 participants by posting to discussion lists in the library and research data management communities, for a total sample of 1677 complete responses.

Analysis
We used the statistical program R to perform our analysis, in particular the MRCV package (Koziol & Bilder, 2014a; Koziol & Bilder, 2014b) to analyze questions with multiple possible responses. For these questions, we tested for multiple marginal independence (MMI) or simultaneous pairwise marginal independence (SPMI) between variables using the Bonferroni correction method. This method calculates a Bonferroni-adjusted p-value (Dunn, 1961) for the χ² test of each possible 2x2 contingency table and compares it to the set significance level (α = 0.05) (Bilder & Loughin, 2015). This test can produce overly conservative results, particularly when analyzing questions with many variables. Nonetheless, this approach is preferred to traditional tests for independence, as it takes into consideration the fact that a single individual can contribute to multiple counts within a contingency table. We coded and analyzed open response questions in NVivo using a combined deductive and general inductive approach to thematic analysis (Thomas, 2006) and used R, Gephi and Tableau to create plots and visualizations.
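The pairwise testing idea described above can be illustrated with a minimal sketch. It is written in Python rather than R, it is not the MRCV implementation, and the item names and toy responses are invented purely for illustration. Each pair of items from two multiple-response questions yields one 2x2 table; the Pearson χ² p-value for that table is multiplied by the number of tables (capped at 1) and compared to α = 0.05.

```python
import math
from itertools import product


def chi2_2x2(a, b, c, d):
    """Pearson chi-square statistic and p-value (1 df) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        return 0.0, 1.0  # degenerate table: treat as no evidence of association
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > x) = erfc(sqrt(x / 2))
    return stat, math.erfc(math.sqrt(stat / 2))


def pairwise_bonferroni(responses, items_w, items_y, alpha=0.05):
    """Test every (W item, Y item) pair and Bonferroni-adjust the p-values."""
    n_tests = len(items_w) * len(items_y)
    results = {}
    for w, y in product(items_w, items_y):
        a = sum(1 for r in responses if r[w] and r[y])
        b = sum(1 for r in responses if r[w] and not r[y])
        c = sum(1 for r in responses if not r[w] and r[y])
        d = sum(1 for r in responses if not r[w] and not r[y])
        stat, p = chi2_2x2(a, b, c, d)
        p_adj = min(1.0, p * n_tests)
        results[(w, y)] = (stat, p_adj, p_adj < alpha)
    return results


# Hypothetical toy data: (count, observational, simulated, new_study, calibration)
groups = [(16, 1, 0, 1, 0), (16, 0, 1, 0, 1), (4, 1, 0, 0, 0), (4, 0, 1, 1, 0)]
responses = []
for count, obs, sim, new, cal in groups:
    for i in range(count):
        responses.append({
            "observational": obs, "simulated": sim,
            "new_study": new, "calibration": cal,
            "teaching": i % 2,  # alternates, so unrelated to data type
        })

results = pairwise_bonferroni(
    responses,
    items_w=["observational", "simulated"],
    items_y=["new_study", "calibration", "teaching"],
)
for pair, (stat, p_adj, significant) in results.items():
    print(pair, round(stat, 2), round(p_adj, 4), significant)
```

Because each respondent can tick several options, the same person contributes to several of these 2x2 tables; adjusting across all tables is what makes the procedure conservative when a question has many options.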

Reporting
Significant associations between variables are often reported by listing p-values in tabular format for each combination of possible responses. Due to the complexity of our survey design, in particular the number of multiple response variables (Table 2), we present significant associations using visualizations. These visualizations indicate if an association between variables exists; they do not indicate the value of the p-values themselves. We do this in order to increase the understandability of our results and to make them usable for a wider audience. Tables with the p-values are included in the preprint version of this article.3

Ethics
We received ethical approval from Maastricht University for the study. Participants had the opportunity to review the informed consent form prior to beginning the survey and indicated their consent by proceeding to the first page of questions.

Results and Analysis
We present our results according to the research questions presented in Table 1. We first examine characteristics of the data seekers responding to our survey, and then proceed to look at their data needs, search and discovery strategies, and evaluation and sensemaking practices.
RQ 1: Who are the people seeking data?

Respondents are globally distributed and have research experience
Respondents employed in 105 countries completed the survey. The United States, Italy, Brazil and the United Kingdom were among the most represented countries (Figure 1a). This does not directly correspond with the sampling distribution (Figure 1b), where the largest difference between recruited participants and respondents occurred in China. This lower response rate could be due to language differences, perceived power differences (Harzing, 2006), or a lack of tradition in responding to research requests from researchers from other countries (Wang & Saunders, 2012). It could also indicate that data seeking is not a common practice.
The majority of survey respondents are researchers (82%) and work in universities (69%); 40% have been professionally active for 6-15 years (Table 3). With the exception of participants recruited specifically from the library and research data management communities (or "research support professionals"), our recruitment methodology ensures that all respondents are published authors, making it likely that they have been involved in conducting research in either their past or current roles. Nearly half of those working in research support also need secondary data for their own research, in addition to supporting researchers or students.

Respondents support data sharing and reuse
While eighty percent of all respondents have shared their own data in the past, participants with longer careers have done so slightly more often. Eighty-nine percent of respondents who have worked for 31+ years have shared their data, compared to 77% of respondents working for less than five years. Personal attitudes towards data sharing and reuse differ from the perceived attitudes of peers, disciplinary communities and institutions (Figure 2). The majority of survey respondents are proponents of sharing their research data; they believe that other groups are less supportive of data sharing. Respondents believe data sharing is more strongly supported by their direct co-workers than by their disciplinary communities or institutions; they are most uncertain about the attitudes of their institutions. A similar pattern exists for attitudes toward data reuse, although there is more uncertainty involved.

Respondents belong to multiple, overlapping domains
Respondents indicated their domain(s) of specialization from a list of 31 possibilities following Berghmans et al. (2017). Engineering and technology was selected most often, followed by biological, environmental, and social sciences (Figure 3). Choosing a single domain that captures the complexity of participants' expertise, even from a list of 31 options, was challenging. Approximately half of the respondents selected two or more domains, with one quarter selecting more than three.
Figure 3 depicts the disciplinary overlaps among respondents, showing which domains were selected in conjunction with each other.
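A disciplinary overlap network of this kind can be derived from multi-select answers by counting how often two domains are chosen by the same respondent. The following is a minimal sketch under that assumption; the function name and the four example selections are hypothetical and do not reproduce the survey data or the exact procedure behind Figure 3.

```python
from collections import Counter
from itertools import combinations


def domain_cooccurrence(selections):
    """Count how often each unordered pair of domains is selected by the same respondent."""
    pair_counts = Counter()
    for domains in selections:
        # sorted(set(...)) gives each pair a canonical order and ignores duplicates
        for pair in combinations(sorted(set(domains)), 2):
            pair_counts[pair] += 1
    return pair_counts


# Hypothetical selections, one list per respondent
selections = [
    ["computer science", "arts and humanities"],
    ["engineering and technology", "computer science", "mathematics"],
    ["engineering and technology"],
    ["arts and humanities", "computer science"],
]

edges = domain_cooccurrence(selections)
# Share of respondents who selected two or more domains
multi = sum(1 for s in selections if len(set(s)) >= 2) / len(selections)

for (a, b), weight in sorted(edges.items()):
    print(f"{a} -- {b}: {weight}")
print(f"share selecting two or more domains: {multi:.2f}")
```

The resulting weighted pairs can serve as the edge list of a graph, which is the kind of input that network tools and community-detection algorithms typically expect.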
We also apply a community-detection algorithm (Blondel, Guillaume, Lambiotte, & Lefebvre, 2008) in Figure 3 to indicate groups of participants who selected similar disciplines. Based on this analysis, we see expected groupings, e.g. in the life sciences, but we also see unexpected clusters of behavior. For example, respondents selecting computer science have common behaviors with those selecting the arts and humanities, rather than with respondents selecting mathematics or engineering and technology. This could be indicative of the use of digital humanities methodologies among our participants. The connections between engineering and technology and all other community groupings are also visible in Figure 3; seventy-one percent of respondents who selected engineering and technology chose at least one other discipline.
To provide a high-level view of their data needs, respondents selected the type(s) of data that they need in their work from a list derived from the categories of research data proposed by the National Science Board (2005) and used in Berghmans et al. (2017) and Gregory et al. (2019b). While observational/empirical data were selected most frequently, 50% of participants also indicated that they need more than one type of data. Figure 4 represents the number of respondents selecting individual and multiple data types. We detected significant associations (all at p<0.001) between needing observational data and needing experimental data, between needing experimental and simulated data, and between needing derived and simulated data. We asked participants to further expand on their data needs in an open response question, which seventy-eight percent of respondents completed. We compared these responses to the types of data that participants selected, paying special attention to those who chose multiple data categories.
Respondents selecting multiple data types appear to do so for different reasons. Some require a variety of topic-specific data to conduct their research; others use a variety of data, regardless of topic, as long as it matches format and structure requirements. The following participant indicated needing observational, experimental and derived data.
[I need] large and small datasets that students can use for Data Science skill development. For example, transactional data from business, medical data such as interactions between patients and doctors, descriptive medical histories. Data need to be complex enough to be interesting and able to be parsed. (Respondent ID 613)
Data are difficult to categorize (Borgman, 2015b), in part because people may define data categories differently. While the majority of respondents stating they use census data selected observational data, others did not, choosing instead derived/compiled data. While the majority of individuals using literature corpora selected derived/compiled data, a minority selected "other," apparently not knowing which category best fit their data.
Significant associations (detected using the statistical test described in the Methodology section above) between disciplinary domain and the type of data needed are shown in Figure 5. The highest number of disciplinary associations was detected for experimental data; arts and humanities was associated with the greatest number of data types, although surprisingly not with observational data.
There are also domains where no significant associations were detected. This could be due to the composition of our sample; it could also be taken as evidence for the diversity of data types that participants within disciplines need. Slightly more than half of respondents reported needing data outside of their disciplines. This pattern holds across domains, with the exception of environmental science, where more individuals need data external to their domain (65%), and medicine, where the trend was reversed, with fewer than half needing this type of data (43%).

Data uses are surprising and expected
Although using data in ways that support research is well-documented (e.g. Wallis, Rolando, & Borgman, 2013), using data to drive new research or in teaching is not as well documented (Gregory et al., 2019b). Surprisingly, our respondents selected using data as the basis for a new study most often, followed by teaching and preparing for a new project or proposal (Figure 6). Other common uses include experimenting with new methodologies and techniques, such as developing data science skills, and completing particular data-related tasks, e.g. trend identification or creating data summaries (as suggested by Koesten et al., 2017). Less than 0.5% of responses indicated needing data for other purposes, suggesting that the list of uses in our survey is fairly complete.

Data uses span research phases
Table 4 identifies significant associations, detected as described in the Methodology section, between the types of data uses shown in Figure 6 and types of needed data (Table 4: Associated with these types of needed data) and other data uses (Table 4: Associated with these types of data uses). We present this information along a typology based on different phases of research. In this typology, we agree with van de Sandt, Dallmeier-Tiessen, Lavasa, & Petras (2019) that reuse is one type of data use, but similar to Fear (2013), we view reuse as being the basis for a new study.
Depending on research practices, the data uses in our typology could occur in different phases. Uses that could particularly fall into two bordering categories are marked in grey. For example, integrating different datasets could also occur when conducting research, and verification of data could also be considered part of data analysis. Although not marked in grey, data analysis and sense-making tasks are likely to occur throughout all phases of research. This is reflected in the results, where analysis activities are associated with every other data use, with the exception of instrument or model calibration. Our typology has similarities with models of the research lifecycle process (e.g. Jisc, 2013), but also differs from these models. Typical models of research lifecycles portray research that uses data created by the researcher, rather than secondary data. They also tend to reduce the complexity of research processes, ignoring the interwoven nature of tasks involved in research (Cox & Wan Ting Tam, 2018), and depict data cycles as independent workflows (e.g. UK Data Service, 2019). With this typology, we attempt to nuance the uses of secondary data throughout phases of research, recognizing how data uses are associated with multiple data types and uses and highlighting that these uses are associated with multiple work phases.
Using data as the basis for a new study is associated in our results with needing observational data. Observational data are not significantly associated with research execution tasks, but they are associated with activities in both data analysis and with teaching. Experimental data are related to uses involved in conducting research, as well as to comparison. Derived data are associated with all activities in both the research process and data analysis/sensemaking phases; simulation data are primarily associated with conducting research.
Using data as the basis for a new study is associated with uses such as project creation and preparation, data analysis and sense-making, or teaching, but is not associated with any uses in the research execution phase. Teaching is associated with only a few other uses, most of which fall within the project creation phase. Data integration is associated with activities across phases of research; calibration, however, is exclusively associated with other research process tasks.

Data uses are common, but their enactments are complex
Disciplinary domains also shape how data are used. Figure 7 indicates significant associations between disciplines and uses in our sample. Most of these associations are for data uses that fall within the research execution phase of the above typology, particularly using data as inputs and for calibration, and domains that typically make use of modelling or computational research methods.
While researchers in multiple domains may use data for the same purposes, uses will be enacted in different ways and have different meanings in various disciplines and contexts (Borgman, 2015b; Leonelli, 2015). For example, while data verification was selected at the same rate by both astronomers and arts and humanities respondents (9% of responses), the practice and meaning of verifying data will be different in each of these disciplines.
RQ 3: How are people discovering data?
Respondents believe that discovering data is a sometimes challenging (73%) or even difficult (19%) task. The greatest challenge participants face is that data are not accessible (68% of respondents), followed by the fact that data are distributed across multiple locations (49%). One third of respondents identified inadequate search tools, a lack of skill in searching for data, or the fact that their needed data are not digital as being additional challenges.

Via academic literature
Thirty percent of respondents reported no difference in how they find literature and how they find data. Fifty-two percent stated that the two processes are sometimes the same and sometimes different, whereas they were always different for 18%. Respondents saying that the two processes are sometimes or always different (n = 1178) were asked to explain the differences in an open response question.
One of the key differences participants identified between literature search and data discovery is the sources that are used.
I check other channels for data than for literature, e.g. if a project produces data, I check the project's site directly for their data and hope for links to repositories. (Respondent ID 4001)
Academic literature could be found through different portals... To receive data, one often needs to know where to find it. For example, the name on the database and then contact the administrator for the database if you can't extract the data directly from the database. (Respondent ID 4008)
Yet the academic literature itself is a key source for discovering data for researchers, as are general search engines (e.g. Google) and disciplinary data repositories (Figure 8). Research support professionals rely less on the literature, more frequently turning to a variety of sources in their search for data (Figure 8). The distribution of sources presented in Figure 8 generally holds across disciplines for researchers, although there are some differences, such as in economics, where governmental sources of data are used more frequently than the literature, or in computer science, where 72% of respondents occasionally or often consult code repositories.
Figure 8. Sources used to find data by researchers (including students, managers, and others, n = 1630) and research support professionals (n = 47). Percentages represent the percent of respondents in each category. Sources are listed in order of decreasing importance for researchers.
Respondents use literature as a source of data, plucking data from reported tables and graphs, but they also use the literature to track down the original data, making use of behaviors common in literature searching, such as citation chasing (Figure 9). It is common for respondents to first find the literature, and then use the literature as a gateway to locating the data. This strategy is often planned, but it also happens serendipitously while reading or searching for literature. Roughly two thirds of participants also often or occasionally find data serendipitously outside of the literature (e.g. via email or conversations with colleagues) or in the course of sharing or managing data.
Finding data is different because it often occurs as a result of finding academic literature. (Respondent ID 738)
Literature is more direct; data is more like "bonus" finds. One finds interesting data in other contexts of work in a publication, one can contact the author to ask for the data. (Respondent ID 3179)
Figure 9. Strategies for using the academic literature to discover data. Question asked to respondents who indicated using literature as a source. Percentages are the percent of responses to the question; multiple answers were possible (n = 4135).

Via social connections
Using social connections and personal outreach to discover and access data is another important difference identified between literature search and data search. This is reflected in Figure 8, where only 15% of researchers never make use of personal networks in data discovery.
Unlike academic literature where you get the data by accessing the journal, finding data often requires contacting the institution that created the data. (Respondent ID 2357)
I use personal networks and public access datasites to discover data, then I usually have to submit a proposal and get it accepted in order to get access to the data. I have not had the experience of just downloading data directly without going through a permission process. (Respondent ID 1416)
Attending conferences and having discussions within personal networks are the most frequent ways of mobilizing social connections to discover data (Figure 10). While personal networks remain important in actually gaining access to data, contacting data authors directly is the most often reported method for accessing data. Forming new collaborations with data creators also appears to be more important in accessing data than in first discovering them. These patterns hold, regardless of the types of data that respondents need or their intended use for the data, although there are some disciplinary differences in the percentage of respondents discovering or accessing data via conference attendance or forming new collaborations. The need to use personal connections in accessing data also reflects the finding that access remains the largest hurdle for participants.

Via "mediated" search
For some respondents, actually locating and accessing data is a mediated process, mediated not through the work of information professionals (although this sometimes happens; see Figure 8), but rather through the literature and through personal connections. Numerous respondents first discover or encounter data via an "intermediary" source, such as an article or a conversation with a colleague; they then turn to another source, such as a data repository or Google, to search specifically for the known data.
I generally do not search for data blindly but I would normally know that it already exists through some previous interaction (reading scientific publication, personal communication). (Respondent ID 679)
General search engines (e.g. Google) can also serve in intermediary roles, as respondents use them not only to find data themselves, but also to locate data repositories. These two practices (using Google for known-item searches and to locate repositories rather than data) could contribute to the fact that 39% of respondents found their searches to be either successful or very successful. However, the majority of respondents using general search engines reported mixed success (54%), with 7% being rarely or never successful, perhaps reflecting the higher failure rate in general in academic search compared to general web searching (Li, Schijvenaars, & de Rijke, 2017).

Via specific searches plus casting a wide net
Much of data searching is very specific. Participants rely on particular, known data repositories and sources. Respondents have specific requirements and search parameters, and seek data for specific purposes and goals, as is evidenced in our typology of data (re)use (Table 4). This is in contrast to literature searching, where participants report using cross-disciplinary sources, such as the Web of
RQ 4: How are people evaluating and making sense of data for (re)use?

By using varied evaluation criteria and sensemaking strategies
Respondents require a variety of information about the data and make use of different sensemaking strategies (Figure 11). Eighty-nine percent of respondents reported that information about data collection conditions and methodology was important or extremely important in their decisions; information about data processing/handling as well as topical relevance were also ranked highly (Figure 11a). The ease (or difficulty) of accessing data is also very important to 73% of participants. While respondents take the reputation of the data creator into consideration, the reputation of the source of the data (e.g. the repository or journal) is slightly more important. Other information identified in open responses includes the timeliness of data, prior usage, and the cost of obtaining data, which can determine the type of research that is pursued.
Patent data is free to access. Data on company deals and revenue can sometimes be paid. That requires seeking research funding and typically delays the process. Using publicly free data is quicker. (Respondent ID 3220)
The academic literature plays a key role not only in discovering data, but also in understanding them. Respondents consult associated articles, as well as data documentation and codebooks (Figure 11b). Nearly three-quarters of respondents report engaging in exploratory data analysis, i.e. statistical checks or graphical analysis. Participants also report triangulating data from multiple sources as a way of understanding and determining the validity of data (e.g. Respondent IDs 3131, 2444, 1949).

By using social connections in sensemaking
Fifty percent of participants reported conversations with personal networks as being key to making sense of data. Conversations with networks are used more often in sensemaking than in either discovering or accessing data (Figure 10). Contacting data creators to make sense of data does not happen as frequently as discussions with personal networks. Respondents also attend conferences and form new collaborations to make sense of data (see Figure 10). Some variations in the pattern in Figure 10 for sensemaking exist across disciplinary domains (e.g. see Figure 12), although engaging in conversations with personal networks is almost always chosen most frequently. These variations are likely the result of different disciplinary norms and infrastructures, which influence patterns of collaboration and communication (e.g. the role of conference attendance and publishing norms, or the existence of disciplinary mailing lists).

By using different contextual information for different purposes
Different data uses are associated with needing different information about data (see Figure 13). In Figure 13, we classify the evaluation criteria presented in Figure 11a into content-related information (e.g. data collection methods and conditions, the relevance of data to a topic, the exact coverage of the data), structure-related information (format, size, the existence of detailed documentation and metadata), access-related information (ease of access, licensing) and social information (reputation of data creator and source, knowing the data creator). We then plot the significant associations that exist between uses in the (re)use typology and these evaluation criteria.
This analysis allows us to begin to identify the types of information needed by respondents in different research phases. It also allows us to identify gaps. Most of the detected associations occur between content-related information and data uses in the project creation/preparation or analysis/sensemaking stages of our (re)use typology. The fewest significant associations were detected with calibration and benchmarking. Only one association was found for structure-related information, between teaching and data format, whereas the importance of source reputation spans nearly all research phases.
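The grouping step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four category labels and the criteria come from the text, but the identifier strings, the function name `count_by_category` and the example pairs are hypothetical and are not part of the survey analysis itself.

```python
from collections import defaultdict

# Category labels and criteria follow the classification described in the text;
# the short identifier strings are hypothetical names chosen for this sketch.
CRITERIA_CATEGORY = {
    "collection_methods": "content",
    "collection_conditions": "content",
    "topic_relevance": "content",
    "coverage": "content",
    "format": "structure",
    "size": "structure",
    "documentation": "structure",
    "ease_of_access": "access",
    "licensing": "access",
    "creator_reputation": "social",
    "source_reputation": "social",
    "knowing_creator": "social",
}

def count_by_category(associations):
    """Count significant (data use, criterion) pairs per information category."""
    counts = defaultdict(int)
    for _use, criterion in associations:
        counts[CRITERIA_CATEGORY[criterion]] += 1
    return dict(counts)

# Illustrative input only: the teaching/format pair is reported in the text;
# the remaining pairs are placeholders, not results taken from the survey.
example_pairs = [
    ("teaching", "format"),
    ("preparing a new project", "collection_methods"),
    ("preparing a new project", "ease_of_access"),
]
print(count_by_category(example_pairs))
```

Counting associations per category in this way is one simple means of spotting the gaps mentioned above, such as categories with very few detected associations.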

By establishing trust and data quality
The transparency of data collection methods, followed by the reputation of the source and a minimum of errors, are critical in trust development (Figure 14a). In some disciplines, though, a completely error-free dataset may actually raise suspicions, as expected errors can be a sign of validity or human involvement. Establishing data quality also depends heavily on the absence of errors and data completeness. Both developing trust and determining quality involve social considerations (Yoon, 2017; Faniel & Yakel, 2017). Although respondents across disciplines consistently ranked having a personal relationship with the data creator as being unimportant in establishing trust, they still weigh other "social" factors, i.e. thinking about human involvement in data creation or the reputation of the source, in their decisions. The reputation of the data creator appears to be more important to respondents when evaluating data quality than in trust development.

Limitations
Both the survey data and our analysis methods have limitations. Our data are descriptive, not predictive, and only represent the practices of our respondents: a group of data-aware people already active in data sharing and reuse and confident in their ability to respond to an English-language survey. The statistical test used to identify associations errs on the conservative side; it is possible that some associations were not detected with this method. Nonetheless, especially given the large size of our sample, our results are of interest to those wishing to understand the practices of data seekers in academia.
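To illustrate why a Bonferroni-type adjustment is conservative, the following minimal Python sketch tests every pair of response options with a Pearson chi-square test on a 2x2 table and multiplies each p-value by the number of simultaneous tests. It is a simplified stand-in rather than the exact procedure used in this study, and the function names `chi2_2x2` and `bonferroni_pairwise` are hypothetical.

```python
import math
from itertools import product

def chi2_2x2(x, y):
    """Pearson chi-square statistic for two equal-length lists of 0/1 answers."""
    n = len(x)
    # Observed counts for the four cells of the 2x2 table.
    obs = {(a, b): 0 for a, b in product((0, 1), repeat=2)}
    for a, b in zip(x, y):
        obs[(a, b)] += 1
    stat = 0.0
    for (a, b), o in obs.items():
        row = sum(v for (i, _), v in obs.items() if i == a)
        col = sum(v for (_, j), v in obs.items() if j == b)
        expected = row * col / n
        if expected == 0:
            return 0.0  # degenerate table: an option chosen by nobody or by everybody
        stat += (o - expected) ** 2 / expected
    return stat

def bonferroni_pairwise(items_a, items_b):
    """Return {(name_a, name_b): adjusted p-value} for every pair of options."""
    m = len(items_a) * len(items_b)  # number of simultaneous tests
    adjusted = {}
    for (na, xa), (nb, xb) in product(items_a.items(), items_b.items()):
        # Survival function of a chi-square variable with 1 degree of freedom.
        p = math.erfc(math.sqrt(chi2_2x2(xa, xb) / 2))
        adjusted[(na, nb)] = min(1.0, p * m)  # cap adjusted values at 1
    return adjusted
```

Because every p-value is multiplied by the total number of pairs, a weak but real association can fall above the 0.05 threshold once many options are compared at once, which is the sense in which some associations may go undetected.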

Discussion
We identify and apply four analytical themes to further discuss our quantitative and qualitative findings about practices of data seeking and reuse. We also consider each theme's relation to recent efforts to increase the discoverability of research data before concluding by suggesting future areas for both practical and conceptual work.

Communities of data seekers
The term "community" is often used without considering how communities are formed or their exact composition. Community boundaries are shifting and porous, rather than fixed and stable, and individuals often belong to multiple communities simultaneously (Birnholtz & Bietz, 2003).
We see this clearly in our data. Although communities are typically thought of in terms of disciplinary domains, more than half of our respondents identified with multiple disciplines. Fifty percent also indicated needing data outside of their domains of expertise, perhaps reflecting funders' efforts to promote interdisciplinary research (Allmendinger, 2015). Data communities can also be thought of in terms of the type of data that a particular group uses (Cooper & Springer, 2019; Gregory et al., 2019b).
However, we show here that respondents need multiple types of data for their work, and that these data needs can be difficult to classify in broad terms.
Communities can form around particular methodologies and ways of using and working with data (Leonelli & Ankeny, 2015), as is the case, e.g. in the digital humanities and sociology or economics (Levallois, Steinmetz, & Wouters, 2013). Our (re)use typology (Table 4) allows for conceptualizing data communities in terms of broad uses of data (e.g. using data in conducting research) and in more specific terms (e.g. using data for calibration). We also found that particular data uses are associated with needing certain information about data (Figure 13). Content-related metadata, such as information about collection conditions and methodologies, is important in preparing for new projects; structure-related metadata, e.g. format, is important in teaching. We saw a similar relationship, particularly for teaching data science, in our qualitative data. This suggests another way of conceptualizing data seeking communities: by broad research approaches. Individuals relying on data science techniques, no matter their discipline, may rely more on structure-related information when evaluating data for (re)use; content-specific considerations could be more important for more traditional research approaches.
Open data policies and guidelines recognize the importance of communities, but often equate communities with disciplinary domains. The FAIR data principles (Wilkinson et al., 2016), for example, call for the use of domain-relevant community standards as well as relevant attributes to facilitate findability and reuse. Our analysis encourages a multi-dimensional way of thinking about communities, recognizing that community-relevant metadata can be defined by considering other factors (e.g. data use) in conjunction with disciplinary domains.

Interwoven practices
Data discovery is interwoven with other (re)search practices, particularly searching for academic literature. Roughly eighty percent of respondents stated that their practices for finding data and literature are either sometimes or always the same. The academic literature itself is the go-to source for finding data for the majority of participants. Despite the immature state of data citation practices in many disciplines (Robinson-Garcia et al., 2016), respondents use a strategy common in literature searching, following citations, to locate data from the literature.
Most data citations indicate some type of data "usage" (Park & Wolfram, 2017), but little is known about why people cite data or the details of how data have been used in a work (Silvello, 2018). This presents an additional challenge for people seeking data to use for a particular purpose. A citation model that typifies data uses could aid in the discovery and evaluation process and could also add value to the multiplicity of data uses that we observed.
The use of persistent identifiers (PIDs) for data, particularly within data citations (e.g. Fenner et al., 2019), is key to sustainably linking data with related literature. These technical solutions do not exist in isolation from social and economic factors, however; the persistence of PIDs relies on the long-term sustainability of the organizations and infrastructures assigning them (see Bilder, Lin, & Neylon, 2015; Lambert, 2019). While we believe that solutions should be built on existing practices, we also recognize the need to be aware of the limitations of current archiving, publishing and search infrastructures when doing so.

Social connections
Discovering and accessing data are also mediated by personal networks. Respondents find out about data from their connections and then hunt the data down digitally. This process also occurs in reverse: respondents find data digitally and then access the data by personally contacting data creators. The use of social connections in discovery and sensemaking is intertwined with discipline-specific practices of communication (e.g. the role of conference attendance) and collaborations.
Participants identify using social connections as an important difference between searching and accessing data, as opposed to literature. This difference could be a result of the fact that infrastructures to support data search and access are still in development. It could also be due to the complexity of the sociotechnical issues surrounding data access. Researchers question which data to make available for whom (Levin & Leonelli, 2017), and sensitive data containing private information about participants cannot be made openly accessible.
Accessible data, as defined by the FAIR principles, do not necessarily need to be open data (Mons et al., 2017). Access to data can be mediated by automated authorization protocols (Wilkinson et al., 2016), but automatic denials of access may not mean that data are completely closed to a human data seeker. If access is denied, researchers can still contact data authors directly to learn more about restrictions and possibly form collaborations that would enable reuse (Gregory et al., 2018). Our results also show that ease of access is a top consideration in using data, especially during early phases of research (Figure 13). As certain data become easier to seamlessly and automatically access, other data, those that are more challenging to access, will likely not be used as often, which will shape the research that is or is not pursued.

Practices in flux
Practices and infrastructures are closely linked (Shove, Watson, & Spurling, 2015); this is especially true for practices of data discovery and reuse. We see this in the tension that we found between specific and haphazard search practices. For some respondents, data infrastructures are still in a state of development, which requires casting a wider net to locate appropriate sources. For others, finding data involves going directly to a particular, well-known data repository in the field.
Data infrastructures consist of assemblages of policies, people, technology and data (Borgman et al., 2015; Edwards, 2010). As data are described in more standardized ways, repositories will have different methods of structuring data, be linked to other data and repositories more seamlessly, and will build new services. These services and linkages will change how data seekers interact with and discover data.
Innovations combining new technologies with existing practices will not only alter current practices, but will also bring new considerations to the forefront. Executable papers (e.g. Gil et al., 2016), where readers can interact with data directly, build on linkages between literature and data searching as well as the importance of exploratory data analysis in sensemaking. They also blur the line between where a paper ends and data begins. As the boundaries between data and papers become less defined, the importance of archiving those data in sustainable ways (e.g. Vander Sande, Verborgh, Hochstenbach, & Van de Sompel, 2018) and questions of properly citing data creators, rather than paper authors, will become more visible.

Conclusion
Possible practical applications for this work are many. Designers of data discovery systems could use the associations we detected between types of data and data uses as a starting point to develop recommender systems. They could also work to integrate data search tools with other research and data infrastructures, as we also recommend in Gregory et al. (2019a). Developers of data metrics could incorporate the evaluation criteria and types of data uses as possible attributes when creating measurements; librarians could build on our findings about sensemaking to design trainings.
There is also room for further conceptual work. Designing useful, sustainable tools and services requires considering the interconnections between different practices, infrastructures and communities that we have begun to investigate here. Further conceptual work needs to be done to create ways of highlighting these connections in a way that can be easily communicated and that can practically inform design.
Other. Please specify ____________
Q8b: How successful are you at finding data with a general search engine (e.g. Google)?
Please select one answer: Very successful; Successful; Sometimes successful, sometimes not; Rarely successful; Not successful
Q9: How frequently do you find data in the following ways?
Please select one answer per row (Often; Occasionally; Never):
By actively searching for data in an online resource
Serendipitously, when searching for something else (e.g. when looking for journal articles or news)
Serendipitously, when NOT actively looking for something else (e.g. via an email notice or interaction with a colleague)
In the course of sharing or managing my own data
Q10: Please indicate if you use the following to discover, access, or make sense of data.
Please write your answer in the box below:
Q12: How easy is it to find data?

Please select all that apply
Please select one answer: Easy; Sometimes challenging; Difficult
Q12a: Why is it challenging to find the data that you need?
Please select all that apply:
The data are not accessible (e.g. behind paywalls, held by industry).
I don't know where or how to best look for the data.
The data are located in many different places.
The data are not digital.
Online search tools are inadequate.
I do not have the personal network needed to find or access the data.
Other. Please specify ____________
Please select all that apply: Students; Researchers; Industry employees; Other. Please specify ____________

L5: How do you support people with their data needs?
Please select all that apply:
I teach people about data management planning (e.g. through consultations, workshops, etc.).
I teach people how to discover and evaluate data (e.g. through consultations, workshops, etc.).
I find data for people.
I help people to curate their data.
I find literature for people.
Other. Please specify ____________

Figure 1. a) Number of respondents by country of employment (n=1677). b) Percent of recruited participants by country compared to percent of respondents by country (n=1677).

Figure 2. Respondents' beliefs about how they and other groups feel about data sharing and reuse (n=1677).

Figure 3. Disciplinary domains selected by respondents (n=1677). Node size represents the number of respondents selecting each discipline. Edges represent number of respondents selecting both of the connected disciplines. Colors represent groups of participants who selected common multiple domains.

RQ 2: What data are needed for research and how are those data used? Data needs are diverse and difficult to pigeonhole.

Figure 4. Question from survey with descriptions of data types. Node size in visualization represents number of respondents selecting a data type. Edges represent number of respondents selecting both of the connected data types. Color represents data type. Number of respondents selecting each data type and multiple data types shown in parentheses (n=1677).

Figure 5. Significant associations between disciplinary domain and needed data. Associations detected using adjusted Bonferroni test for simultaneous pairwise marginal independence (significance level: p < 0.05, n=1677).

Figure 6. Reasons why respondents need secondary data. Multiple responses possible. Percents are percent of responses to question (n=8839).

Figure 7. Significant associations between disciplinary domain and data use. Associations detected using adjusted Bonferroni test for simultaneous pairwise marginal independence (significance level: p < 0.05, n=1677).

Figure 10. How respondents make use of social connections in discovering data (n=3311), accessing data (n=3589) and making sense of data (n=3031). Percents are percent responses for each option; multiple responses possible.

Figure 11. a) Information used in evaluating data for reuse (n=1677). Percents are percentages of respondents. b) Sensemaking strategies (n=1677). Percents are percentages of respondents.

Figure 13. Significant associations between types of data use and information important in evaluating data. Associations detected using adjusted Bonferroni test for multiple marginal independence (significance level: p < 0.05, n=1677). Colors represent phases of (re)use typology. Shapes of points and brackets represent classifications of evaluation criteria.

Figure 14. a) Importance of criteria used to establish trust in secondary data (n=1677). b) Importance of criteria used to establish quality of secondary data (n=1677).
Lack of errors would not necessarily help establish trust - errors are normal, so a perfect dataset without errors might be a fabricated dataset. I would expect to see standard errors, within expected parameters. (Respondent ID 3970)
If you know the field you also know what to look for with respect to unreliable data. Sometimes the occasional error actually speaks to the reliability of a dataset: It indicates a person was involved somewhere in data entry. (Respondent ID 1648)
mailing lists or discussion forums
Q11: Do you discover data differently than how you discover academic literature?
Please select one answer: Yes; Sometimes; No
Q11a: How is your process for finding data different than your process for finding academic literature?
other information you consider when deciding whether to use or not secondary data.
Please write your answer in the box below:
Q16: Please indicate the importance of the following in helping you to establish the quality
creator
Detail or amount of work done to prepare data
Consistency of formatting
Please specify any other important aspects you consider to help establish the quality of secondary data.
Please write your answer in the box below:
Part 4: Demographics
You are nearly at the end of the survey. Below are some questions to help us classify your answers.
D1: How many years of professional experience do you have in your field?
comments: Do you have anything else that you would like us to know?
Please write your comments in the box below:
Additional Questions Asked to Research Support Professionals
L3: Do you use or need secondary data for your own research or to support others?
Please select one answer: For my own research; To support others; For both my own research and to support others
L4: Who are the people whom you support?

Figure 16a. P-value table for Table 4: Significant associations between types of data use and needed data type and other data uses. Values over 1.0000 are rounded to 1.0000; values under .0001 are rounded to 0.0000.

Table 1. Survey questions addressed by each research question. "L" questions were asked only to librarians/research support professionals. Multiple response questions indicated in bold; open response questions in italics.

Table 2. Multiple response questions, variables, and response options of particular interest in our analysis.

Table 3. Role, place of employment and years of professional experience of respondents (n=1677).

Table 4. Significant associations between types of data use, needed data type and other data uses. Grey areas represent uses that could fall within multiple bordering research phases. Colors correspond to research/work phases. Associations detected using adjusted Bonferroni test for simultaneous pairwise marginal independence (significance level: p < 0.05, n=1677).
Science or Scopus, and where the goal is often to cast a wide net to discover ideas for use in theory or concept development.
When I search for data I am pretty focused on finding only data sets that I need for a specific purpose. When I search for literature I read papers that are only peripherally related to the subject but they help me formulate new ideas. (Respondent ID 2128)
I tend to search for data by specifying parameters e.g. geographical and date coverage, or by looking for data created by a specific organisation. My literature searches are more general and don't have so many search filters applied. (Respondent ID 3688)
In contrast, searching for data is haphazard and less systematic than literature search for many respondents, requiring researchers to cast a wide net to discover distributed data.
It [data search] is a little more haphazard, as I am not as comfortable with finding data. Some of this stems from my not knowing "the" sources, but some of it is also because the finding tools are not yet available. Many times, it is a "try and see" approach. (Respondent ID 3803)
With scientific literature I know for sure where to look for it, before I start the search. In other words, sources are known to me and do not change for years. With data it is always not so. I may find data in unexpected places. (Respondent ID 3152)
This state of development causes search practices to be in flux, as individuals figure out how best to find and access the data that they need. In contrast to their practices for searching literature, their data search practices are still in formation.
Finding academic literature is part of everyday practice. The processes for finding literature are well-established and institutionally-supported. If I need to find data I have to establish my own process to locate where they are held, get permission from owners, agree access rights etc. (Respondent ID 696)

Table 4: Significant associations between types of data use and needed data type and other data uses. Values over 1.0000 are rounded to 1.0000; values under .0001 are rounded to 0.0000.

Table 4: Significant associations between types of data use and other data uses. Values over 1.0000 are rounded to 1.0000; values under .0001 are rounded to 0.0000.
Figure 17. P-value table for Figure 7: Significant associations between disciplinary domain and data use. Values over 1.0000 are rounded to 1.0000; values under .0001 are rounded to 0.0000.