Access to timely and high-quality granular data is increasingly becoming a key factor for research and evidence-based policy-making. For accessing confidential administrative data, the introduction of research data centers (RDCs) has been a success story. RDCs are restricted-access facilities, often at the premises of the data owner, that provide accredited researchers with safe access to sensitive granular data.
Although the benefits of an RDC are undisputed, this approach also entails significant costs for all stakeholders. Successful data sharing approaches therefore need to strike a balance between costs and benefits for all stakeholders. We present the BUBMIC model (BUilding Blocks for enabling MICro data access), which is intended to help inform all stakeholders’ decisions regarding whether and how to make confidential data available for research.
Furthermore, we argue that measuring the value (and costs) of data sharing is not straightforward in the field of economics. Part of this can be attributed to the absence of established measures. Better measures would ease the communication of value and would ultimately lead to more data being pulled out of silos and made available for analysis.
Keywords: democratizing data, data sharing, research data center, usage measures, social science
Access to timely and high-quality granular data is increasingly becoming a key factor for research and evidence-based policy-making. For accessing confidential administrative data, the introduction of research data centers (RDCs) has been a success story. RDCs are restricted-access facilities, often at the premises of the data owner, that provide accredited researchers with safe access to sensitive granular data.
RDCs entail significant costs for all stakeholders. Data producers must make upfront investments without knowing whether the data will even be used. They also have to take on the risk of providing confidential data for analysis. If the data are only accessible at safe centers, users must incur travel costs for long on-site visits without knowing whether their analyses will generate any significant output.
Data sharing approaches therefore need to strike a balance between costs and benefits for all stakeholders (Bender et al., 2023). A successful approach implements efficient procedures to safeguard confidential data and minimize disclosure risk. Furthermore, it also strives to maximize users’ data utility, for example, by bringing down the costs of data access to avoid unintentionally excluding certain researchers (e.g., PhD students) from using the data.
We present the BUBMIC model (BUilding Blocks for enabling MICro data access), which is intended to help inform all stakeholders’ decisions regarding whether and how to make confidential data available for research. The model describes the costs and benefits throughout the life cycles of research projects that use confidential data. It is important to capture the full life cycle, as the costs and benefits are not distributed equally among all stakeholders or across all phases of the life cycle.
Although the benefits of an RDC are undisputed, there have been few attempts in the literature so far to measure its value. In the field of economics, this can partly be attributed to the absence of established measures. Better measures would ease communication of value and would ultimately lead to more data being pulled out of silos and made available for analysis.
Research data centers (RDCs) represent a generally recognized operational approach to providing access to confidential data for research in the public sector. This approach entails significant costs for all stakeholders. Data producers must make upfront investments without knowing whether the data will even be used and also take on the risk of providing confidential data for analysis. If the data are only accessible at safe centers, users must incur travel costs for long on-site visits without knowing whether their analyses will generate any significant output.
Data sharing approaches therefore need to strike a balance between costs and benefits for all stakeholders (Bender et al., 2023). A successful approach implements efficient procedures to safeguard confidential data and minimize disclosure risk. Furthermore, it also strives to maximize users’ data utility, for example, by bringing down the costs of data access to avoid unintentionally excluding certain researchers (e.g., PhD students) from using the data.
In this article, we present the BUBMIC model (BUilding Blocks for enabling MICro data access), which describes the costs and benefits throughout the life cycles of research projects that use confidential data. The model is intended to help stakeholders decide whether and how to make data available for research. We observe that more work needs to be done in order to produce reliable measures for the value of data sharing for stakeholders in an RDC context.
RDCs are usually based on the “Five Safes” framework (Desai et al., 2016; Green & Ritchie, 2023a; Ritchie, 2017), which assesses the effectiveness of control measures (typically a combination of anonymization techniques and organizational measures) against risks to the maintenance of confidentiality and the protection of reporting agents. It differentiates between safe projects (is the use of the data appropriate?), safe people (can the researchers be trusted to use the data in an appropriate manner?), safe data (is there disclosure risk in the data itself?), safe settings (does the access facility limit unauthorized use?), and safe outputs (are the statistical results non-disclosive?).
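As a rough illustration, the framework's five questions can be encoded as a simple review checklist. The function and variable names below are our own illustrative choices, not part of the Five Safes framework itself:

```python
# The Five Safes as a simple review checklist (illustrative encoding;
# the guiding questions are taken from the framework as described above).
FIVE_SAFES = {
    "safe projects": "Is the use of the data appropriate?",
    "safe people": "Can the researchers be trusted to use the data appropriately?",
    "safe data": "Is there disclosure risk in the data itself?",
    "safe settings": "Does the access facility limit unauthorized use?",
    "safe outputs": "Are the statistical results non-disclosive?",
}

def review(answers: dict) -> bool:
    """An application passes only if every dimension is answered affirmatively."""
    return all(answers.get(dimension, False) for dimension in FIVE_SAFES)

# An application satisfying all five dimensions passes the review.
print(review({dimension: True for dimension in FIVE_SAFES}))  # True
```

The encoding makes explicit that the dimensions are complementary controls: weakening any one of them (for example, answering "no" on safe outputs) fails the review as a whole.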
Many papers discuss the costs and risks incurred by data producers when providing access to researchers; the Five Safes framework is a good example of this, as it frames decisions from the perspective of data producers (for example, Griffiths et al., 2019; Green & Ritchie, 2023a; Jang et al., 2023). However, in this kind of access model, researchers incur costs and risks, too. Alongside real costs, such as traveling, there are other, often neglected, risk dimensions. These include, among others, potential risk of censorship of undesirable topics by the data producer, insufficient data descriptions or data quality, incorrect output checking, and misuse of researchers’ potential analysis ideas by the data producer.
Therefore, when sharing and accessing data, trust must go in both directions: the data producer needs to trust that the researcher is not doing harm to the data, and the researcher needs to trust that the data producer is not doing harm to the analysis. The BUBMIC model is, in our view, a balanced way of describing the costs and benefits throughout the life cycles of research projects that use confidential data. Capturing the full life cycle is important, as the costs and benefits are not distributed equally among all stakeholders or across all phases of the life cycle. In addition, stakeholders may have diverging opinions with regard to the costs and benefits of research.
The BUBMIC model is divided into three building blocks, two of which are dedicated to the workflows of a representative research project. These two building blocks concern the technical and procedural requirements and the safe results, respectively (see Figure 1 and, for a detailed description of the BUBMIC model, see Bender et al., 2023).
In the ‘data preparation’ part of the model, data producers bear most of the costs of data access. They must incur costs for making data ready for analysis by building data pipelines and providing meaningful descriptions. Furthermore, in line with the Five Safes framework, the data provider must also decide on the appropriate level of detail for the data (Green & Ritchie, 2023a). Data users, for their part, incur costs for reading data reports or analyses by other researchers in order to determine whether the content and level of detail of the data are sufficient for their planned analyses.
Data providers also incur significant upfront costs in the ‘data access’ part of the BUBMIC model. They have to implement technical and organizational measures to safeguard the data, while also allowing researchers to analyze the data in an efficient way. In addition, they need to develop procedures to manage applications and provide guidance to users, if needed. Researchers, on the other hand, are often required to complete a significant amount of paperwork to comply with the Five Safes framework and be granted access to the data.
In the ‘data analysis’ part of the model, users must travel to the data producer’s safe environment. Prior to this, they must familiarize themselves with the applicable rules and comply with any additional regulation set out by the data provider with regard to programming or documentation. Finally, in the ‘data output’ part of the model, both data users and data producers must incur costs for output checking, as only safe results may leave the safe environment and be published. At the same time, researchers must incur costs for programming to ensure that they obtain the results that they need for their publications.
The third building block of the BUBMIC model explicitly introduces the value for stakeholders from access to confidential data. The concept of ‘value’ described here uses objective criteria and can therefore be measured independently of the data-providing institution. This is important in order to prevent data producers from rejecting ‘undesirable’ research based on their own concepts of value.
In a recent analysis, Blaschke and Hirsch (2023) attempt to quantify the value of data sharing for an RDC. Their starting point is to apply the cost-utility framework used to evaluate investments in clinical trials on public health (e.g., Johnston et al., 2006) to an RDC that shares economic and financial data. In their article, Johnston et al. (2006) use this framework to compare the costs of clinical trials with the associated benefits in the form of changes in medical treatment resulting from successful trials. Value is thus determined through a counterfactual analysis (what if only the second-best medical treatment were available?), a method that other papers (e.g., Bakker, 2013; Nagaraj, 2022) also use when attempting to quantify the value of data.
The literature shows that data users obtain benefits from publication. For example, Swidler and Goldreyer (2002) estimate the lifetime present value of a publication to be between about US$20,000 and $34,000. In this setting, identifying a counterfactual condition involves ascertaining how much the value of a publication would have changed if the user had not had access to RDC data, but only to the next-best data.
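To make the counterfactual logic concrete, the following sketch works through a stylized present-value comparison. All numbers (annual benefit, horizon, discount rate) are hypothetical assumptions chosen for illustration; they are not estimates taken from Swidler and Goldreyer (2002) or any other source:

```python
# Illustrative sketch: lifetime present value of a publication, and the
# share attributable to the data via a counterfactual comparison.
# All inputs below are hypothetical assumptions for exposition only.

def present_value(annual_benefit: float, years: int, rate: float) -> float:
    """Discounted sum of a constant annual benefit over the given horizon."""
    return sum(annual_benefit / (1 + rate) ** t for t in range(1, years + 1))

# Assumed: with access to the RDC data, the publication yields a
# $2,000 annual benefit over 20 years, discounted at 5%.
pv_with_rdc_data = present_value(2_000, years=20, rate=0.05)

# Counterfactual: with only the next-best data, assume a weaker paper
# worth $1,200 per year over the same horizon.
pv_next_best = present_value(1_200, years=20, rate=0.05)

# The value attributed to the data is the difference between the two.
value_attributable_to_data = pv_with_rdc_data - pv_next_best
print(f"PV with RDC data:         ${pv_with_rdc_data:,.0f}")
print(f"PV with next-best data:   ${pv_next_best:,.0f}")
print(f"Value attributed to data: ${value_attributable_to_data:,.0f}")
```

The sketch also shows why the missing counterfactual is the crux of the problem: without a defensible estimate of the next-best benefit stream, the difference, and hence the value of the data, cannot be pinned down.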
Blaschke and Hirsch (2023) argue that it is unclear how much of the value of a publication can be attributed to the data because, in the RDC context, there is no established approach to identifying the counterfactual data. This is because the data in a public sector RDC likely come from entities fulfilling their regulatory requirements. It is thus likely that the data exhibit unique features (e.g., in terms of frequency or granularity), so that only imperfect substitutes exist. In fact, this may very well be the reason why the data are so appealing and why users incur costs to access them.
Data producers also need to invest resources in capturing the societal value of providing access to researchers. However, as Blaschke and Hirsch (2023) argue in their paper, measuring the value of research becomes more challenging the further away we move from the research analysis. Indeed, the literature seems to disagree on both the definition and the measurement of societal impact (i.e., ‘end outcomes’) (Bornmann, 2013; Bührer et al., 2022; Lane, 2009; Sørensen et al., 2021).
End outcomes, such as societal impact, are the furthest away from the research analysis and are confined to the non-academic sphere. By contrast, the closest in time to the research analysis is publication (i.e., ‘immediate outcomes’), followed by ‘intermediate outcomes,’ which comprise the dissemination to and use of research in policy and practice. ‘Immediate outcomes’ are thus confined to the academic sphere, while ‘intermediate outcomes’ attempt to measure the value generated from cross-sector knowledge transfer (Sørensen et al., 2021).
Assessing value for economic research intensifies this challenge further “because the creation and transmission of knowledge and technologies result from complex human and social interactions” (Lane, 2009, p. 1274). Indeed, research in economics rarely leads to linear outcomes (Lane, 2009) such as patents or products (Banzi et al., 2011; Rollins et al., 2020) or other easily measurable ‘intermediate outcomes’ regularly used in assessments of health care.
Against this background, Blaschke and Hirsch (2023) take a different and more traditional approach, adopting the ‘payback’ framework (Buxton & Hanney, 1996; Rollins et al., 2020) to evaluate the benefits of RDCs. It should be noted that the payback framework shares the same issues as the naïve approach when measuring value and costs to the public. As a result, Blaschke and Hirsch (2023) only consider the ‘immediate outcomes’ of knowledge production and capacity building (Banzi et al., 2011; Rollins et al., 2020), so that their analysis yields a lower-bound estimate of value.
They start by counting the number of projects that resulted in publication as their proxy for knowledge production. Out of the 849 projects in the Deutsche Bundesbank’s RDC database between 2015 and 2021, 528 are eligible to produce outcomes. Of these, 164 (31.1%) led to publication and 364 (68.9%) did not. With regard to capacity building, Blaschke and Hirsch (2023) identify master’s or PhD students from their applications and reveal that 127 (24.0%) projects led to a completed degree.
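As a plain arithmetic check, the reported shares follow directly from the project counts quoted above; only the eligible-project total is used as the denominator:

```python
# Reproduce the reported outcome shares from the project counts quoted
# above (Blaschke & Hirsch, 2023): Deutsche Bundesbank RDC, 2015-2021.
eligible = 528        # projects eligible to produce outcomes
published = 164       # projects that led to publication
not_published = eligible - published   # 364 projects without publication
degrees = 127         # projects that led to a completed degree

publication_rate = published / eligible        # ~31.1%
no_publication_rate = not_published / eligible  # ~68.9%
degree_rate = degrees / eligible  # ~24.05%, reported as 24.0% in the paper

print(f"Publication: {publication_rate:.1%}, none: {no_publication_rate:.1%}")
```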
In this article, we present the BUBMIC model, which is intended to help inform all stakeholders’ decisions regarding whether and how to make confidential data available for research. The model describes the costs and benefits throughout the life cycles of research projects that use confidential data. It is important to capture the full life cycle, as the costs and benefits are not distributed equally among all stakeholders or across all phases of the life cycle.
This article leaves several interesting areas for future research. First, researchers need to find ways to quantify the contribution of data to a research outcome. For example, it could be argued that unique features of the data, such as higher frequency or granularity, lead to publications in more highly ranked journals. In connection with this, it would be interesting to develop metrics for measuring the impact of publications on policy-making to help quantify the societal impact of data.
Second, while researchers are expected to contribute their efforts toward producing societal value, they sometimes lack the knowledge to extract the full value of their research, for example, through policy debates. Part of this can be attributed to the absence of established measures that they can target. It would therefore be helpful to establish new frameworks to help capture the full value of research outcomes. This is all the more important because this societal value is often part of the justification for providing researchers with access to data at all.
This article does not necessarily represent the views of the Deutsche Bundesbank or the Eurosystem. The paper was completed while Jannick Blaschke was at the Deutsche Bundesbank. The authors would like to thank two anonymous referees for their comments that greatly improved the paper.
Stefan Bender, Jannick Blaschke, and Christian Hirsch have no financial or non-financial disclosures to share for this article.
Bakker, C. (2013). Valuing the census. Statistics New Zealand. https://www.stats.govt.nz/assets/Research/Valuing-the-Census/valuing-the-census.pdf
Banzi, R., Moja, L., Pistotti, V., Facchini, A., & Liberati, A. (2011). Conceptual frameworks and empirical approaches used to assess the impact of health research: An overview of reviews. Health Research Policy and Systems, 9(1), Article 26. https://doi.org/10.1186/1478-4505-9-26
Bender, S., Blaschke, J., & Hirsch, C. (2023). Statistical data production in a digitized age: The need to establish successful workflows for micro data access. In G. Snijkers, M. Bavdaž, S. Bender, J. Jones, S. MacFeely, J. W. Sakshaug, K. J. Thompson, & A. van Delden (Eds.), Advances in business statistics, methods and data collection (pp. 519–536). Wiley. https://doi.org/10.1002/9781119672333.ch22
Blaschke, J., & Hirsch, C. (2023). On the value of data sharing – Empirical evidence from the Research Data and Service Centre (Technical report 2023-08). Deutsche Bundesbank. https://www.bundesbank.de/resource/blob/863758/7ebe74476186cd3364a11b3869ada80a/mL/2023-08-value-data.pdf
Bornmann, L. (2013). What is the societal impact for research and how can it be assessed? A literature survey. Journal of the American Society for Information Science and Technology, 64(2), 217–233. https://doi.org/10.1002/asi.22803
Buxton, M., & Hanney, S. (1996). How can payback from health services research be assessed? Journal of Health Services Research & Policy, 1(1), 35–43.
Bührer, S., Feidenheimer, A., Walz, R., Lindner, R., Beckert, B., & Wallwaey, E. (2022). Concepts and methods to measure societal impacts – An overview. Fraunhofer ISI Discussion Papers Innovation Systems and Policy Analysis No. 74.
Desai, T., Ritchie, F., & Welpton, R. (2016). Five Safes: Designing data access for research. Economics Working Paper Series, 1601. University of the West of England, Bristol. https://www2.uwe.ac.uk/faculties/BBS/Documents/1601.pdf
Green, E., & Ritchie, F. (2023a). The present and future of the Five Safes framework. Journal of Privacy and Confidentiality, 13(2). https://doi.org/10.29012/jpc.831
Green, E., & Ritchie, F. (2023b). Using pedagogical and psychological insights to train analysts using confidential data. Journal of Privacy and Confidentiality, 13(2). https://doi.org/10.29012/jpc.842
Griffiths, E., Greci, C., Kotrotsios, Y., Parker, S., Scott, J., Welpton, R., & Woods, C. (2019). Handbook on statistical disclosure control for outputs. Safe Data Access Professionals Working Group. https://ukdataservice.ac.uk/thf_datareport_aw_web/
Jang, J. B., Pienta, A., Levenstein, M., & Saul, J. (2023). Restricted data management: The current practice and the future. Journal of Privacy and Confidentiality, 13(2). https://doi.org/10.29012/jpc.844
Johnston, S. C., Rootenberg, J. D., Katrak, S., Smith, W. S., & Elkins, J. S. (2006). Effect of a US National Institutes of Health programme of clinical trials on public health and costs. The Lancet, 367(9519), 1319–1327. https://doi.org/10.1016/s0140-6736(06)68578-4
Lane, J. (2009). Assessing the impact of science funding. Science, 324(5932), 1273–1275. https://doi.org/10.1126/science.1175335
Nagaraj, A. (2022). The private impact of public data: Landsat satellite maps increased gold discoveries and encouraged entry. Management Science, 68(1). https://doi.org/10.1287/mnsc.2020.3878
Research Data and Service Centre (2021). Rules for visiting researchers at the RDSC. Deutsche Bundesbank. https://www.bundesbank.de/resource/blob/826176/ffc6337a19ea27359b06f2a8abe0ca7d/mL/2021-02-gastforschung-data.pdf
Rollins, L., Llewellyn, N., Ngaiza, M., Nehl, E., Carter, D. R., & Sands J. M. (2020). Using the payback framework to evaluate the outcomes of pilot projects supported by the Georgia Clinical and Translational Science Alliance. Journal of Clinical and Translational Science, 5(48), Article e48. https://doi.org/10.1017/cts.2020.542
Ritchie, F. (2017). The ‘Five Safes’: A framework for planning, designing and evaluating data access solutions. Paper presented at Data for Policy 2017, London, UK.
Schönberg, T. (2019). Data access to micro data of the Deutsche Bundesbank. Deutsche Bundesbank. https://www.bundesbank.de/resource/blob/801044/6484d9e4aa2be3610b7378a48a1916de/mL/2019-02-data-access-data.pdf
Sørensen, O. H., Bjørner, J., Holtermann, A., Dyreborg, J., Birkelund Sørli, J., Kristiansen, J., & Bohni Nielsen, S. (2021). Measuring societal impact of research – Developing and validating an impact instrument for occupational health and safety. Research Evaluation, 31(1), 118–131. https://doi.org/10.1093/reseval/rvab036
Swidler, S., & Goldreyer, E. (2002). The value of a finance journal publication. The Journal of Finance, 53(1), 351–363. https://doi.org/10.1111/0022-1082.135230
©2024 Stefan Bender, Jannick Blaschke, and Christian Hirsch. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.