
The Case for Data Archives at Journals

Published on Jul 27, 2023


I have been on the editorial boards of many different journals for more than 10 years and attempting to publish in journals for much longer. That experience has made me question the editorial process and consider how to improve academic publishing. In July of 2021, I became editor of Economic Inquiry (EI) and was in a position to implement new journal policies. One of the first policies that I began working on was a data availability policy. I, of course, borrowed liberally from other journals as there were many good models out there to borrow from. When the policy was finalized, we chose to fund a repository on openICPSR for both journals operated by the Western Economic Association International (Contemporary Economic Policy being the other journal) and establish a policy that requires all papers published by EI that include data to publish a data archive on that or a suitable alternative site. I had many discussions along the way to arrive at that policy and here I will explain some of the considerations that helped me to make the final choice.

Why Should Journals Require Data Archive Packages?

The case in favor of a policy to require data archives is quite straightforward. The movement toward reproducible science is necessary not just to maintain the credibility of individual research papers but to maintain the credibility of all academic research. There have been many examples of fraudulent work being published in academic journals over the years. In one high-profile case, Michael LaCour published a paper in Science in 2014, LaCour and Green (2014), that claimed to show that contact with a homosexual individual improved a person’s support for gay marriage proposals. This was a blockbuster finding, highly publicized outside of academia. As a result, the eventual finding that it was fraudulent was also highly public and highly embarrassing (Oransky, 2015). There are others who have engaged in prolific fraud, such as Diederik Stapel, who published many different studies in high-quality journals, all on the basis of faked data (Enserink, 2012; Levelt et al., 2012). Of course, not all fraudulent research involves faked data. Brian Wansink, the former head of a large research center at Cornell University, was forced to retract many problematic articles (Kincaid, 2022). In his case, the data existed, but he engaged in methods to achieve his results that involved “misreporting of research data, problematic statistical techniques, failure to properly document and preserve research results, and inappropriate authorship,” according to Michael Kotlikoff (2018), Cornell’s provost at the time. Many of the results from these papers had also been picked up in the popular press, and so the findings of research misconduct here were again quite public. Many more examples of these problems can be found on Retraction Watch, and indeed, the fact that such a website exists is a testament to the fact that far too much problematic research somehow makes its way to the pages of scientific journals.

We clearly need to do better, and requiring more transparency in empirical work at journals is a good start. Facing requirements to provide all of the underlying data, explicit details on methods for data collection and code behind a study will undoubtedly deter many cases of fraud. This is because being required to produce the data and make it visible to others would often unmask the underlying fraud quickly and easily. For those who are not deterred, unmasking the fraud is much easier when the data is available for review than when it is not. Further, not only should these requirements reduce these egregious cases of fraud, which thankfully are not that widespread, but they will force all authors to think very carefully through their empirical processes knowing that they will be publicly viewable. This increased scrutiny should improve the quality of all research published in our journals.

Of course, one should not expect that data archives will prevent all cases of faked data or inappropriate research methods. Very clever individuals intent on engaging in fraud might still be able to create very intricate sets of faked data that are difficult to identify as fake. Some might also still engage in p-hacking or other similar practices to try to generate extra stars in their result tables. The more data we have about what people have done, the easier it is to check and verify both issues. This does not imply that data archives are a silver bullet for fraud prevention. They are more similar to a lock you put on the door to your house. It is there to prevent robbery of convenience or opportunity. Most door locks are barely an impediment to criminals who are intent on getting inside. Despite not being 100% capable of preventing break-ins, door locks are still a good idea as part of a broader attempt to deter criminals. The same is true of data packages as a fraud prevention device. Even with them, there is still the onus on the research community to carefully evaluate all work and try to identify suspicious studies. I would also argue that as scientists we likely need to do better about not overweighting the information content from a single study that achieves very large and unexpected results. By celebrating such studies, we give substantial incentives to people to create such results, by whatever means they can. We might also end up popularizing a result as real and important when it is actually due either to problematic research methods or to an outlier data sample. Instead, we should see each new study as one piece of information on an issue and not place much weight on a novel finding until multiple teams of researchers can replicate it and can clearly identify the reason for the finding. This approach might actually be even more effective than data policies at decreasing fraud as it would substantially reduce the incentives to produce shocking results. As a journal editor, I cannot enforce others taking that view, but I can require data archives.

Data archives also allow for science to advance more rapidly. In many cases, one research group may wish to build upon work already published in a journal. Reproducing the original work is often an important first step. This step can be difficult absent a data archive of the original paper because without all of the details regarding what the original researchers did, a would-be replicator may have a difficult job reproducing the original work. In one case at my own journal, a paper was submitted that was attempting to build off a previous paper published at the journal. The new paper’s goal was to improve on the estimation process of the previous one. Unfortunately, the new researchers could not reproduce the original results. Their ‘replication’ estimation generated a result both quantitatively and qualitatively different from the original. This makes it difficult to evaluate whether the new approach improved on the prior one or not. That is a problem for the researchers behind the initial finding because it is harder for others to build on it, and it is certainly frustrating for the later researchers who cannot replicate the prior work. Having data archives accompanying published papers can resolve this problem quickly as researchers who wish to build off of the work of others can see exactly what they did to get those results without wasting a substantial amount of time guessing and potentially failing to identify exactly what the previous authors did.

A great example of the reasons that replicating the work of others is often difficult is contained in Huntington-Klein et al. (2021). This paper asked several teams of researchers to take the same raw data as two published papers and try to answer the same research question posed in those papers. The new researchers had to take the initial data, make all of the choices empirical researchers have to make about processing that data, and specify a final regression to examine the issue. The new researchers found that the original results often did not replicate. In some cases, the replication studies found a different sign on the key effect in question, while in others, the magnitude and standard error of the effect were quite different. Importantly, in all cases, the final number of data points considered differed between the original studies and all replications despite all studies starting with the same raw data. The discrepancy in the final results may have been due to the fact that different research teams often made very different choices along the way to the final specification regarding how to treat ‘problematic’ data points. Thus, to really know how a team of researchers arrived at a set of results, one needs to know not just the nature of the regression conducted but all the small steps along the way to get there from the raw data. Without this detailed level of information, it can be impossible to really understand how two different studies arrived at different outcomes.
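To make the mechanism concrete, here is a minimal sketch of how processing choices alone can change both the sample size and the estimate. It is not taken from Huntington-Klein et al. (2021): the data, the three 'teams,' and their cleaning rules are all hypothetical, and the estimate is a simple one-variable least squares slope.

```python
# Hypothetical raw data shared by all teams: (x, y) pairs.
# The last two rows play the role of 'problematic' data points.
RAW = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 2.0), (7, 30.0)]


def ols_slope(rows):
    """Slope from a one-variable least squares regression of y on x."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    return sxy / sxx


def team_a(rows):
    """Hypothetical rule: keep every observation."""
    return list(rows)


def team_b(rows):
    """Hypothetical rule: drop y above 20 as a likely data-entry error."""
    return [(x, y) for x, y in rows if y <= 20]


def team_c(rows):
    """Hypothetical rule: drop points far from the expected y = 2x pattern."""
    return [(x, y) for x, y in rows if abs(y - 2 * x) <= 1]


# Same raw data, same regression, different cleaning rules.
results = {}
for name, rule in [("team_a", team_a), ("team_b", team_b), ("team_c", team_c)]:
    sample = rule(RAW)
    results[name] = {"n": len(sample), "slope": ols_slope(sample)}

if __name__ == "__main__":
    for name, res in results.items():
        print(f"{name}: n={res['n']}, slope={res['slope']:.2f}")
```

Each rule is defensible on its own, yet the three teams end up with different numbers of observations and noticeably different slopes. Only the cleaning functions reveal why, which is the kind of detail a data archive preserves.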

It is important at this point to distinguish between two very different, though related, goals of the data availability policies of journals and how data archives may be vetted by journals. The most commonly discussed check that journals may wish to perform about a data archive is simply whether one can use the archive to reproduce the results in the paper. There are services that one can use, such as cascad, which will provide a certification that the code provided by the researchers can produce the results in the paper. Some journals and grant review agencies are considering using services like this to certify the results of a paper and have that serve as the data verification check. This verification is valuable, but it is of limited use on its own. Just knowing that the results can potentially be reproduced is not as valuable as having the full details on how the results were produced. Of course, data editor time is highly valuable and it can be difficult to fully vet that the archive contains all the processing details and to also verify that all the calculations in the code produce what is in the paper. When designing data availability policies, a journal editor needs to be aware of both issues and then allocate resources to prioritize their goal for their data policy. At EI, I can say that we prioritize evaluating whether the archives provide as much detail as possible on how the authors go from raw to analyzed data, though our data editor will also run the code to verify calculations when possible.

Why Should Journals Not Require Replication Packages?

While I find the arguments above convincing, it is worth examining the arguments against these policies to determine how strong they are. The first concern of many is that these archives would allow others to copy the work of the published authors. The authors may have spent a great deal of time finding the data involved, merging and cleaning multiple data sets, and otherwise engaging in a great deal of work to put their data together. It may have also taken a great deal of time to implement the empirical methodology for the model in the paper. Many researchers may wish to keep that work for themselves so that they may continue to exclusively exploit that work in future publications. At face value, this argument seems compelling. While I had my own response to this, the most convincing way to frame the counterargument came from Guido Imbens in our roundtable discussion of journal editors on data policies hosted by the Conference on Reproducibility and Replicability in Economics and the Social Sciences (CRRESS; Labor Dynamics Institute, 2022). He argued that allowing empirical researchers to hide their methods like this is similar to allowing theorists to publish theorems while keeping the proofs hidden. A theorist could mount the same argument that the proof may have taken a long time to work out, requiring the development of special techniques in the process, and they may wish to be the only ones exploiting their methods in future work. We do not, however, allow theorists to avoid providing proofs. We do not simply trust them blindly. Yet empiricists are asking journals to blindly trust them by saying we should accept their results without seeing what actually generated them. I would further argue that, while yes, making your methods and data transparent may allow others to ‘copy’ your work, the proper way to see that is that it allows others to build off of your work. Your work can now form the foundation of the work of others and have greater impact. I would argue that the possibility that it allows others to learn more from your work is in fact one of the main reasons why journals should be requiring these packages. It is not a downside.

Another common concern is that these requirements place an undue burden on authors. One reason to be concerned about that burden from a journal’s perspective is that these requirements may decrease submissions to the journal as authors seek to avoid submitting to journals that impose them. This is a valid concern. At EI, after putting our requirements into place in December of 2021, our submissions dropped off by about 15%. While it is an issue that should be considered, there are reasons not to let this issue dissuade a journal from implementing the policies. Eventually one should expect this effect to go away as more and more journals adopt similar policies. When most of the peers of a journal have adopted similar guidelines, authors will no longer have the ability to evade the requirements. Further, the last journals in a peer group to adopt these policies should expect to receive submissions by any authors who do not wish to make their research methods transparent. As an editor, I do not want to receive those submissions and so I am willing to accept this trade-off. Other editors might make a different judgment call here based on the nature of the requirements at their peer journals.

A different reason to be concerned about this is that increasing the burden for a publication might work against junior researchers who do not have funded research assistants to assemble data archives for them. This means junior researchers have to spend even more time on each publication they do get, limiting their ability to publish more papers and disadvantaging them relative to their senior colleagues. One could therefore see these data requirements as yet one more element of our publication process that favors established researchers who have the resources to comply more easily over their junior counterparts.

This argument is related to the fundamental claim that these data archives do in fact place a significant burden on authors. I am not convinced that they have to. A data archive is a burden on authors only if the authors wait until the end of the publication process to think about the reproducibility of their work. If authors have engaged in their work in a haphazard way prior to acceptance, then it can indeed be a substantial burden to go back and document all of the data manipulation that was done and script all of the analysis. If, however, authors think about these issues when they begin their research, there is no real burden and I would argue that planning your work to be replicable will save time for researchers and improve quality. In my own work, early in my career I did much of my data work by hand. When I would get referee comments suggesting new analysis, I would have to engage in forensic econometrics to first back out what I had done before. This was wasted time. I now fully script all my analysis, and when updating a paper, it goes more quickly because I have a full record of all choices in any analysis. Given that these scripts exist while producing the paper, complying with constructing a data archive takes very little time. As authors begin to take replicability into account when doing their research, the burden of constructing a data archive diminishes. We can push this process along by making sure that replicable research is brought into PhD training programs.
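As a rough illustration of what 'fully scripted' can mean in practice, the sketch below runs every processing step as a named function and records how many observations survive each one. This is only one possible arrangement, not a description of my own scripts or of any EI requirement; the records, field names, and steps are hypothetical.

```python
import json

# Hypothetical raw records; in practice these would be read from a raw data file.
RAW = [
    {"id": 1, "wage": 12.5, "hours": 40},
    {"id": 2, "wage": None, "hours": 35},
    {"id": 3, "wage": 15.0, "hours": 0},
    {"id": 4, "wage": 11.0, "hours": 20},
]


def drop_missing_wage(rows):
    """Remove records with no reported wage."""
    return [r for r in rows if r["wage"] is not None]


def drop_zero_hours(rows):
    """Remove records reporting zero hours worked."""
    return [r for r in rows if r["hours"] > 0]


def add_weekly_earnings(rows):
    """Construct the analysis variable: wage times hours."""
    return [{**r, "weekly": r["wage"] * r["hours"]} for r in rows]


# The ordered list of steps is itself the record of every processing choice.
STEPS = [
    ("drop_missing_wage", drop_missing_wage),
    ("drop_zero_hours", drop_zero_hours),
    ("add_weekly_earnings", add_weekly_earnings),
]


def run_pipeline(rows, steps):
    """Apply each named step in order, logging the row count after each one."""
    log = [{"step": "raw", "rows": len(rows)}]
    for name, func in steps:
        rows = func(rows)
        log.append({"step": name, "rows": len(rows)})
    return rows, log


clean, log = run_pipeline(RAW, STEPS)

if __name__ == "__main__":
    # The log can be saved alongside the code as part of the archive.
    print(json.dumps(log, indent=2))
```

Because the raw-to-analysis path lives in one rerunnable script, responding to a referee means editing a step and running it again, and the same script and log can be deposited in the archive with little extra work.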

An important issue related to these concerns is how to handle cases of proprietary or confidential data that the authors may not be allowed to provide. Should journals refuse to publish these papers that we cannot verify? Should journals abandon data policies due to this issue? Should there be some sort of carve-out in data policies for proprietary data? The argument for abandoning data policies due to the inability to place all papers under the same scrutiny should not be immediately tossed aside. There is an equity concern here that is nontrivial. Alternatively, if journals give an exemption to authors who use such data, then this might push people toward using secretive data in an attempt to avoid compliance. This incentive is not great for scientific transparency. There is also the downside of the wasted energy searching for proprietary data and the possibility that research questions may be limited to those that can be addressed with such data. Refusing to publish these papers solves that problem, but there are some legitimate research questions that are best answered with nonpublic data. This is a difficult issue and one that should be reevaluated over time to make certain that any policies adopted are not having adverse effects. My own view is that some sort of carve-out is the best compromise, where papers with legitimately confidential data are not required to publish it but are still allowed to be published. Those authors should still be required to provide a full description of where the data came from, how they got it, and how they processed it so that future researchers could potentially follow that same path. Also, there are ways that such data can be temporarily supplied to a data editor for verification without being made public. This private verification step, combined with the requirement to provide the full processing information, leads to similar requirements for studies with and without confidential data, which allows oversight of both while hopefully not giving authors an incentive to inefficiently seek out research projects with confidential data.

A final notion that some suggested to me is that there is no need for journals to require data archives. Individuals who want to provide their data can do so on their own sites, and if there are professional incentives to do so in this form as a signal of high quality, everyone will do this anyway. Perhaps, but then presumably journal editors would take that into account in editorial decisions. That would just lead to a backdoor way of requiring data archives, but in an ineffective manner. After the paper is published, authors could quickly pull that archive. There would also be no systematic structures for these archives, making them less useful to other researchers. In order to ensure that the data remains available and is provided in a usable format, it is best that journals maintain the archives and require those archives to meet common standards.

Final Thoughts

I believe strongly in the need for transparency in research. In order to preserve and maintain the integrity of all academic research, we need to push for ever-greater transparency in how research is done. Increased legitimacy is a benefit to us all. The main ‘cost’ (if one sees it that way) would be that greater transparency limits the ability of people to publish ill-founded results. While greater transparency may place greater requirements on researchers to engage in more careful and rigorous work, that is a clear benefit rather than a cost.

Of course, the path to this greater transparency norm will not be direct and not all journals will adopt the same standards at the same time. There are some journals leading in this direction, some following, and others lagging behind. There are good reasons for different journals to be in each of those stages. As journals collectively move along this path, it is important to understand that there are reasons different journals might have different policies, but journals should also keep in mind that we should not make these policies unduly burdensome on authors by having idiosyncratic requirements. When creating data policies, I believe it is best practice to attempt to harmonize data policies across journals as much as possible so that a data archive suitable for one journal should also be suitable for another. Such an approach helps to resolve some of the concerns about the burdensome nature of the archives and also allows journals to learn from the practices of others to achieve best results.


The author thanks Lars Vilhuber for the opportunity to discuss these issues in a CRRESS forum.

Disclosure Statement

This article is an expanded version of remarks given in the Conference on Reproducibility and Replicability in Economics and the Social Sciences (CRRESS) webinar series, which is funded by National Science Foundation Grant #2217493.


Enserink, M. (2012, November 28). Final report: Stapel affair points to bigger problems in social psychology. ScienceInsider.

Huntington-Klein, N., Arenas, A., Beam, E., Bertoni, M., Bloem, J. R., Burli, P., Chen, N., Grieco, P., Ekpe, G., Pugatch, T., Saavedra, M., & Stopnitzky, Y. (2021). The influence of hidden researcher decisions in applied microeconomics. Economic Inquiry, 59(3), 944–960.

Kincaid, E. (2022, May 31). Cornell food marketing researcher who retired after misconduct finding is publishing again. Retraction Watch.

Kotlikoff, M. I. (2018, September 20). Statement of Cornell University Provost Michael I. Kotlikoff. Cornell University.

Labor Dynamics Institute. (2022, October 1). CRRESS Session1: Should journals verify reproducibility? (first cut) [Video]. YouTube.

LaCour, M. J., & Green, D. P. (2014). When contact changes minds: An experiment on transmission of support for gay equality. Science, 346(6215), 1366–1369.

Levelt, W. J., Drenth, P. J. D., & Noort, E. (2012). Flawed science: The fraudulent research practices of social psychologist Diederik Stapel. University of Groningen.

Oransky, I. (2015, May 20). Author retracts study of changing minds on same-sex marriage after colleague admits data were faked. Retraction Watch.

©2023 Timothy C. Salmon. This article is licensed under a Creative Commons Attribution (CC BY 4.0) International license, except where otherwise indicated with respect to particular material included in the article.
