Replicability is the ability to obtain consistent results across studies that use different data sets to address the same hypothesis. This allows us to determine whether a certain effect, e.g. a psychological phenomenon, truly exists in the real world.
Researchers have been studying the replicability of past psychology studies, and their findings have not been encouraging. A 2015 review of 100 experiments concluded that, while 97% of the studies reported finding an effect, this was the case in only 36% of the replications. Likewise, a 2018 paper that attempted to replicate 21 studies published in the prestigious journals Nature and Science reported that only 13 were successfully replicated.
For many effects the scientific community once accepted as fact, researchers have been unable to replicate the previously published results. There are a multitude of possible explanations for the replication crisis, but some of the most likely culprits are publication bias, underpowered studies, p-hacking and hypothesizing after the results are known (HARKing). These have been dubbed the four horsemen of the reproducibility apocalypse.
- Publication bias is the tendency for positive results to be more likely to be published, to be published earlier, and to appear in more prestigious journals. For example, a paper reporting that drinking green tea extends one’s life expectancy by two years would be more likely to be published in an academic journal than one reporting that green tea offers no health benefits.
- The statistical power of a study refers to the likelihood of detecting an effect when one truly exists. For example, a power of 0.9 means that if I were to run a study many times searching for an effect that truly exists in the world, 90% of the time the results would reflect this; the other 10% of the time I would miss it. At a power of 0.6, I would have only a 60% chance of detecting the effect and a 40% chance of missing it (i.e. obtaining a false negative). An underpowered study therefore cannot reliably provide information about a real phenomenon (see the short simulation after this list).
- P-hacking is the practice of collecting or selecting data or analyses that yield the most favorable results. In other words, it involves the conscious or unconscious manipulation of data to reflect a desired outcome. This might take the form of peeking at the data throughout the data collection process and choosing to terminate the process as soon as the results appeal to the researcher.
- Hypothesizing after the results are known means that an explanation is provided for a given finding under the pretense that this was the assumption that was being tested all along. For example, let’s say I hypothesize that people find sentences that follow an A-B-B-A pattern (e.g. do what you love, love what you do) more aesthetically pleasing than sentences that do not (e.g. do what you cherish, enjoy what you do) because sentences that follow an A-B-B-A pattern are easier to parse and we mistake ease of reading for evidence of beauty. But my results show that, while A-B-B-A statements are indeed judged to be more aesthetically pleasing, they are actually more difficult to parse. It would be HARKing if I wrote in the manuscript that I hypothesized that people are more likely to find sentences beautiful when they have difficulty parsing them and suggested that perhaps this is because humans are more readily enamored of things that don’t come easily.
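To make the power figures above concrete, here is a minimal simulation sketch in Python (an illustration, not a prescription). It assumes a true effect of Cohen's d = 0.5 and a two-sample t-test at α = .05; the sample sizes of 40 and 86 per group are illustrative choices that land at roughly 60% and 90% power for an effect of that size.

```python
# A minimal power simulation: a real effect exists (Cohen's d = 0.5),
# and we count how often a two-sample t-test at alpha = .05 detects it.
# The sample sizes below are illustrative, not recommendations.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ALPHA, EFFECT_D, N_SIMS = 0.05, 0.5, 5_000

def detection_rate(n_per_group: int) -> float:
    """Fraction of simulated studies that detect the (real) effect."""
    hits = 0
    for _ in range(N_SIMS):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(EFFECT_D, 1.0, n_per_group)  # the effect is real
        _, p = stats.ttest_ind(treatment, control)
        hits += p < ALPHA
    return hits / N_SIMS

for n in (40, 86):
    power = detection_rate(n)
    print(f"n = {n} per group: power ~ {power:.2f}, "
          f"false-negative (miss) rate ~ {1 - power:.2f}")
```

Even though the effect is genuinely there, the smaller sample misses it in roughly four out of ten simulated studies.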
In their attempts to resolve the replication crisis, spurred on by the publicity it has received, researchers have discovered a number of practices they can implement to mitigate the problems outlined above.
One recommendation is the use of registered reports, a practice that has been adopted by over 250 academic journals to date. When a scientific article is submitted to an academic journal, it is first reviewed by experts in the field. Peer review is meant to ensure that the article is of sound quality.
In registered reports, there are two stages of peer review. Stage 1 peer review is undertaken prior to data collection. When results are unavailable for scrutiny, evaluation of the research quality is less likely to be swayed by pre-existing attitudes, allowing reviewers to focus on evaluating the methodology of the research rather than the impact of the results. At this stage, reviewers can provide valuable feedback on any methodological shortcomings, which the authors can implement prior to launching the study. If a proposal is accepted at this stage, the researchers must conduct the study in the manner agreed upon with the reviewers. After the data has been collected and analyzed, they submit a completed manuscript which undergoes a second round of reviews. During Stage 2, reviewers take into consideration the completed paper, including the results and interpretation of the findings. If the researchers have abided by the Stage 1 protocol and have offered a sound interpretation of findings, the paper is then published.
How Do Registered Reports Help Combat Psychology’s Replication Crisis?
- First, scientists are only human. Whether intentionally or unintentionally, they can gravitate toward recommending the publication of results with profound implications over those that are less striking. Registered reports minimize this bias because acceptance at Stage 1 guarantees publication regardless of whether the findings are groundbreaking or consistent with the pre-specified hypotheses, provided the agreed protocol is followed.
- Second, registered reports tackle inadequately powered studies, an issue that is prevalent in psychological research. Low statistical power means that real effects are often missed and, across a body of literature, it lowers the likelihood that a given positive finding reflects a true effect. Stage 1 reviewers can spot this problem in the study proposal and recommend an appropriate power analysis (i.e. an analysis to determine the minimum number of participants needed to address the research question) if one has not yet been done; a rough sketch of such an analysis follows this list.
- Third, registered reports minimize p-hacking. Simmons and colleagues (2011) report that “flexibility in data collection, analysis, and reporting” dramatically increases the chances of detecting a non-existent effect (i.e. obtaining a false positive), and it is all too easy to convince readers of results that have been statistically manipulated in this way. Registered reports limit such questionable (or “flexible”) practices, as the analysis plan is fixed during Stage 1; a short simulation after this list shows how just one of these practices, stopping data collection as soon as p dips below .05, can inflate the false-positive rate.
- Lastly, registered reports limit HARKing, since their hypotheses are outlined and evaluated at Stage 1. Eliminating the practice of HARKing benefits scientific inquiry, as HARKed papers present an inaccurate model of science in which researchers always correctly predict their results. In reality, scientific research does not progress so smoothly and researchers often experience both failures and successes over the course of their projects. In addition, HARKing can reduce scientific productivity if the failed hypotheses are suppressed or omitted in the manuscript. If this information is unavailable to the scientific community, future researchers are likely to traverse the same paths that ultimately lead to a dead end.
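To give a sense of what such a power analysis involves, here is a brief sketch using the statsmodels library; the assumed effect size (Cohen's d = 0.5) and target power (0.9) are illustrative values rather than recommendations.

```python
# A rough sketch of the kind of power analysis a Stage 1 reviewer might
# request before data collection. The assumed effect size (Cohen's d = 0.5)
# and target power (0.9) are illustrative values, not figures from the essay.
import math
from statsmodels.stats.power import TTestIndPower

n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                          power=0.9, alternative='two-sided')
print(f"Minimum sample size: {math.ceil(n_per_group)} participants per group")
```

For the values assumed here, the answer comes out to roughly 86 participants per group, which reviewers can weigh against what the authors actually propose to collect.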
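And here is a minimal sketch of the kind of “flexible” data collection Simmons and colleagues warn about, assuming there is no true effect at all: both groups are drawn from the same population, yet a researcher who peeks at the data after every batch of participants and stops at the first p < .05 ends up declaring “significant” results well above the nominal 5% of the time. The batch size and maximum sample size are arbitrary illustrative choices.

```python
# Optional stopping with no true effect: both groups come from the same
# population, but the "peeking" researcher tests after every batch and
# stops as soon as p < .05. Batch size and maximum N are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ALPHA, BATCH, MAX_N, N_SIMS = 0.05, 10, 100, 2_000

def reports_an_effect(peek: bool) -> bool:
    """Return True if the simulated study ends up 'finding' an effect."""
    a, b = np.empty(0), np.empty(0)
    while a.size < MAX_N:
        a = np.append(a, rng.normal(0.0, 1.0, BATCH))
        b = np.append(b, rng.normal(0.0, 1.0, BATCH))  # same population: no real effect
        _, p = stats.ttest_ind(a, b)
        if peek and p < ALPHA:
            return True          # stop early and declare success
    return p < ALPHA             # otherwise only the final, full-sample test counts

for peek in (False, True):
    rate = sum(reports_an_effect(peek) for _ in range(N_SIMS)) / N_SIMS
    label = "peek after every batch" if peek else "single test at the end"
    print(f"{label}: false-positive rate ~ {rate:.2f}")
```

Because a registered report fixes the analysis plan, including the stopping rule, at Stage 1, this particular route to a false positive is closed off.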
A recent paper reports that only 58% of results from registered reports were successfully replicated. While this is certainly more promising than what we observe for conventional articles that do not follow the two-stage format, there appears to be room for improvement.
What Else Can Researchers Do?
Open scientific practices such as posting anonymized data, full study materials and analyses in an open repository would contribute to a shift toward a more transparent and credible science. Not all authors currently share their full collection of study materials and instructions, yet anyone who wants to conduct a direct replication of a given study needs access to exactly this information. Moreover, posting data and analyses online demonstrates that the authors have nothing to hide and allows others to verify their work.
Preregistration of studies can also minimize questionable research practices. As with registered reports, preregistration entails outlining the hypotheses, methodology and analysis plan of a study prior to the start of data collection. This document is then shared on a website such as Open Science Framework. Sharing this information publicly inspires greater accountability among researchers.
Preregistration can seem intimidating to young researchers like myself because of concerns about what will happen if the results don’t turn out as hoped. But this is precisely why preregistration is so important: it eliminates the temptation to engage in the dishonest practices outlined above and encourages scientists to incorporate a thorough, written justification of any deviations from the pre-established plan into their manuscripts.
In the very first research project I was involved in, four studies with different participant pools were conducted in an attempt to address the same research question. Three confirmed our hypotheses, but one did not. I asked my undergraduate advisor if we could just ignore that one study. Given the oversaturation of positive results in the psychological literature, I had the warped idea that there must have been something wrong with it and therefore it wasn’t worth writing about. My advisor reassured me that this is how science works and that all our results should be shared so our readers would have the necessary information to decide for themselves what to make of the project. This is the attitude I have adopted ever since. While it was tempting to pretend the failed replication never happened, it would have been deceitful. Picking and choosing what to report in order to fit a desired narrative does not contribute to the advancement of psychological inquiry. Such practices have fueled the replication crisis that is now casting doubt on the credibility of our entire field.
Every researcher has a responsibility to uphold the integrity of science with honesty and transparency. This can be achieved through registered reports or preregistration, or simply by posting anonymized data, analyses and study materials in open repositories. If we take these steps, I believe we can begin to resolve the replication crisis over the coming years and restore the trust in the psychological sciences that many have understandably lost.