We are forced to confront this appalling question by the publication in Science this past August of an article by the Open Science Collaboration, “Estimating the Reproducibility of Psychological Science”. The article reports the empirical results of attempts to replicate 100 studies published in 2008 in three leading psychology journals. The basic finding is that only 39 results out of 100 could be replicated. Moreover, only 47 of the original results were in the 95% confidence interval of the replication effect size. If this sample of 100 studies is representative of the best research in psychological science, then most elite psychological science is rubbish. (A brief and helpful discussion of the article was published at about the same time in Nature.)
The “Open Science Collaboration” was an ad hoc collection of 270 research psychologists assembled by social psychologist Brian Nosek of the University of Virginia through his Center for Open Science, “a non-profit technology company providing free and open services to increase inclusivity and transparency of research.” The three journals in question were Psychological Science, The Journal of Personality and Social Psychology, and the Journal of Experimental Psychology: Learning, Memory, and Cognition. Assignment of articles to replication teams was constrained both by chronological order of publication (to prevent cherry-picking or other bias in the selection of studies) and by the interest and expertise of the replication teams; researchers with the appropriate expertise to conduct particular studies were actively recruited. Significantly, the authors of the original studies were consulted: they critiqued the replication designs and, where possible, provided the original stimulus materials.
Thus, although some of the replication failures must no doubt have been due to inadequacies in the designs of the replication studies, it is unlikely that this accounts for the bulk of them, as some have suggested. Even if 15% of the 61 replication failures resulted from flawed designs, the remaining failures would still amount to over half of the original studies (0.85 × 61 ≈ 52). And even if fully half of the replication failures were thus flawed, the implication would still be that over 30% of what you read in an elite psychology journal is false. Feel any better?
Nor is it any good to claim, as some have, that replication failure is actually a good thing, because it teaches us all the fine specifications and nuances required to generate a given effect. That replication failure can lead to greater insight into the nature and causes of an effect is noted by the Open Science Collaboration and indeed by nearly all commentators on this topic that I have read in the past few days. The trouble is that that is not the sort of replication failure we are confronting.
To begin with, let us note that the replications attempted by the Open Science Collaboration were not so-called conceptual replications; i.e., attempts to generalize an effect to slightly different conditions or applications. For example, consider the well-known finding that self-control is a general but limited resource that can be depleted like a rubber band–driven propeller. Suppose that this finding was originally demonstrated by an experiment in which hungry participants were either allowed to eat from a tray of sweets or asked to resist the available sweets and eat radishes instead. Subsequently, the two groups were set to solve difficult—indeed, unsolvable—puzzles. Sweet eaters persisted at the puzzles significantly longer than radish eaters, which the experimenters explained as being due to the radish eaters having depleted their self-control earlier while resisting the sweets. Now, a conceptual replication of this original experiment would attempt to extend or generalize the finding by changing the conditions somewhat, say by asking participants to count backwards by threes while annoying loud music plays instead of resisting sweets, or to squeeze a stiff hand grip for as long as they can instead of persisting at puzzle-solving.
Such a replication, if successful, would demonstrate the operation of the putative psychological process in novel circumstances, and it would teach us something about the conditions governing the operation of that process. And if unsuccessful, we would begin to learn something about the limits to generalizability of the underlying process. But this is not the sort of replication that the Open Science Collaboration authors were attempting. Rather, they were attempting to obtain exact replications; i.e., as the name implies, replications in which the exact conditions of the original study are duplicated as closely as possible. In such conditions, the same result, if valid, should usually be observed.
Of course, a failure of exact replication still might only mean that the conditions determining the original result are insufficiently understood—that the original conditions had features whose importance for generating the effect wasn’t realized and therefore weren’t duplicated in the replication attempt. However, there are two problems with this suggestion. First, in many cases it would gut the result of much of its interest. In the self-control example, for instance, the discovery that self-control is a limited resource only for puzzle-solving, not for other forms of work, seriously limits its practical and even theoretical interest. Again, in clinical trials, failure to replicate the efficacy of some treatment may seriously restrict or even altogether destroy its value.
But this is assuming that the original effects can eventually be replicated somehow, only under less general conditions than originally supposed. The more serious worry is that this might not be so: that the original result is simply a mirage that eventually turns out not to be replicable at all. And indeed, this is precisely the phenomenon that has come to be observed, with increasing alarm and in a variety of fields, not just psychology, over the past ten to fifteen years. For example, in 2005 epidemiologist John Ioannidis published in the Journal of the American Medical Association a study examining the fate of all the original clinical medical research studies published between 1990 and 2003 that had garnered 1,000 or more literature citations. There were 45 such studies that had found a positive effect. Of these, 7 (16%) were contradicted by subsequent studies, 7 (16%) could be replicated only with a clinical effect half the size of the original or smaller, 20 (44%) were replicated, and 11 (24%) remained unchallenged (in many cases probably because they were too recent for follow-up studies to have been completed). Thus, even in medicine, and the most prestigious medicine at that, about a third of studies had serious replication problems (over 40% of those for which replication was attempted). The affected studies are important research. Among the major studies that were flatly contradicted by subsequent replication attempts were the “finding” that hormone replacement therapy reduces the risk of coronary artery disease in postmenopausal women and the “finding” that vitamin E reduces the risk of coronary artery disease and myocardial infarction. In every case, the replication studies used larger sample sizes and stricter controls than the original studies. Thus, the replication failures are not likely to be due to laxity in the replication attempts.
Another example, described by Jonah Lehrer in the best single discussion of the replication debacle I have found (ironically in a popular journal), “The Truth Wears Off” (The New Yorker, Dec. 13, 2010), concerns fluctuating asymmetry in biology. It is a fact that a high number of mutations in one’s personal genome tends to show up as bodily asymmetry, for example as different lengths of fingers on each hand. In 1991, Anders Møller, a Danish zoologist, found that female barn swallows strongly preferred to mate with males that had more symmetrical feathers. This was a spectacular result, since it seemed to show that female sexual attraction in barn swallows had evolved to use body symmetry as a proxy for high-quality genomes, and it stimulated a flurry of follow-up research. Over the next three years ten more studies were published, nine of which confirmed Møller’s finding in the case of barn swallows or extended it to other species, including humans. But then in 1994, of fourteen attempted replications or extensions of Møller’s result, only eight were successful. In 1995, only half of attempted replications were successful. In 1998, only one-third. Moreover, the effect size was shrinking even among successful replications. According to Lehrer’s account, “between 1992 and 1997, the average effect size shrank by eighty percent.”
One more example (also from Lehrer). In the 1990s a series of large clinical trials “showed” that a new class of antipsychotic drugs, including those marketed under the names Abilify, Seroquel, and Zyprexa, strongly outperformed existing drugs at controlling the symptoms of schizophrenia. These drugs accordingly were approved and became big sellers. However, by 2007 follow-up studies were showing effects dramatically smaller than those in the original studies of the previous decade, and matters have now reached the point where many researchers claim that the newer drugs are “no better than first-generation antipsychotics, which have been in use since the 1950s.”
Lehrer gives further examples, and see also the papers cited in Yong (2012), Spellman (2015), and Lindsay (2015). Thus, the replication problem is not really new and not restricted to psychology, though I have the impression it is best documented in psychology, biology, and medicine. What distinguishes the Open Science Collaboration article is not the worry that many or even most new research findings are false, but its presentation of direct experimental evidence to this effect. In short, the epidemic of replication failures appears to be a problem of disappearing findings, not the normal, healthy, “wonderfully twisty” path of scientific discovery. It is the radical diminution or disappearance altogether of findings that turn out to be largely illusory. Cognitive psychologist Jonathan Schooler, who was alarmed to discover this problem in his own research and honest enough to acknowledge it, calls it “the decline effect.” Notably, in so calling it he follows J. B. Rhine, the famous pioneer of research in parapsychology, who also was frustrated by the tendency of his own positive findings to disappear over time.
If it has taken a while for the replication problem to come to the attention of psychologists, one reason is the reluctance of journals to publish replications, especially failed replications. Journals look to publish exciting, new, positive findings. They have an aversion to old news, and they particularly do not welcome studies that throw cold water on hot new findings. Some details on this aspect of the problem are provided in Ed Yong’s Nature piece, “Replication Studies: Bad Copy.” Yong cites the difficulty Stéphane Doyen experienced in trying to publish a failed replication of John Bargh’s famous study of age-related priming. This was the study where Bargh asked participants to unscramble short sentences, some of which contained words related to aging and the elderly, like Florida, wrinkle, bald, gray, and retired. The important finding was that participants thus primed with age-related words walked more slowly down the hall to the elevator after they believed the experiment was over than participants who had not been so primed. (The participants, of course, had not been informed of the true purpose of the experiment. Nothing had been done to explicitly alert them to the question of aging.) This finding, amusingly called the “Florida effect,” has become a classic with 3800 citations according to Google Scholar. Jonathan Haidt and Daniel Kahneman, in their recent books, both take the finding for granted as a fact (Haidt, The Happiness Hypothesis, 2006: 14; Kahneman, Thinking, Fast and Slow, 2011: 53). But I have the impression that there has never been an exact replication (in the above sense) of the effect. According to Yong, Doyen’s failed replication was rejected by multiple journals and finally had to be published in PLoS ONE, a multidisciplinary, open-access journal that “accepts scientifically rigorous research, regardless of novelty. PLoS ONE’s broad scope provides a platform to publish primary research, including interdisciplinary and replication studies as well as negative results” (from the journal website). According to Yong, after Doyen’s paper was eventually thus published, it “drew an irate blog post from Bargh. Bargh described Doyen’s team as ‘inexpert researchers’ and later took issue with [Yong] for a blog post about the exchange.”
Now, ungracious reactions to unwelcome results are nothing new in the history of science, and the point is not to single out Bargh but to highlight just how tenuous may be the hard evidence that backs up even the most celebrated findings in a culture that discourages replication. If the Florida effect has never been exactly replicated but only conceptually replicated, and if the conceptual replications have never been exactly replicated either, and if over half of attempted exact replications fail, then how sure do we have a right to be that there is really any such thing as the Florida effect? I want to stress the importance of this question. The Florida effect is not just a psychological curiosity, an isolated finding to which we can take an easy come, easy go attitude. The underlying principle which the Florida effect is taken to illustrate—the breadth and power of associative memory to unconsciously influence our conscious thought processes and behavior—has become one of the architectonic principles of cognitive psychology in the past couple of decades. This is its role in both Haidt’s and Kahneman’s theories, for example. Clearly, it is critically important that the findings that support such principles be facts, not illusions.
It is time to ask: What explains the decline effect? How can it happen that so many carefully produced experimental findings evaporate? Our epidemiologist Ioannidis proposed an answer in a second, quite famous (3259 citations) paper also published in 2005, spectacularly titled, “Why Most Published Research Findings Are False.” The paper was published in a medical journal (PLoS Medicine), and Ioannidis seems to have genetic association studies very much in mind, but he does not qualify his claim that “it can be proven that most claimed research findings are false” by restricting it to any particular field or set of fields. This would be an amazing result, if it could really be proved, but I do not find Ioannidis’s argument, such as it is, very persuasive. He presents a set of statistical formulas, whose derivation he does not bother to present, and—much more importantly—whose assumptions he does not justify or even discuss. (The presentation of formal “results” ex cathedra, before which we are apparently supposed to prostrate ourselves like so many Medes before the Basileus, is an irritating feature of supposedly hard science journals. But I suppose it is déclassé to complain.) Nonetheless, the argument is interesting, which (besides the fact that it has apparently been influential) is the reason I take the trouble to comment on it.
The basic idea is not difficult. It can be put by saying that if the frequency of relations in the world to be discovered by experiment is sufficiently low, then even an experimental method with a seemingly low false positive rate will generate mostly false positives. Thus, suppose that the logical space of variables we are exploring contains 100,000 possible relations, of which only one is actual. And suppose our experimental method is capable of detecting such relations with a false positive rate of one in a hundred tests. Then, of every 100,000 tests performed, on average about 99,000 will return true negative results, about 1,000 will return false positives, and 1 will return a true positive. This means that the ratio of false to true positives is roughly 1,000 to 1. This is not a good ratio! It certainly confirms the assertion that “most claimed research findings are false.”
Ioannidis’s analysis is a bit more complex than what I have described, of course. In particular, it includes a factor for study power, which I am neglecting. But what I have described is the meat of the matter. It depends essentially on the factor Ioannidis calls R, the ratio of actual to possible relations among variables of interest to a given scientific question. Moreover, R does not have to be particularly small to start causing trouble. Suppose that rather than R = .00001, as in the previous case, we have only R = .1. Then, in 100,000 random tests, we should on average encounter 10,000 actual relations and 90,000 nonrelations. If we take the traditional α = .05 significance level as our false positive rate (instead of the .01 of the previous case), then we can expect 4,500 false positive results from our 90,000 nonrelation tests. Even assuming perfect power to detect the actual relations, 4,500 / 14,500 of our positive results, nearly a third, are false.
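Ioannidis’s arithmetic is easy to sketch in a few lines of code. The function below is my own framing, not a formula from his paper: it treats R, as the text does, as the proportion of tested relations that are actual, and computes what share of “significant” results are false positives.

```python
def false_positive_share(R, alpha, power=1.0, n_tests=100_000):
    """Expected fraction of 'significant' results that are false positives.

    R     -- proportion of tested relations that are actual
    alpha -- false positive rate per test of a nonrelation
    power -- probability of detecting an actual relation
    """
    actual = n_tests * R                 # tests of real relations
    nonrelations = n_tests - actual      # tests of nonrelations
    true_pos = actual * power            # real relations detected
    false_pos = nonrelations * alpha     # nonrelations wrongly flagged
    return false_pos / (false_pos + true_pos)

# The two scenarios from the text:
print(false_positive_share(R=0.00001, alpha=0.01))  # ~0.999: positives nearly all false
print(false_positive_share(R=0.1, alpha=0.05))      # ~0.31: nearly a third of positives false
```

Note that n_tests cancels out of the ratio; only R, alpha, and power matter, which is why the argument does not depend on the size of the hypothetical search space.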
This is a clever point, which I admit I never thought about before. It is basically the problem of base rate neglect applied to the context of scientific research (see also Tversky and Kahneman, “Evidential Impact of Base Rates,” in Kahneman, Slovic, and Tversky, eds., Judgment under Uncertainty: Heuristics and Biases, 1982: 153–160). But I said I do not find it particularly persuasive. There are several reasons for this. For one thing, it could fairly easily be accommodated by using Bayesian statistics instead of traditional null-hypothesis statistical testing. For another, if this were a serious problem—if today’s typical published study had R = .01, for example (much less R = .00001)—then successful replication would practically never happen. But it does. Many studies are successfully replicated, and let us remember that the decline effect is so named just because effect sizes tend to shrink, not instantly disappear altogether. Thus, Ioannidis’s analysis does not account for the pattern of positive findings and their subsequent decline that we frequently observe.
More importantly, there are good reasons to think that R is usually not small. It may be small indeed in exploratory research of the kind Ioannidis seems sometimes to have in mind. In the practical example he provides, the investigators do a whole genome study to discover whether any of 100,000 genes are associated with susceptibility to schizophrenia. Thus, assuming perhaps 10 genes may be thus associated, we have R = .0001, and obviously this is quite a problem. (And it’s hard to believe for just this reason that genetic association studies really use traditional null-hypothesis statistical testing, but Ioannidis would know about this much better than I would.) But a great deal of research, and most of the sort I am concerned about, is not blindly exploratory in this manner.
For example, consider once again Bargh’s Florida effect. What is R liable to be in this case? How did Bargh decide on the design of his experiment? Was his research question, “What would make people walk slower than usual?”, and did he then choose to test age-related word priming at random from among several hundred possible variables? Certainly not. Rather, he most likely started from the hypothesis that associative memory is a pervasive cognitive structure capable of influencing almost any conscious process. This would be a hypothesis he had strong reason from previous research to suspect is true. From this hypothesis it would follow that semantic priming for aging might associatively affect almost any other process, including walking. Thus, an empirical finding that age-related word priming induces slow walking, which is quite startling a priori, helps confirm a hypothesis whose prior probability (in the Bayesian sense) is not particularly low. In Bargh’s experiment, then, plausibly a hypothesis with a relatively high prior probability (i.e., a relatively large R) is supported by evidence with relatively low prior probability. This is just the sort of condition that makes for strong confirmatory power, which is how Bargh’s experiment is usually interpreted. Most research in cognitive psychology—and I should think in most experimental science per se—is generated in this manner, not in a blind exploratory fashion. If so, then Ioannidis’s clever problem is not a serious threat to most research.
But if Ioannidis’s suggestion does not explain the decline effect, what does? Of the ideas I have surveyed, just two seem really plausible. Both can be summed up by the same word: bias.
The first source of bias is forcefully illustrated in a 2011 paper by Simmons, Nelson, and Simonsohn, “False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant.” The authors name four aspects of research design which investigators typically adjust on the fly, even though such adjustments can seriously increase the risk of obtaining a false-positive result. One of these is the choice of sample size. It is not unusual for an investigator, after collecting a certain number of observations without obtaining a significant result, to collect another round of observations in the hope of obtaining one. This procedure significantly increases the probability of obtaining a false-positive result, but it may not be mentioned in the methods section of the article where the results are reported.
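How much does one round of optional stopping inflate the false-positive rate? Here is a minimal simulation of my own (not Simmons et al.’s code), using a rough critical value of |t| > 2.0 in place of an exact .05 cutoff: both groups are drawn from the same population, so every “significant” result is a false positive.

```python
import random
import statistics

def welch_t(a, b):
    """Welch-style two-sample t statistic."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

def one_study(rng, n_initial=20, n_extra=10, crit=2.0):
    # Both groups come from the SAME population: any "effect" is spurious.
    a = [rng.gauss(0, 1) for _ in range(n_initial)]
    b = [rng.gauss(0, 1) for _ in range(n_initial)]
    if abs(welch_t(a, b)) > crit:
        return True                      # "significant" at the first look
    a += [rng.gauss(0, 1) for _ in range(n_extra)]
    b += [rng.gauss(0, 1) for _ in range(n_extra)]
    return abs(welch_t(a, b)) > crit     # second look at the enlarged sample

rng = random.Random(42)
runs = 4000
rate = sum(one_study(rng) for _ in range(runs)) / runs
print(f"false-positive rate with one optional extra look: {rate:.3f}")  # noticeably above the nominal .05
```

Even a single “just collect ten more per group” decision pushes the error rate well past the advertised 5%, and each additional look pushes it higher.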
Another flexible aspect of research design is the choice of dependent variables. It is not unusual for more dependent measures to be collected than are reported. The danger of this should be obvious. Imagine that in the Florida effect study, for example, besides walking speed, the investigators also asked participants to estimate the ages of people shown in photographs, measured the time participants spent putting on their coats after the experiment, and asked participants to estimate the weight of a heavy object by lifting it. If these other measures are not found to be significantly related to age-related priming, it may not be reported that they were ever collected. Yet it is clear that the investigators have in effect performed four experiments, not just one. So the risk of a false-positive result is increased by a factor of four, unbeknownst to the reader of the article in which the positive result is reported.
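The inflation from extra dependent measures is easy to quantify under a simplifying assumption. This back-of-the-envelope calculation is mine, not a figure from any of the studies discussed, and it assumes the four measures are independent (correlated measures would inflate the rate somewhat less):

```python
# Chance that at least one of k independent measures tested at alpha = .05
# comes up "significant" by luck alone.
alpha, k = 0.05, 4
family_rate = 1 - (1 - alpha) ** k
print(round(family_rate, 3))  # 0.185: nearly four times the nominal 5%
```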
The other two flexible aspects of research analyzed by Simmons et al. were the use of covariates (which may sometimes be employed with little theoretical justification although they significantly alter the strength of the relation between the independent and dependent variables) and the reporting of subsets of experimental conditions (for example, if a treatment was administered at low, medium, and high levels, any of the three possible pairwise comparisons [low–medium, medium–high, or low–high] might produce a significant result even when the linear relation between the three [low–medium–high] does not).
The authors ran a computer simulation in which 15,000 random samples were drawn from a normal population and flexibly analyzed according to the four practices they examined. When all four practices were combined, a significant “publishable” result was produced 60.7% of the time at the .05 level, and 21.5% of the time at the .01 level.
It should be noted—and stressed—that the use of these flexible, discretionary practices in the search for significant relations (and other practices, such as the elimination of “outliers” from a dataset and the mathematical transformation of variables) does not necessarily imply any sort of fraud or malicious intent on the part of the investigator. The practical reality of designing and conducting a study is almost always considerably messier than the model of logical scientific reasoning and methodology presented in the final report. It is common and accepted practice for design and analysis decisions to get made as a study proceeds. (For empirical evidence that the practices studied by Simmons et al., as well as other questionable research practices, are in fact commonplace, see John, Loewenstein, and Prelec, 2012.) Unfortunately, as Simmons et al. show, the sorts of discretionary decisions described here, in conjunction with an investigator’s powerful interest in finding a significant result, can result in a false positive rate far higher than .05. Most investigators probably have little awareness of this impact of their discretionary decision making on the quality—the credibility—of their results.
It is not only investigators who are biased in favor of significant results. Journals provide a second important source of bias. As I mentioned earlier, journals, and especially the more prestigious journals, want to see positive, exciting, novel results. In psychology, many of the best journals simply do not publish exact replications, whether successful or not. In view of the problem with experimenter bias just described, this obviously has the potential to create a very faulty body of “results.” It is not hard to see how a spectacular result, like the Florida effect, once published, might be quickly “supported” by a barrage of conceptual replications, which, given the experimenter bias just discussed, might not be too hard to come by. And so excitement and confidence build around the Florida effect, even though (let us suppose) neither it nor any of its conceptual replications has ever been exactly replicated, because the journals are not interested in exact replication. The danger is that research programs and major theories in psychology may thus be constructed on illusory basic findings.
Over time, of course, findings that once were novel and exciting become orthodoxy and therefore fair game for revision and even attack. Novelty and excitement can now be had at their expense. So the biases in their favor become relaxed and research supporting them declines. It might even become interesting to show that they can’t be exactly replicated. The decline effect sets in.
How serious is the replication problem? Speaking for myself, I wasn’t much concerned until I began reading the articles on which this post is based. Until the Open Science Collaboration published their article last August, the arguments and complaints of the stats and methods heads were mostly theoretical and mostly indistinguishable from what any trained psychologist has been hearing from such people since the first year of graduate school. Everybody says they hate null-hypothesis statistical testing and ridicules the arbitrary .05 level, and everybody knows that a criterion of α = .05 for statistical significance means that one in every twenty tests of a true null hypothesis will (on average) turn up a “significant” result by chance. But null-hypothesis statistical testing goes on because it is easy to understand and perform, and as for the likelihood of false positives, what is the alternative? Halt research? So initially I wasn’t inclined to pay too much attention to navel-gazing meditations on the supposed crisis roiling psychological science for the past five years.
But what I realize now is that the decline effect is different. It is one thing to issue warnings about problems that might result from less-than-pristine research methods. It is quite another to document empirically that a wide swath of research findings, including some that are highly cited, is in fact evaporating. What makes the Open Science Collaboration article so important is that it presents the first hard evidence that exact replications are unobtainable for many and perhaps most research findings published in top psychology journals, so that the methodologists’ worries about the potential of questionable research methods to produce false results need to be taken a great deal more seriously than psychologists have been inclined to do up to now. If most of what is getting published in Psychological Science and The Journal of Personality and Social Psychology and The Journal of Experimental Psychology is false, then we have a genuine crisis.
The latest issue of Psychological Science has an editorial promising to take the replication crisis seriously and improve reviewing standards at the journal. What would have been more comforting is if the author had simply issued a new set of concrete requirements. (Simmons et al. suggest such a set, which would do for a start.) The good news is that I don’t think this problem is going away. People seem bound to keep pushing on this, and as long as people keep pushing, serious change is bound to come. Stay tuned for further developments.
Postscript added February 2, 2016
The latest APS Observer has a banner across the cover trumpeting “Psychological Science’s Commitment to Replicability.” (APS is the Association for Psychological Science.) Inside is a two-page interview with PS’s interim editor D. Stephen Lindsay—author of the PS editorial I mentioned at the end of my post. On first reading, I found his remarks as underwhelming as I had his editorial. There is the same emphasis on urging researchers to do better, as though brave resolutions to hold ourselves to stricter methodological standards will be sufficient to change behavior in the face of strong institutional incentives to publish exciting results (and none at all to embrace strict methodological standards). This seems particularly obnoxious and unrealistic coming from a senior scientist with nothing much at stake, to the extent that the advice is directed to students, postdocs, and junior faculty who are struggling to have any career at all.
There is also the same notion that readers can protect themselves against questionable research by watching out for studies with small sample sizes, surprising results, and p values too close to .05. This, when the root of the problem is “p hacking” methods of the kind described by Simmons et al. (2011), such as reporting only a subset of the dependent measures collected and continuing to collect data until a “significant” effect appears. As Simmons et al. show, the effect of these techniques is to vastly inflate the effective false-positive rate and in many cases to practically guarantee that at least some publishable result will be obtained. Yet, if the questionable techniques are not reported, there is no way for the reviewers or readers to know they were employed.
I have become so pessimistic about this situation as to think there is really no way to re-establish the credibility of psychological science but to start publishing exact replications in leading journals like PS. And very soon after my post was up, I regretted that I had not made the conclusion stronger, to the effect that if Lindsay were serious, he would move to start publishing exact replications in PS, since nothing less would fix the problem.
Fortunately, in just the past few days I have learned of two encouraging developments that are making me feel a whole lot better. First, PS has a program of “Open Practices” badges that can be awarded to articles whose authors conform to certain open science guidelines. There are three badges: Open Data, indicating that all the study data is permanently available in an open-access repository; Open Materials, indicating that all the study materials are likewise available, and also that the authors have provided complete instructions sufficient to enable outside investigators to perform an exact replication; and best of all, Preregistered, indicating that the design and data analysis plan for the study was preregistered in an open-access repository.
Preregistration of studies is a great idea, because nearly all the jiggery-pokery involved in p hacking would be prevented if an exact plan for data collection and analysis were published in advance of conducting a study. The Open Science Framework is the leading organization known to me that promotes and provides facilities for preregistering studies. The trouble, of course, is that it’s merely voluntary. That’s why something like the badge system in PS is good news. It provides a way for researchers to get credit for doing things right—or well—and a way for readers to gauge the methodological quality of articles. The badges appear on the title pages of the articles and in the journal table of contents. The latest issue of PS has Open Data badges for 7 of 13 articles and an Open Materials badge for 1. The previous issue had 6 Open Data and Open Materials badges out of 15 articles. No Preregistered badges so far, which is not surprising—that’s a very tough standard. Seems like not a bad start (the program is in its second year). It is odd that Lindsay didn’t mention this program in his interview and gave it only the briefest of mentions in his editorial. I learned about it only because it is featured in the current edition of the weekly email that PS sends to APS members.
The second new development is that APS has launched a new type of journal article, the Registered Replication Report (RRR), which reports attempts to exactly replicate key findings. Target findings are identified by their influence and by their theoretical and practical importance. Study design and data collection protocols are developed in collaboration with the original authors when possible and preregistered, and data collection and analysis is performed by multiple institutions. The results across institutions are then meta-analyzed in an attempt to provide a definitive measurement of the effect size. So the RRR represents a formal, resource-intensive effort to exactly replicate a key finding.
Only two RRRs have been completed so far. The first, published in Perspectives on Psychological Science last October, successfully replicated Schooler’s finding that verbally describing a person one observed commit a simulated bank robbery caused worse performance at recognizing that person in a lineup. Schooler’s effect is one of the vanishing effects featured by Lehrer in the New Yorker article I cited in my post. The successful RRR replication comes with caveats: The RRR effect size (16% worse recognition than controls) is considerably reduced from what Schooler originally found (25%), and the effect is observed only when there is a 20-minute delay between witnessing and verbally describing the event. Verbal description immediately after the event had minimal effect (4%).
The second RRR, to be published in the next issue of Perspectives, examined an effect in which behavior that was described imperfectively (what the person was doing) was interpreted as being more intentional and was imagined in more detail than if it was described perfectively (what the person did). This replication failed.
Replication failure is unpleasant. But the fact that it’s now being done, even on such a limited scale, is good news in the long run for restoring the credibility of psychological science.