By Dan Benjamin
Researchers representing a wide range of disciplines and statistical perspectives—72 of us in total—propose to redefine statistical significance. For claims of discoveries of novel effects, we advocate a change in the P-value threshold for a “statistically significant” result from 0.05 to 0.005. Results currently called “significant” that do not meet the new threshold would be called suggestive and treated as ambiguous as to whether there is an effect.
The P-value threshold of 0.05 for statistical significance is arbitrary, and a P-value of 0.05 constitutes much weaker evidence against the null hypothesis than most researchers realize. This weakness of evidence at P = 0.05 is one reason why the replication rate of new findings in the social sciences is surprisingly low.
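To give a rough sense of how weak that evidence can be, here is a minimal sketch of one well-known calibration of P-values: the bound 1 / (-e · p · ln p) on the Bayes factor in favor of the alternative hypothesis, valid for p < 1/e. Choosing this particular bound is an assumption made for illustration, and the function name `max_bayes_factor` is hypothetical; other calibrations give different numbers.

```python
import math


def max_bayes_factor(p: float) -> float:
    """Upper bound on the Bayes factor favoring the alternative hypothesis.

    Uses the -e * p * ln(p) calibration, which only holds for 0 < p < 1/e.
    This is one illustrative calibration, not the only way to translate
    a P-value into a measure of evidence.
    """
    if not 0.0 < p < 1.0 / math.e:
        raise ValueError("the bound only applies for 0 < p < 1/e")
    return 1.0 / (-math.e * p * math.log(p))


for p in (0.05, 0.005):
    print(f"P = {p}: Bayes factor at most {max_bayes_factor(p):.1f}")
```

Under this bound, P = 0.05 corresponds to odds of at most about 2.5 to 1 in favor of an effect, while P = 0.005 corresponds to at most about 14 to 1.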
Many statisticians and methodologically oriented researchers have argued that we should move away from null hypothesis significance testing altogether. I agree, and I advocated one alternative approach in a previous blog post. But there are many alternatives to null hypothesis significance testing, and there is no consensus on which alternative to adopt.
While I believe that in the long run we should move away from null hypothesis significance testing, there is an immediately actionable step that I think would be an improvement over the status quo: reduce the P-value threshold for statistical significance to 0.005 for claims of new discoveries, and refer to results with P-values between 0.005 and 0.05 as “suggestive” rather than “significant.” This proposal is laid out in a new paper that my co-authors and I have posted on PsyArXiv and that is forthcoming in Nature Human Behaviour. The fact that 72 of us, representing a broad range of disciplines, could agree on this proposal suggests that even broader consensus may be possible for adopting it.
Of course, at some level 0.005 is just as arbitrary as 0.05. But I believe that as a scientific community, we have discovered that the rate of false positives is unacceptably high at the P-value threshold of 0.05, and so it would be better to adopt a more stringent threshold that reduces the false positive rate. It would be better still to move away from arbitrary thresholds—and I hope that the change in definition of statistical significance will hasten, rather than hinder, larger improvements in methodology and increased statistical sophistication.
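The link between the threshold and the false positive rate can be sketched with a short calculation. The function below computes the share of “significant” results that are false positives, given the threshold, the statistical power of the studies, and the fraction of tested hypotheses that are actually true. The inputs used here (power of 0.8 and prior odds of 1 to 10 that a tested effect is real) are illustrative assumptions, not figures stated in this post, and the function name is hypothetical.

```python
def false_positive_rate(alpha: float, power: float, prior_prob_true: float) -> float:
    """Fraction of significant results for which the null is actually true.

    alpha:           significance threshold (chance a true null is rejected)
    power:           chance a real effect reaches significance
    prior_prob_true: fraction of tested hypotheses with a real effect
    """
    false_positives = alpha * (1.0 - prior_prob_true)
    true_positives = power * prior_prob_true
    return false_positives / (false_positives + true_positives)


# Illustrative assumptions: prior odds of 1:10 (probability 1/11), power 0.8.
prior = 1.0 / 11.0
for alpha in (0.05, 0.005):
    rate = false_positive_rate(alpha, power=0.8, prior_prob_true=prior)
    print(f"threshold {alpha}: false positive rate {rate:.1%}")
```

With these assumed inputs, about 38% of results significant at 0.05 would be false positives, compared with about 6% at 0.005. Different assumptions about power and prior odds change the numbers, but the stricter threshold lowers the rate in every case where power stays the same.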
A more extensive discussion of our new paper is here: