Abstract
Benjamin et al. (Nature Human Behaviour 2 (1), 6–10, 2018) proposed decreasing the significance level by an order of magnitude, from .05 to .005, to improve the replicability of psychology. This modest, practical proposal has been widely criticized, and its prospects remain unclear. This article defends the proposal against these criticisms and highlights its virtues.
Notes
Supposing that these true negatives are ever observed, an unlikely outcome if researchers heavily engage in questionable research practices that ensure reaching the significance level (Simmons et al. 2011).
Psychological Science publishes articles in various areas of psychology.
That is, the proportion of replicated studies whose p-values are below .05.
See also Trafimow (2018) for discussion of how to measure replicability.
We do not claim originality for many of the points put forward in our paper.
The Bayes Factor argument does not rely on the ascription of probabilities to hypotheses about parameter values: No probability is assigned to the null model or to the alternative model.
This is only the case when the standard deviation is the same for the null and alternative models. Thanks to Justin Fisher for noting this point.
Morey (2018) also criticizes this empirical argument, but he misunderstands its point. The argument does not aim to show that there are fewer failed replications for original p’s ≤ .005 than for .005 < original p’s ≤ .05, but rather to give a sense of how much replicability could increase following a reduction of the significance level.
This restriction applies only to direct replications and not to conceptual replications (for a criticism of this distinction, see however Machery n.d.).
It is, for example, surprising that critics of null hypothesis significance testing fail to see that, even in the absence of a cutoff, scientists would engage in practices that exaggerate how much evidence they have for their pet hypotheses.
Morey (2018) also shows that according to his own representation of two-sided tails as pairs of one-sided tails, a p-value equal to .02 provides substantial evidence for a directional hypothesis.
Given that many null hypotheses are literally false (there is very often a tiny effect), Lakens and colleagues’ remark challenges the common assumption that by rejecting a point null hypothesis one is also entitled to conclude that the effect is at most negligible (Machery 2014).
One may question this appeal to syncretism since the choice of a .005 level is only justified on Bayesian grounds; a truly syncretic approach would instead justify it on both Bayesian and frequentist grounds. However, first, the appeal to syncretism is meant to undermine the idea that Bayesian considerations are always irrelevant to a frequentist. Even if no frequentist justification is provided, a syncretist cannot dismiss the relevance of Bayesian considerations. Second, the argument from the false discovery rate can be given a frequentist interpretation: It examines the frequency of false positives among significant results for various possible base rates of true null hypotheses, exactly as we would do when assessing whether a medical test is sufficiently sensitive.
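This frequentist reading can be made concrete with a minimal sketch (our illustration, not a calculation from Benjamin et al.): it computes the frequency of false positives among significant results for several hypothetical base rates of true null hypotheses, at the .05 and .005 levels, with power fixed at an assumed value of .8.

```python
# Illustrative sketch: false discovery rate (the proportion of significant
# results that are false positives) for assumed base rates of true nulls.
# The base rates and the power of .8 are assumptions for illustration.
def false_discovery_rate(alpha, power, base_rate_true_null):
    false_positives = alpha * base_rate_true_null
    true_positives = power * (1 - base_rate_true_null)
    return false_positives / (false_positives + true_positives)

for base_rate in (0.5, 0.8, 0.9):
    fdr_05 = false_discovery_rate(0.05, 0.8, base_rate)
    fdr_005 = false_discovery_rate(0.005, 0.8, base_rate)
    print(f"base rate {base_rate}: FDR = {fdr_05:.3f} at .05, {fdr_005:.3f} at .005")
```

On these assumptions, lowering the level from .05 to .005 substantially reduces the false discovery rate at every base rate, just as a more stringent criterion improves the positive predictive value of a medical test.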
This presentation slightly simplifies Crane’s presentation, but nothing of importance is lost (see eq. 9 in Crane, n.d.).
This type of objection would undermine various other proposals that take for granted the null hypothesis significance testing framework (e.g., preregistration).
Argamon also flirts with the everything-or-nothing attitude that we criticized earlier when we discussed Trafimow et al. (2018).
No one thinks it is sufficient.
Our proposal is entirely consistent with a meta-analytic approach, and it is unclear why, as Lakens et al. (2018, 169) assert, our proposal would “divert attention from the cumulative evaluation of findings, such as converging results of multiple (replication) studies.”
What if the significance level is used as a publication filter? We then need to distinguish situations where the null hypothesis is true from those where it is false. When the null hypothesis is true, effect size inflation increases as the significance level decreases, even if the sample size increases to keep power constant. However, when the null is false, such an increase need not occur. P-values are right-skewed when the null is false, and the extent of the skew depends on the sample size for constant population parameters. So, if decreasing the significance level results in an increase in sample size, a larger number of p-values may be significant at the smaller significance level. As a result, effect size inflation may decrease rather than increase.
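The point can be illustrated with a small Monte Carlo sketch (our construction, with assumed values for the true effect, power, and test, not a calculation from the paper): studies of a true effect are simulated with a one-sided z-test, only significant results pass the publication filter, and the sample size is adjusted so that power stays constant across significance levels.

```python
# Illustrative simulation: effect size inflation among significant results
# when the null is false, at alpha = .05 vs. .005 with power held constant.
# The true effect (0.3) and power (.5) are assumed values for illustration.
import random
from statistics import NormalDist, mean

random.seed(1)
norm = NormalDist()

def mean_significant_effect(alpha, true_d, power=0.5, n_studies=20000):
    """Average observed effect among simulated studies of a true effect
    true_d that pass the significance filter, with the sample size n
    chosen to give the requested power for a one-sided z-test."""
    z_alpha = norm.inv_cdf(1 - alpha)
    z_power = norm.inv_cdf(power)          # 0 when power = .5
    n = round(((z_alpha + z_power) / true_d) ** 2)
    se = 1 / n ** 0.5                      # standard error, unit-variance data
    significant = []
    for _ in range(n_studies):
        observed = random.gauss(true_d, se)
        if observed / se > z_alpha:        # passes the publication filter
            significant.append(observed)
    return mean(significant)

true_d = 0.3
print(mean_significant_effect(0.05, true_d))   # inflated above 0.3
print(mean_significant_effect(0.005, true_d))  # less inflated: larger n
```

With these assumed numbers, the average significant effect overestimates the true effect of 0.3 at both levels, but less so at .005, because maintaining power at the stricter level forces a larger sample size.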
References
Amrhein, V., and S. Greenland. 2018. Remove, rather than redefine, statistical significance. Nature Human Behaviour 2: 4.
Amrhein, V., F. Korner-Nievergelt, and T. Roth. 2017. The earth is flat (p > 0.05): Significance thresholds and the crisis of unreplicable research. PeerJ 5: e3544.
Amrhein, V., D. Trafimow, and S. Greenland. 2018. Abandon statistical inference. PeerJ Preprints 6: e26857v1. https://doi.org/10.7287/peerj.preprints.26857v1.
Argamon, S. E. (2017). Don’t strengthen statistical significance—Abolish it. https://www.americanscientist.org/blog/macroscope/dont-strengthen-statistical-significance-abolish-it.
Baker, M., and E. Dolgin. 2017. Cancer reproducibility project releases first results. Nature 541: 269–270.
Begley, C.G., and L.M. Ellis. 2012. Drug development: Raise standards for preclinical cancer research. Nature 483: 531–533.
Benjamin, D., Berger, J., Johannesson, M., Johnson, V., Nosek, B., & Wagenmakers, E. J. (2017). Précis by Dan Benjamin, Jim Berger, Magnus Johannesson, Valen Johnson, Brian Nosek, and EJ Wagenmakers. http://philosophyofbrains.com/2017/10/02/should-we-redefine-statistical-significance-a-brains-blog-roundtable.aspx.
Benjamin, D.J., J.O. Berger, M. Johannesson, B.A. Nosek, E.-J. Wagenmakers, R. Berk, K.A. Bollen, B. Brembs, L. Brown, C. Camerer, D. Cesarini, C.D. Chambers, M. Clyde, T.D. Cook, P. De Boeck, Z. Dienes, A. Dreber, K. Easwaran, C. Efferson, E. Fehr, F. Fidler, A.P. Field, M. Forster, E.I. George, R. Gonzalez, S. Goodman, E. Green, D.P. Green, A. Greenwald, J.D. Hadfield, L.V. Hedges, L. Held, T.-H. Ho, H. Hoijtink, J.H. Jones, D.J. Hruschka, K. Imai, G. Imbens, J.P.A. Ioannidis, M. Jeon, M. Kirchler, D. Laibson, J. List, R. Little, A. Lupia, E. Machery, S.E. Maxwell, M. McCarthy, D. Moore, S.L. Morgan, M. Munafò, S. Nakagawa, B. Nyhan, T.H. Parker, L. Pericchi, M. Perugini, J. Rouder, J. Rousseau, V. Savalei, F.D. Schönbrodt, T. Sellke, B. Sinclair, D. Tingley, T. Van Zandt, S. Vazire, D.J. Watts, C. Winship, R.L. Wolpert, Y. Xie, C. Young, J. Zinman, and V.E. Johnson. 2018. Redefine statistical significance. Nature Human Behaviour 2 (1): 6–10.
Bright, L. K. (2017). Supporting the redefinition of statistical significance. http://sootyempiric.blogspot.com/2017/07/supporting-redefinition-of-statistical.html.
Button, K.S., J.P. Ioannidis, C. Mokrysz, B.A. Nosek, J. Flint, E.S. Robinson, and M.R. Munafò. 2013. Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience 14: 365–376. https://doi.org/10.1038/nrn3475.
Chang, A. C., & Li, P. (2015). Is economics research replicable? Sixty published papers from thirteen journals say ‘usually not’. https://doi.org/10.17016/FEDS.2015.083. Available at SSRN: https://ssrn.com/abstract=2669564 or https://doi.org/10.2139/ssrn.2669564
Cohen, J. 1962. The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology 65: 145–153.
Colquhoun, D. 2014. An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science 1 (3): 140216.
Cox, D.R. 1977. The role of significance tests. Scandinavian Journal of Statistics 4: 49–63.
Crane, H. (n.d.). Why ‘redefining statistical significance’ will not improve reproducibility and could make the replication crisis worse.
de Ruiter, J.P. 2019. Redefine or justify? Comments on the alpha debate. Psychonomic Bulletin & Review 26 (2): 430–433.
Esarey, J. (2017). Lowering the threshold of statistical significance to p < 0.005 to encourage enriched theories of politics. https://thepoliticalmethodologist.com/2017/08/07/in-support-of-enriched-theories-of-politics-a-case-for-lowering-the-threshold-of-statistical-significance-to-p-0-00
Etz, A., and J. Vandekerckhove. 2016. A Bayesian perspective on the reproducibility project: Psychology. PLoS One 11 (2): e0149794.
Fanelli, D. 2010. “Positive” results increase down the hierarchy of the sciences. PLoS One 5 (4): e10068.
Fraley, R.C., and S. Vazire. 2014. The N-pact factor: Evaluating the quality of empirical journals with respect to sample size and statistical power. PLoS One 9 (10): e109019.
García-Pérez, M.A. 2017. Thou shalt not bear false witness against null hypothesis significance testing. Educational and Psychological Measurement 77: 631–662.
Gelman, A. (2017a). When considering proposals for redefining or abandoning statistical significance, remember that their effects on science will only be indirect! http://andrewgelman.com/2017/10/03/one-discussion-redefining-abandoning-statistical-significance/.
Gelman, A. (2017b). Response to some comments on “abandon statistical significance.” http://andrewgelman.com/2017/10/02/response-comments-abandon-statistical-significance/.
Giner-Sorolla, R. (2018). Justify your alpha … for its audience. https://approachingblog.wordpress.com/2018/03/28/justify-your-alpha-to-an-audience/.
Greenland, S. 2010. Comment: The need for syncretism in applied statistics. Statistical Science 25 (2): 158–161.
Greenwald, A.G. 1976. An editorial. Journal of Personality and Social Psychology 33: 1–7.
Guilera, G., M. Barrios, and J. Gómez-Benito. 2013. Meta-analysis in psychology: A bibliometric study. Scientometrics 94 (3): 943–954.
Hamlin, K. (2017). Commentary by Kiley Hamlin. http://philosophyofbrains.com/2017/10/02/should-we-redefine-statistical-significance-a-brains-blog-roundtable.aspx.
Ioannidis, J.P.A. 2005. Why most published research findings are false. PLoS Medicine 2 (8): e124.
Ioannidis, J.P.A. 2016. The mass production of redundant, misleading, and conflicted systematic reviews and meta-analyses. The Milbank Quarterly 94 (3): 485–514.
Lakens, D., F.G. Adolfi, C.J. Albers, F. Anvari, M.A.J. Apps, S.E. Argamon, T. Baguley, R.B. Becker, S.D. Benning, D.E. Bradford, E.M. Buchanan, A.R. Caldwell, B. Van Calster, R. Carlsson, S.-C. Chen, B. Chung, L.J. Colling, G.S. Collins, Z. Crook, E.S. Cross, S. Daniels, H. Danielsson, L. DeBruine, D.J. Dunleavy, B.D. Earp, M.I. Feist, J.D. Ferrell, J.G. Field, N.W. Fox, A. Friesen, C. Gomes, M. Gonzalez-Marquez, J.A. Grange, A.P. Grieve, R. Guggenberger, J. Grist, A.-L. van Harmelen, F. Hasselman, K.D. Hochard, M.R. Hoffarth, N.P. Holmes, M. Ingre, P.M. Isager, H.K. Isotalus, C. Johansson, K. Juszczyk, D.A. Kenny, A.A. Khalil, B. Konat, J. Lao, E.G. Larsen, G.M.A. Lodder, J. Lukavský, C.R. Madan, D. Manheim, S.R. Martin, A.E. Martin, D.G. Mayo, R.J. McCarthy, K. McConway, C. McFarland, A.Q.X. Nio, G. Nilsonne, C.L. de Oliveira, J.-J.O. de Xivry, S. Parsons, G. Pfuhl, K.A. Quinn, J.J. Sakon, S.A. Saribay, I.K. Schneider, M. Selvaraju, Z. Sjoerds, S.G. Smith, T. Smits, J.R. Spies, V. Sreekumar, C.N. Steltenpohl, N. Stenhouse, W. Świątkowski, M.A. Vadillo, M.A.L.M. Van Assen, M.N. Williams, S.E. Williams, D.R. Williams, T. Yarkoni, I. Ziano, and R.A. Zwaan. 2018. Justify your alpha. Nature Human Behaviour 2 (3): 168–171.
Lemoine, N.P., A. Hoffman, A.J. Felton, L. Baur, F. Chaves, J. Gray, Q. Yu, and M.D. Smith. 2016. Underappreciated problems of low replication in ecological field studies. Ecology 97 (10): 2554–2561.
Lindley, D.V. 1957. A statistical paradox. Biometrika 44: 187–192.
Machery, E. 2014. Significance testing in neuroimagery. In New waves in the philosophy of mind, ed. J. Kallestrup and M. Sprevak, 262–277. Palgrave Macmillan.
Machery, E. (n.d.). What is a replication?
Malinsky, D. (2017). Significant moral hazard. https://sootyempiric.blogspot.com/2017/08/significant-moral-hazard.html.
Marsman, M., and E.J. Wagenmakers. 2017. Three insights from a Bayesian interpretation of the one-sided p-value. Educational and Psychological Measurement 77 (3): 529–539.
Mayo, D. (2017a). Commentary by Deborah Mayo. http://philosophyofbrains.com/2017/10/02/should-we-redefine-statistical-significance-a-brains-blog-roundtable.aspx.
Mayo, D. (2017b). Why significance testers should reject the argument to “redefine statistical significance”, even if they want to lower the p-value. https://errorstatistics.com/2017/12/17/why-significance-testers-should-reject-the-argument-to-redefine-statistical-significance-even-if-they-want-to-lower-the-p-value/.
McShane, B. B., Gal, D., Gelman, A., Robert, C., & Tackett, J. L. (2018). Abandon statistical significance. April 9, 2018.
Meehl, P.E. 1990. Why summaries of research on psychological theories are often uninterpretable. Psychological Reports 66: 195–244.
Morey, R.D. (2017). When the statistical tail wags the scientific dog. Should we ‘redefine’ statistical significance? https://medium.com/@richarddmorey/when-the-statistical-tail-wags-the-scientific-dog-d09a9f1a7c63.
Morey, R.D. (2018). Redefining statistical significance: The statistical arguments. https://medium.com/@richarddmorey/redefining-statistical-significance-the-statistical-arguments-ae9007bc1f91.
Oakes, L.M. 2017. Sample size, statistical power, and false conclusions in infant looking-time research. Infancy 22 (4): 436–469.
Open Science Collaboration (2015). Estimating the reproducibility of psychological science. Science, 349, aac4716. https://doi.org/10.1126/science.aac4716.
Peters, G. J. (2017). Appropriate humility: Choosing sides in the alpha wars based on psychology rather than methodology and statistics. https://sciencer.eu/2017/08/appropriate-humility-choosing-sides-in-the-alpha-wars-based-on-psychology-rather-than-methodology-and-statistics/.
Schimmack, U. (2017). What would Cohen say? A comment on p < .005. https://replicationindex.wordpress.com/2017/08/02/what-would-cohen-say-a-comment-on-p-005/.
Schmalz, X. (2018). By how much would we need to increase our sample sizes to have adequate power with an alpha level of 0.005? http://xeniaschmalz.blogspot.ca/2018/02/by-how-much-would-we-need-to-increase.html?
Sedlmeier, P., and G. Gigerenzer. 1989. Do studies of statistical power have an effect on the power of studies? Psychological Bulletin 105: 309–316.
Simmons, J.P., L.D. Nelson, and U. Simonsohn. 2011. False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science 22 (11): 1359–1366.
Simonsohn, U., J.P. Simmons, and L.D. Nelson. 2015. Better P-curves: Making P-curve analysis more robust to errors, fraud, and ambitious P-hacking, a reply to Ulrich and Miller (2015). Journal of Experimental Psychology: General 144 (6): 1146–1152.
Trafimow, D. 2018. An a priori solution to the replication crisis. Philosophical Psychology 31: 1188–1214.
Trafimow, D., V. Amrhein, C.N. Areshenkoff, C. Barrera-Causil, E.J. Beh, Y. Bilgiç, R. Bono, M.T. Bradley, W.M. Briggs, H.A. Cepeda-Freyre, S.E. Chaigneau, D.R. Ciocca, J. Carlos Correa, D. Cousineau, M.R. de Boer, S.S. Dhar, I. Dolgov, J. Gómez-Benito, M. Grendar, J. Grice, M.E. Guerrero-Gimenez, A. Gutiérrez, T.B. Huedo-Medina, K. Jaffe, A. Janyan, A. Karimnezhad, F. Korner-Nievergelt, K. Kosugi, M. Lachmair, R. Ledesma, R. Limongi, M.T. Liuzza, R. Lombardo, M. Marks, G. Meinlschmidt, L. Nalborczyk, H.T. Nguyen, R. Ospina, J.D. Perezgonzalez, R. Pfister, J.J. Rahona, D.A. Rodríguez-Medina, X. Romão, S. Ruiz-Fernández, I. Suarez, M. Tegethoff, M. Tejo, R. van de Schoot, I. Vankov, S. Velasco-Forero, T. Wang, Y. Yamada, F.C. Zoppino, and F. Marmolejo-Ramos. 2018. Manipulating the alpha level cannot cure significance testing. Frontiers in Psychology 9, article 699. https://doi.org/10.3389/fpsyg.2018.00699.
Vankov, I., J. Bowers, and M.R. Munafò. 2014. On the persistence of low power in psychological science. The Quarterly Journal of Experimental Psychology 67 (5): 1037–1040.
Wegner, D.M. 1992. The premature demise of the solo experiment. Personality and Social Psychology Bulletin 18 (4): 504–508.
Zollman, K. (2017). Commentary by Kevin Zollman. http://philosophyofbrains.com/2017/10/02/should-we-redefine-statistical-significance-a-brains-blog-roundtable.aspx.
Acknowledgements
I owe the expression “alpha war” to Simine Vazire. Thanks to John Doris, Felipe Romero, and two reviewers for very helpful feedback.
Cite this article
Machery, E. The Alpha War. Rev.Phil.Psych. 12, 75–99 (2021). https://doi.org/10.1007/s13164-019-00440-1