Abstract
Biometrics has done damage with levels of R or p or Student’s t. The damage widened with Ronald A. Fisher’s victory in the 1920s and 1930s in devising mechanical methods of “testing,” against methods of common sense and scientific impact, “oomph.” The scale along which one would measure oomph is particularly clear in the biomedical sciences: life or death. Cardiovascular epidemiology, to take one example, combines with gusto the “fallacy of the transposed conditional” and what we call the “sizeless stare” of statistical significance. Some medical editors have battled against the 5% philosophy, as did, for example, Kenneth Rothman, the founder of Epidemiology. And decades ago a sensible few in education, ecology, and sociology initiated a “significance test controversy.” But grantors, journal referees, and tenure committees in the statistical sciences had faith that probability spaces can substitute for scientific judgment. A finding of p < .05 is deemed to be “better” for variable X than p < .11 for variable Y. It is not. It depends on the oomph of X and Y, that is, on the effect size, judged in the light of how much it matters for scientific or clinical purposes. In 1995 a Cancer Trialists’ Collaborative Group, for example, came to a rare consensus on effect size: 10 different studies had agreed that a certain drug for treating prostate cancer can increase patient survival by 12%. An 11th study, published in the New England Journal of Medicine in 1998, dismissed the drug. The dismissal was based on a t-test, not on what William Gosset (the “Student” of Student’s t) had called, against Ronald A. Fisher’s machinery, “real” error.
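The point about p < .05 for X versus p < .11 for Y can be made concrete with a small sketch. The numbers below are hypothetical, invented only for illustration and taken from no study discussed here; the helper `two_sided_p` and the use of a normal (z) approximation in place of Student's t are likewise assumptions of the sketch. The variable with the smaller p-value carries by far the smaller effect.

```python
import math


def two_sided_p(estimate, std_error):
    """Two-sided p-value for a normal (z) test of the hypothesis estimate == 0."""
    z = abs(estimate) / std_error
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2))


# Hypothetical effects (say, percentage-point gains in survival) and their
# standard errors. These values are invented for illustration only.
effects = {
    "X": {"estimate": 0.5, "std_error": 0.25},   # tiny effect, precisely measured
    "Y": {"estimate": 12.0, "std_error": 7.5},   # large effect, noisily measured
}

p_values = {name: two_sided_p(**values) for name, values in effects.items()}

for name, values in effects.items():
    print(f"{name}: effect = {values['estimate']:5.1f}, p = {p_values[name]:.3f}")
```

Under these assumed numbers X clears the 5% bar and Y does not, yet Y's estimated effect is 24 times as large. Ranking the two by p-value alone would rank them by precision of measurement, not by oomph.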
References
Altman DG (1991) Statistics in medical journals: Developments in the 1980s. Statistics in Medicine 10: 1897–1913.
American Psychological Association (APA) 1952 to 2001 [revisions] Publication Manual of the American Psychological Association. Washington, DC: APA.
Berger JO (2003) Could Fisher, Jeffreys, and Neyman have agreed on testing? Statistical Science 18: 1–32.
Cohen J (1994) The earth is round (p < 0.05). American Psychologist 49: 997–1003.
David FN, ed (1966) Research Papers in Statistics: Festschrift for J. Neyman. London: Wiley.
Eisenberger MA, Blumenstein BA, Crawford ED, Miller G, McLeod DG, Loehrer PJ, Wilding G, Sears K, Culkin DJ, Thompson IM, Bueschen AJ, Lowe BA (1998) Bilateral orchiectomy with or without flutamide for metastatic prostate cancer. New England Journal of Medicine 339: 1036–1042.
Fidler F (2002) The fifth edition of the APA Publication Manual: Why its statistics recommendations are so controversial. Educational and Psychological Measurement 62: 749–770.
Fidler F, Thomason N, Cumming G, Finch S, Leeman J (2004) Editors can lead researchers to confidence intervals but they can’t make them think: Statistical reform lessons from medicine. Psychological Science 15: 119–126.
Fisher RA (1922) On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society A 222: 309–368.
Fisher RA (1926) Bayes’ Theorem. Eugenics Review 18: 32–33.
Fisher RA ([1956] 1959) Statistical Methods and Scientific Inference, 2nd ed. New York: Hafner.
Fleiss JL (1986) Significance tests do have a role in epidemiological research: Reaction to AA Walker. American Journal of Public Health 76: 559–600.
Freiman JA, Chalmers T, Smith H, Kuebler RR (1978) The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial: Survey of 71 negative trials. New England Journal of Medicine 299: 690–694.
Goodman S (1999a) Toward evidence-based medical statistics. 1: The p-value fallacy. Annals of Internal Medicine 130: 995–1004.
Hoover K, Siegler M (2008) Sound and fury: McCloskey and significance testing in economics. Journal of Economic Methodology 15: 1–37.
International Committee of Medical Journal Editors (ICMJE) (1988) Uniform requirements for … statisticians and biomedical journal editors. Statistics in Medicine 7: 1003–1011.
Jeffreys H (1963) Review of L. J. Savage, et al., The Foundations of Statistical Inference (Methuen, London and Wiley, New York, 1962). Technometrics 5: 407–410.
Klein H, Elifson KW, Sterk CE (2003) Perceived temptation to use drugs and actual drug use among women. Journal of Drug Issues 33: 161–192.
Lang JM, Rothman KJ, Cann CI (1998) That confounded p-value. Epidemiology 9: 7–8.
Pearson ES (1990) [posthumously published by Plackett RL, Barnard GA, eds] ‘Student’: A Statistical Biography of William Sealy Gosset. Oxford: Clarendon Press.
Rennie D (1978) Vive la Difference (p < 0.05). New England Journal of Medicine 299: 828–829.
Rossi J (1990) Statistical power of psychological research: What have we gained in 20 years? Journal of Consulting and Clinical Psychology 58: 646–656.
Rothman KJ (1978) A show of confidence. New England Journal of Medicine 299: 1362–1363.
Rothman KJ (1986) Modern Epidemiology. New York: Little, Brown.
Rothman KJ (1998) Writing for epidemiology. Epidemiology 9: 333–337.
Rothman KJ, Johnson ES, Sugano DS (1999) Is flutamide effective in patients with bilateral orchiectomy? Lancet 353: 1184.
Savitz DA, Tolo K, Poole C (1994) Statistical significance testing in the American Journal of Epidemiology, 1970–1990. American Journal of Epidemiology 139: 1047–1052.
Shryock RH (1961) The history of quantification in medical science. Isis 52: 215–237.
Sterne JAC, Davey Smith G (2001) Sifting the evidence—What’s wrong with significance tests? British Medical Journal 322: 226–231.
Zabell S (1989) R. A. Fisher on the history of inverse probability. Statistical Science 4: 247–263.
Zellner A (1984) Basic Issues in Econometrics. Chicago: University of Chicago Press.
Ziliak ST, Hannon J (2006) Public assistance: Colonial times to the 1920s. In Historical Statistics of the United States. (Carter SB, Gartner SS, Haines MR, Olmstead AL, Sutch R, Wright G, eds). New York: Cambridge University Press.
Ziliak ST, McCloskey DN (2008) The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives. Ann Arbor, MI: University of Michigan Press.
Cite this article
McCloskey, D.N., Ziliak, S.T. The Unreasonable Ineffectiveness of Fisherian “Tests” in Biology, and Especially in Medicine. Biol Theory 4, 44–53 (2009). https://doi.org/10.1162/biot.2009.4.1.44
DOI: https://doi.org/10.1162/biot.2009.4.1.44