Abstract
The likelihood theory of evidence (LTE) says, roughly, that all the information relevant to the bearing of data on hypotheses (or models) is contained in the likelihoods. There exist counterexamples in which one can tell which of two hypotheses is true from the full data, but not from the likelihoods alone. These examples suggest that some forms of scientific reasoning, such as the consilience of inductions (Whewell, 1858), cannot be represented within Bayesian and Likelihoodist philosophies of science.
Notes
Terminology varies. In the computer science literature especially, a simple hypothesis is called a model and what I am calling a model is referred to as a model class.
A peculiar thing about the quote from Barnard (above) is that he refers to the likelihood of a simple hypothesis as a probability function. It is not a function except in the very trivial sense of mapping a single hypothesis to a single number.
In contrast, the Law of Likelihood (LL) is very specific about how likelihoods are used in the comparison of simple hypotheses. Forster and Sober (2004) argue that AIC is a counterexample to LL. Unfortunately, Forster and Sober (2004) mistakenly describe LL as the likelihood principle, which was pointed out by Boik (2004) in the same volume. For the record, Forster and Sober (2004) did not intend to say anything about the likelihood principle—the present paper is the first publication in which I have discussed LP.
See Forster (2000) for a description of the best known model selection criteria, and for an argument that the Akaike framework is the conceptually clearest framework for understanding the problem of model selection because it clearly distinguishes criteria from goals.
The term ‘predictive accuracy’ was coined by Forster and Sober (1994), where it is given a precise definition in terms of SOS and likelihood fit functions.
I owe this suggestion to Jason Grossman.
The problem is the same one discussed in Forster (1988b).
While the refutation is not refutation in the strict logical sense, the number of data in the example can be increased to whatever number you like, so it becomes arbitrarily close to that ideal.
Fitelson (1999) shows that choice of the difference measure does matter in some applications. But that issue does not arise here.
The word ‘constraint’ is borrowed from Sneed (1971), who introduced it as a way of constraining submodels. Although the sense of ‘model’ assumed here is different from Sneed’s, the idea is the same.
Myrvold and Harper (2002) criticize the Akaike criterion of model selection (Forster & Sober, 1994) because it underrates the importance of the agreement of independent measurements in Newton’s argument for universal gravitation (see Harper, 2002 for an intriguing discussion of Newton’s argument). While this paper supports their conclusion, it does so in a more precise and general way. The important advance in this paper is (1) to point out that the limitation applies to all model selection criteria based on the Likelihood Principle and (2) to pinpoint exactly where the limitation lies. Nor is it my conclusion that statistics does not have the resources to address the problem.
Wasserman (2000) provides a nice survey.
References
Aitkin, M. (1991). Posterior Bayes factors. Journal of the Royal Statistical Society B, 53, 111–142.
Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov, & F. Csaki (Eds.), 2nd International symposium on information theory (pp. 267–281). Budapest: Akademiai Kiado.
Barnard, G. A. (1947). Review of Wald’s ‘Sequential analysis’. Journal of the American Statistical Association, 42, 658–669.
Berger, J. O. (1985). Statistical decision theory and Bayesian analysis (2nd ed.). New York: Springer-Verlag.
Berger, J. O., & Wolpert, R. L. (1988). The likelihood principle (2nd ed.). Hayward, California: Institute of Mathematical Statistics.
Birnbaum, A. (1962). On the foundations of statistical inference (with discussion). Journal of the American Statistical Association, 57, 269–326.
Boik, R. J. (2004). Commentary. In M. Taper, & S. Lele (Eds.), The nature of scientific evidence (pp. 167–180). Chicago and London: University of Chicago Press.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multi-model inference. New York: Springer-Verlag.
Earman, J. (1978). Fairy tales vs. an ongoing story: Ramsey’s neglected argument for scientific realism. Philosophical Studies, 33, 195–202.
Edwards, A. W. F. (1987). Likelihood (Expanded edition). Baltimore and London: The Johns Hopkins University Press.
Fitelson, B. (1999). The plurality of Bayesian measures of confirmation and the problem of measure sensitivity. Philosophy of Science, 66, S362–S378.
Forster, M. R. (1984). Probabilistic causality and the foundations of modern science. Ph.D. Thesis, University of Western Ontario.
Forster, M. R. (1986). Unification and scientific realism revisited. In A. Fine, & P. Machamer (Eds.), PSA 1986 (Vol. 1, pp. 394–405). E. Lansing, Michigan: Philosophy of Science Association.
Forster, M. R. (1988a). Unification, explanation, and the composition of causes in Newtonian mechanics. Studies in the History and Philosophy of Science, 19, 55–101.
Forster, M. R. (1988b). Sober’s principle of common cause and the problem of incomplete hypotheses. Philosophy of Science, 55, 538–559.
Forster, M. R. (2000). Key concepts in model selection: Performance and generalizability. Journal of Mathematical Psychology, 44, 205–231.
Forster, M. R. (forthcoming). The miraculous consilience of quantum mechanics. In E. Eells, & J. Fetzer (Eds.), Probability in science. Open Court.
Forster, M. R., & Sober, E. (1994). How to tell when simpler, more unified, or less ad hoc theories will provide more accurate predictions. British Journal for the Philosophy of Science, 45, 1–35.
Forster, M. R., & Sober, E. (2004). Why likelihood? In M. Taper, & S. Lele (Eds.), The nature of scientific evidence (pp. 153–165). Chicago and London: University of Chicago Press.
Friedman, M. (1981). Theoretical explanation. In R. A. Healey (Ed.), Time, reduction and reality (pp. 1–16). Cambridge: Cambridge University Press.
Glymour, C. (1980). Explanations, tests, unity and necessity. Noûs, 14, 31–50.
Hacking, I. (1965). Logic of statistical inference. Cambridge: Cambridge University Press.
Harper, W. L. (2002). Howard Stein on Isaac Newton: Beyond hypotheses. In D. B. Malament (Ed.), Reading natural philosophy: Essays in the history and philosophy of science and mathematics (pp. 71–112). Chicago and La Salle, Illinois: Open Court.
Hooker, C. A. (1987). A realistic theory of science. Albany: State University of New York Press.
Jeffreys, H. (1961). Theory of probability (3rd ed.). Oxford: The Clarendon Press.
Mayo, D. G. (1996). Error and the growth of experimental knowledge. Chicago and London: The University of Chicago Press.
Myrvold, W., & Harper, W. L. (2002). Model selection, simplicity, and scientific inference. Philosophy of Science, 69, S135–S149.
Norton, J. D. (1993). The determination of theory by evidence: The case for quantum discontinuity, 1900–1915. Synthese, 97, 1–31.
Norton, J. D. (2000). How we know about electrons. In R. Nola, & H. Sankey (Eds.), After Popper, Kuhn and Feyerabend (pp. 67–97). Kluwer Academic Press.
Pearl, J. (2000). Causality: Models, reasoning, and inference. Cambridge: Cambridge University Press.
Royall, R. M. (1991). Ethics and statistics in randomized clinical trials (with discussion). Statistical Science, 6, 52–88.
Royall, R. M. (1997). Statistical evidence: A likelihood paradigm. Boca Raton: Chapman & Hall/CRC.
Savage, L. J. (1976). On rereading R. A. Fisher (with discussion). Annals of Statistics, 4, 441–500.
Sakamoto, Y., Ishiguro, M., & Kitagawa, G. (1986). Akaike information criterion statistics. Dordrecht: Kluwer Academic Publishers.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–465.
Sneed, J. D. (1971). The logical structure of mathematical physics. Dordrecht: D. Reidel.
Sober, E. (1993). Epistemology for empiricists. In H. Wettstein (Ed.), Midwest studies in philosophy (pp. 39–61). Notre Dame: University of Notre Dame Press.
Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44, 92–107.
Whewell, W. (1858). Novum organon renovatum. Reprinted as Part II of the 3rd ed. of The philosophy of the inductive sciences. London: Cass, 1967.
Whewell, W. (1989). In R. E. Butts (Ed.), Theory of scientific method. Indianapolis/Cambridge: Hackett Publishing Company.
Woodward, J. (2003). Making things happen: A theory of causal explanation. Oxford and New York: Oxford University Press.
Additional information
Thanks go to all those who responded well to the first version of this paper presented at the University of Pittsburgh Center for Philosophy of Science on January 31, 2006, and especially to Clark Glymour. A revised version was presented at Carnegie-Mellon University on April 6, 2006. I also wish to thank Jason Grossman, John Norton, Teddy Seidenfeld, Elliott Sober, Peter Vranas, and three anonymous referees for valuable feedback on different parts of the manuscript.
This paper is part of the ongoing development of a half-baked idea about cross-situational invariance in causal modeling introduced in Forster (1984). I appreciated the encouragement at that time from Jeff Bub, Bill Demopoulos, Michael Friedman, Bill Harper, Cliff Hooker, John Nicholas, and Jim Woodward. Cliff Hooker discussed the idea in his (1987), and Jim Woodward suggested a connection with statistics, which has taken me 20 years to figure out.
Appendix
Theorem
If the maximum likelihood hypothesis in F is \(Y=\frac{10}{\sqrt{101}}X+U\) and the observed variance of X is 101, then the observed variance of Y is also 101. Thus, the maximum likelihood hypothesis in B is \(X=\frac{10}{\sqrt{101}}Y+Z,\) and they have the same likelihood. Moreover, for any α, β, and σ, there exist values of a, b, and s such that Y = α + β X + σ U and X = a + bY + sZ have the same likelihood.
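The first claim of the theorem can be checked numerically. The following is a minimal sketch rather than the author's own computation: it assumes NumPy, a hypothetical sample of 1,000 points split evenly between two clusters at X = −10 and X = +10 (the arrangement used in the partial proof), and helper names (`standardise`, `max_loglik`) introduced only for illustration. The sample is rescaled so that the observed moments match those stated in the theorem exactly, and the forward and backward regressions are then fitted by least squares, which coincides with maximum likelihood under Gaussian noise.

```python
import numpy as np

def standardise(v):
    """Rescale v to have observed mean 0 and observed variance 1."""
    return (v - v.mean()) / v.std()

def max_loglik(pred, resp):
    """Fit resp = a + b*pred + noise; return slope, residual variance, max log-likelihood."""
    n = len(resp)
    b, a = np.polyfit(pred, resp, 1)      # least-squares slope and intercept
    resid = resp - (a + b * pred)
    s2 = np.mean(resid ** 2)              # ML estimate of the residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * s2) + 1)
    return b, s2, loglik

rng = np.random.default_rng(0)            # seed is arbitrary (assumption)
half = 500                                # points per cluster (assumption)

# Two unit-variance clusters centred exactly at -10 and +10, so Var X = 101.
x = np.concatenate([-10 + standardise(rng.normal(size=half)),
                    +10 + standardise(rng.normal(size=half))])

# Noise U with observed mean 0, variance 1, and zero observed covariance with X.
u = rng.normal(size=2 * half)
u = u - np.polyval(np.polyfit(x, u, 1), x)
u = u / u.std()

y = 10 / np.sqrt(101) * x + u             # the forward hypothesis in F

b_f, s2_f, ll_f = max_loglik(x, y)        # forward: Y regressed on X
b_b, s2_b, ll_b = max_loglik(y, x)        # backward: X regressed on Y

print("Var X, Var Y:", np.var(x), np.var(y))
print("forward  slope, residual variance, log-likelihood:", b_f, s2_f, ll_f)
print("backward slope, residual variance, log-likelihood:", b_b, s2_b, ll_b)
```

Because the observed variances of X and Y are equal by construction, the two fitted slopes and the two residual variances coincide, and so do the maximised conditional log-likelihoods.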
Partial Proof
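The observed X variance of data distributed in two Gaussian clusters with unit variance centered at X = −10 and X = +10, where the observed means of X and Y are 0, is equal to
\[\hbox{Var}X=\frac{1}{N}\left[\sum_{i}x_i^{2}+\sum_{j}x_j^{2}\right],\]
where \(x_i\) denotes X values in the lower cluster and \(x_j\) denotes X values in the upper cluster. If all the \(x_i\) were equal to −10, and all the \(x_j\) were equal to +10, then \(\hbox{Var}X\) would be equal to 100. To that, one must add the effect of the local variances. More exactly,
\[\hbox{Var}X=\frac{1}{N}\left[\sum_{i}(x_i+10)^{2}+\sum_{j}(x_j-10)^{2}\right]+100=1+100=101,\]
where N is the total number of data and the cross terms vanish because the observed cluster means are −10 and +10.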
From the equation \(Y=\frac{10}{\sqrt{101}}X+U,\) it follows that \(\hbox{Var}Y=\frac{100}{101}\cdot 101+1=101.\) Standard formulae for regression curves now prove that \(X=\frac{10}{\sqrt{101}}Y\) is the backwards regression line, where the observed residual variance is also equal to 1. Therefore, the two hypotheses have the same conditional likelihoods, and the same total likelihoods. It follows that the hypotheses \(Y=\frac{10}{\sqrt{101}} X+\sigma U\) and \(X=\frac{10}{\sqrt{101}}Y+\sigma Z\) have the same likelihoods for any value of σ. It is also clear that for any α, β, and σ, there exist values of a, b, and s such that Y = α + β X + σ U and X = a + bY + sZ have the same likelihoods.
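The appeal to standard regression formulae can be spelled out as follows. This is a sketch under one stated assumption, namely that the observed covariance of X and U is zero, which holds when the forward line is the least-squares (maximum likelihood) fit. Here b and \(s^{2}\) stand for the backward slope and the backward residual variance, following the notation of the theorem.

\[\hbox{Cov}(X,Y)=\frac{10}{\sqrt{101}}\,\hbox{Var}X=10\sqrt{101},\]
\[b=\frac{\hbox{Cov}(X,Y)}{\hbox{Var}Y}=\frac{10\sqrt{101}}{101}=\frac{10}{\sqrt{101}},\]
\[s^{2}=\hbox{Var}X-\frac{\hbox{Cov}(X,Y)^{2}}{\hbox{Var}Y}=101-\frac{100\cdot 101}{101}=1.\]

The backward slope and residual variance therefore equal their forward counterparts, which is what makes the two maximised conditional likelihoods agree.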
Cite this article
Forster, M.R. Counterexamples to a likelihood theory of evidence. Minds & Machines 16, 319–338 (2006). https://doi.org/10.1007/s11023-006-9038-y