Abstract
A number of authors have recently put forward arguments pro or contra various rules for scoring probability estimates. In doing so, they have skipped over a potentially important consideration in making such assessments, to wit, that the hypotheses whose probabilities are estimated can approximate the truth to different degrees. Once this is recognized, it becomes apparent that the question of how to assess probability estimates depends heavily on context.
Notes
Or the quadratic scoring rule, which is a generalization of the Brier score; see below.
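For readers who want the Brier score in concrete form, here is a minimal sketch (the function name and the convention of summing squared differences against the truth-indicator vector are mine, not notation from the paper):

```python
def brier_score(probs, true_index):
    """Sum of squared differences between the assigned probabilities
    and the truth-value indicator vector (1 for the true hypothesis,
    0 for all others). Lower is better; 0 is a perfect forecast."""
    return sum((p - (1 if i == true_index else 0)) ** 2
               for i, p in enumerate(probs))

# A forecaster who puts all probability on the true hypothesis scores 0:
print(brier_score([1.0, 0.0, 0.0], 0))  # → 0.0
```

The quadratic scoring rule mentioned in this note generalizes this by weighting the squared differences.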
That the Brier rule cannot guarantee this is a direct consequence of the general fact mentioned two paragraphs back.
Selten instead prefers the Brier rule, mainly because, as he proves, it is the only scoring rule (up to positive linear transformations) that satisfies each of what he considers to be four important desiderata for such rules, which Selten presents as axioms. According to the first axiom, the ordering of the hypotheses should not influence the score. According to the second, the score should not be affected by the introduction of an additional hypothesis that receives zero probability. The third axiom is the requirement of strict propriety. The fourth axiom, finally, concerns a type of situation that we do not consider in this essay, namely, when a probability assignment is scored in light of another probability assignment rather than in light of the truth of one hypothesis; the axiom requires that, in this situation, the score should be the same regardless of which probability assignment is considered to be the “true” one.
To be entirely precise, instances of the quadratic scoring rule and the VS rule would have to be embellished with a super- or subscript to indicate the weighting function that is being assumed. We will not be so fussy, however.
In this connection, it is also worth mentioning that, at least according to some influential Bayesian statisticians (Gelman and Hill 2007; Gelman and Shalizi 2012, 2013; Kruschke 2013), raising or lowering probabilities in the absence of the kind of evidence with “direct bearing” is accepted as legitimate practice, most notably, as resulting from a so-called posterior predictive check in which a statistical model may be rejected because it is found unsatisfactory (according to informal criteria) in light of simulated data. If rejected, the model is to be replaced by a new one, which requires, among other things, a specification of new prior probabilities. The simulated data that can motivate this kind of model revision—including probability revision—is presumably not the kind of new evidence that Moss has in mind.
This result was obtained by means of the FixedPointList function from Mathematica and therefore holds only up to machine precision. However, that the process would reach a fixed point (even if perhaps not after 442 steps) is guaranteed by Theorem 2, to be stated shortly.
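The behavior of Mathematica's FixedPointList can be mimicked as follows (the map iterated below is an arbitrary contraction chosen purely for illustration; the map \(m\) actually at issue is the one defined in Appendix B):

```python
def fixed_point_list(f, x0, tol=1e-15, max_steps=10_000):
    """Iterate f from x0, collecting iterates until two successive
    values agree to within tol (i.e., up to machine precision)."""
    xs = [x0]
    while len(xs) < max_steps:
        nxt = f(xs[-1])
        if abs(nxt - xs[-1]) <= tol:
            break
        xs.append(nxt)
    return xs

# Illustration with the contraction x -> (x + 2/x)/2, whose fixed
# point is the square root of 2:
iterates = fixed_point_list(lambda x: (x + 2 / x) / 2, 1.0)
print(len(iterates), iterates[-1])
```

As in the Mathematica computation, the stopping criterion is agreement up to machine precision, so the "fixed point" returned is exact only up to floating-point arithmetic.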
A real-life example of this kind of usage is found in recent work on forecasting carried out by a group of psychologists from various American universities (Mellers et al. 2015; Tetlock and Gardner 2015). These researchers have organized, over a period of several years, a number of prediction tournaments, mostly concerning geopolitical questions. They found that some otherwise ordinary people were much more accurate forecasters than even professional intelligence analysts. A key objective of the research was to determine what distinguishes the most accurate forecasters from the rest of the population. The researchers used a number of different scoring rules for evaluating their participants’ performance, including the Brier score but also the so-called AUROC, which is known to be an improper scoring rule (see, e.g., Agresti 2007, Ch. 5; Hastie et al. 2009, Ch. 9, for details). Given that the participants were never told what the evaluation process consisted of, the use of an improper scoring rule in that process will not have affected their responses. (Note that, although in this research both proper and improper scoring rules were used for the purposes of selection, one could also use an improper scoring rule to select participants while at the same time scoring them via a proper scoring rule to determine their compensation in the experiment. Letting participants know how they will be compensated will then encourage them to post their true probabilities, while the improper scoring rule—the use of which is not disclosed to the participants—may still yield more useful information.)
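For readers unfamiliar with the AUROC: it equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case (ties counted as half). Because it depends only on the ranking of the reported probabilities, not on their calibration, it cannot reward honest probability reporting the way a proper scoring rule does. A minimal sketch (the function name and example data are mine):

```python
def auroc(scores, labels):
    """Area under the ROC curve, computed via its rank-based
    interpretation: the fraction of (positive, negative) pairs in
    which the positive case outscores the negative one, with ties
    counted as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect ranking yields 1.0 regardless of how well calibrated
# the probabilities are:
print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # → 1.0
```

Note that, for instance, squaring all the scores leaves the AUROC unchanged, which is exactly why it is improper: any monotone distortion of one's true probabilities scores equally well.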
In fact, to the best of my knowledge, Konek (2016) contains the only reference to the rule (actually, the continuous version of the RPS rule) in the entire philosophical literature.
Because, as noted, the RPS rule is strictly proper, it satisfies Selten’s third axiom (see note 3). To see that it also satisfies his fourth axiom, note that, for comparing a probability assignment \((p_1,\ldots ,p_n)\) with a “true” probability distribution \((p^*_1,\ldots ,p^*_n)\), the RPS rule takes this form:
$$\begin{aligned} \frac{(p_1-p^*_1)^2 + \bigl ((p_1+p_2)-(p^*_1+p^*_2)\bigr )^2 + \cdots + \bigl ((p_1+\cdots + p_n)-(p^*_1+\cdots + p^*_n)\bigr )^2}{n-1}. \end{aligned}$$The symmetry required by the fourth axiom then follows from the fact that the addends in the numerator are all squared. Furthermore, the fact that David’s and Emma’s rank probability scores are different, as seen in the main text, is enough to show that the rule does not satisfy Selten’s first axiom. Finally, to show that neither does it satisfy the second axiom, we can add to the partition consisting of hypotheses A, B, and C the hypothesis that the student will receive a C−, where this has zero probability for David. Keeping his probabilities for A, B, and C as they were, David’s rank probability score then becomes (approximately) 0.243, and hence the addition of the zero-probability alternative did affect the score.
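The displayed formula is straightforward to compute; here is a sketch (the function name is mine). Scoring against the truth of a single hypothesis is the special case in which \((p^*_1,\ldots ,p^*_n)\) is the indicator vector of the true hypothesis:

```python
def rank_probability_score(p, p_star):
    """Ranked probability score of assignment p against the 'true'
    distribution p_star: the squared differences of the cumulative
    probabilities, summed and divided by n - 1, per the displayed
    formula. The last addend is always 0, since both cumulative
    sums reach 1."""
    cum_p = cum_q = 0.0
    total = 0.0
    for pi, qi in zip(p, p_star):
        cum_p += pi
        cum_q += qi
        total += (cum_p - cum_q) ** 2
    return total / (len(p) - 1)

# Scoring a maximally wrong forecast over three ordered hypotheses
# against the indicator vector of the true (last) hypothesis:
print(rank_probability_score([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # → 1.0
```

Because the score depends on cumulative probabilities, it is sensitive to the ordering of the hypotheses, in line with the observation that the rule violates Selten's first axiom.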
Thanks to Ilkka Niiniluoto for bringing this to my attention.
It might be said that the VS rule used in this section does not do quite as well with respect to the grading example as the RPS rule. Although David does better than Emma—David having a score of 0.117, and Emma, of 0.189—he incurs the same penalty as Frank. However, this result depends on the particular weights we chose for the example. It is easy to choose weights which could still be said to reflect truthlikeness relations but which would lead to qualitatively the same result as the RPS rule.
To my knowledge, the only other author explicitly open to the possibility of “scoring rule pluralism” is Schurz (2018).
I am greatly indebted to Eric Raidl, Christopher von Bülow, Verena Wagner, Sylvia Wenmackers, and two anonymous referees for valuable comments on previous versions of this paper. Thanks also to Lieven Decock, Samuel Fletcher, and Jos Uffink for helpful discussions. Versions of this paper were presented at the Universities of Düsseldorf and Konstanz and at the IHPST (Paris). I thank the audiences on those occasions for stimulating questions and remarks.
References
Agresti, A. (2007). An introduction to categorical data analysis. Hoboken, NJ: Wiley.
Bernardo, J. M. (1979). Expected information as expected utility. Annals of Statistics, 7, 686–690.
Bernardo, J. M., & Smith, A. F. M. (2000). Bayesian theory. New York: Wiley.
Bickel, J. E. (2007). Some comparisons between quadratic, spherical, and logarithmic scoring rules. Decision Analysis, 4, 49–65.
Bickel, J. E. (2010). Scoring rules and decision analysis education. Decision Analysis, 7, 346–357.
Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3.
Brouwer, L. E. J. (1911). Über Abbildungen von Mannigfaltigkeiten. Mathematische Annalen, 71, 97–115.
Cevolani, G., Festa, R., & Kuipers, T. A. F. (2013). Verisimilitude and belief change for nomic conjunctive theories. Synthese, 190, 3307–3324.
Cooke, R. M. (1991). Experts in uncertainty. Oxford: Oxford University Press.
de Finetti, B. (1962). Does it make sense to speak of ‘good probability appraisers’? In I. J. Good (Ed.), The scientist speculates: An anthology of partly-baked ideas (pp. 357–364). New York: Basic Books.
Epstein, E. S. (1969). A scoring system for probability forecasts of ranked categories. Journal of Applied Meteorology, 8, 985–987.
Gelman, A., & Hill, J. (2007). Data analysis using regression and multilevel/hierarchical models. Cambridge: Cambridge University Press.
Gelman, A., & Shalizi, C. R. (2012). Philosophy and the practice of Bayesian statistics in the social sciences. In H. Kincaid (Ed.), The Oxford handbook of philosophy of social science (pp. 259–273). Oxford: Oxford University Press.
Gelman, A., & Shalizi, C. R. (2013). Philosophy and the practice of Bayesian statistics. British Journal of Mathematical and Statistical Psychology, 66, 8–38.
Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society, Series B, 14, 107–114.
Greaves, H., & Wallace, D. (2006). Justifying conditionalization: Conditionalization maximizes expected epistemic utility. Mind, 115, 607–632.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning (2nd ed.). New York: Springer.
Joyce, J. (1998). A nonpragmatic vindication of probabilism. Philosophy of Science, 65, 575–603.
Konek, J. (2016). Probabilistic knowledge and cognitive ability. Philosophical Review, 125, 509–587.
Kruschke, J. K. (2013). Posterior predictive checks can and should be Bayesian. British Journal of Mathematical and Statistical Psychology, 66, 45–56.
Kuipers, T. A. F. (2000). From instrumentalism to constructive realism. Dordrecht: Kluwer.
Kuipers, T. A. F. (2001). Structures in science. Dordrecht: Kluwer.
Kuipers, T. A. F. (2014). Empirical progress and nomic truth approximation revisited. Studies in History and Philosophy of Science, 46, 64–72.
Leitgeb, H., & Pettigrew, R. (2010). An objective justification of Bayesianism I: Measuring inaccuracy. Philosophy of Science, 77, 201–235.
Levinstein, B. A. (2012). Leitgeb and Pettigrew on accuracy and updating. Philosophy of Science, 79, 413–424.
Lombrozo, T. (2017). ‘Learning by thinking’ in science and in everyday life. In P. Godfrey-Smith & A. Levy (Eds.), The scientific imagination. Oxford: Oxford University Press, in press.
McCarthy, J. (1956). Measures of the value of information. Proceedings of the National Academy of Sciences, 42, 654–655.
Mellers, B., Stone, E., Murray, T., Minster, A., Rohrbaugh, N., Bishop, M., et al. (2015). Identifying and cultivating superforecasters as a method of improving probabilistic predictions. Perspectives on Psychological Science, 10, 267–281.
Moss, S. (2011). Scoring rules and epistemic compromise. Mind, 120, 1053–1069.
Murphy, A. (1969). On the ‘ranked probability score’. Journal of Applied Meteorology, 8, 988–989.
Niiniluoto, I. (1984). Is science progressive? Dordrecht: Reidel.
Niiniluoto, I. (1998). Verisimilitude: The third period. British Journal for the Philosophy of Science, 49, 1–29.
Niiniluoto, I. (1999). Critical scientific realism. Oxford: Oxford University Press.
O’Hagan, A., Buck, C. E., Daneshkhah, A., Eiser, J. R., Garthwaite, P. H., Jenkinson, D. J., et al. (2006). Uncertain judgements: Eliciting experts’ probabilities. Hoboken, NJ: Wiley.
Popper, K. R. (1963). Conjectures and refutations. London: Routledge and Kegan Paul.
Rosenkrantz, R. D. (1981). Foundations and applications of inductive probability. Atascadero, CA: Ridgeview Publishing Company.
Schurz, G. (1987). A new definition of verisimilitude and its applications. In P. Weingartner & G. Schurz (Eds.), Logic, philosophy of science and epistemology (Proceedings of the 11th international wittgenstein symposium) (pp. 177–184). Vienna: Hölder-Pichler-Tempsky.
Schurz, G. (1991). Relevant deduction. Erkenntnis, 35, 391–437.
Schurz, G. (2011). Verisimilitude and belief revision. Erkenntnis, 75, 203–221.
Schurz, G. (2014). Philosophy of science: A unified approach. New York: Routledge.
Schurz, G. (2018) The optimality of meta-induction: A new approach to Hume’s problem. Manuscript.
Selten, R. (1998). Axiomatic characterization of the quadratic scoring rule. Experimental Economics, 1, 43–62.
Tetlock, P., & Gardner, D. (2015). Superforecasting: The art and science of prediction. London: Penguin Random House.
Tichý, P. (1974). On Popper’s definition of verisimilitude. British Journal for the Philosophy of Science, 25, 155–160.
Winkler, R. L. (1969). Scoring rules and the evaluation of probability assessors. Journal of the American Statistical Association, 64, 1073–1078.
Winkler, R. L. (1996). Scoring rules and the evaluation of probabilities. Test, 5, 1–60.
Winkler, R. L., & Murphy, A. H. (1968). ‘Good’ probability assessors. Journal of Applied Meteorology, 7, 751–758.
This paper is dedicated to Gerhard Schurz, on the occasion of his 60th birthday.
Appendices
Appendix A
Recall that the weights of a VS rule are all positive and add up to 1, and that they are said to reflect truthlikeness in a minimally adequate sense iff hypotheses are assigned weights as a function of their distance from the truth, with hypotheses farther from the truth being assigned larger weights than hypotheses closer to the truth.
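One simple scheme that produces weights reflecting truthlikeness in this minimally adequate sense is to make each weight proportional to the hypothesis's distance from the truth, with a small positive offset so that the true hypothesis (distance 0) still receives a positive weight. The scheme and the numbers below are purely illustrative, not the paper's:

```python
def truthlikeness_weights(distances, offset=0.1):
    """Map each hypothesis's distance from the truth to a positive
    weight, with hypotheses farther from the truth receiving larger
    weights, and normalize so the weights sum to 1. The offset keeps
    the weight of the true hypothesis (distance 0) positive."""
    raw = [d + offset for d in distances]
    total = sum(raw)
    return [r / total for r in raw]

# Truth-distances for three hypotheses when the first is true, the
# second is closer to the truth than the third (illustrative values):
weights = truthlikeness_weights([0.0, 1.0, 2.0])
print(weights)  # increasing with distance, summing to 1
```

Any such assignment is positive, sums to 1, and orders the weights by distance from the truth, so it is minimally adequate in the sense just defined.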
Theorem 1
Every VS rule whose weights reflect truthlikeness in a minimally adequate sense is improper.
Proof
Without loss of generality, consider a hypothesis partition of three hypotheses, \(H_1\), \(H_2\), and \(H_3\). Then, where \(\mathcal {V}\) is some VS rule and \(\mathbf {p}=(p_1,p_2,p_3)\) is a given person’s probability assignment to the aforementioned hypotheses, with \(p_i\) the probability assigned to \(H_i\), this person’s expected \(\mathcal {V}\)-score for a probability assignment \(\mathbf {p}^*\!\) to the same hypotheses is given by the function
Again without loss of generality, assume that the hypotheses are ordered by their distances from each other, with \(H_2\) being equally far from \(H_1\) and \(H_3\), and \(H_1\) and \(H_3\) being twice as far from each other as they are from \(H_2\). Then \(w_{11}=w_{33}\), \(w_{21}=w_{23}\), \(w_{31}=w_{13}\), and \(w_{12}=w_{32}\), so that we can simplify notation by defining \(w_1:=w_{11}=w_{33}\); \(w_2:=w_{21}=w_{23}\); \(w_3:=w_{31}=w_{13}\); \(w_4:=w_{12}=w_{32}\); and \(w_5:=w_{22}\). For \(\mathcal {V}\) to be proper, it must hold that \(\mathrm{arg\,min}_{\mathbf {p}^*}\!\mathbb {E}_{\mathbf {p}}[\mathcal {V}(\mathbf {p}^*)]=\mathbf {p}\), for any distribution \(\mathbf {p}\) on \(\{H_1,H_2,H_3\}\). To see whether this does hold, we use the method of Lagrange multipliers. Specifically, where \(f(\mathbf {p}^*)=p^*_1+p^*_2+p^*_3\), we must find values for \(p^*_1\), \(p^*_2\), \(p^*_3\), and \(\lambda \) such that \(\nabla \mathbb {E}_{\mathbf {p}}[\mathcal {V}(\mathbf {p}^*)] = \lambda \nabla f(\mathbf {p}^*)\) and \(f(\mathbf {p}^*)=1\). Calculating the first-order partial derivatives of \(\mathbb {E}_{\mathbf {p}}[\mathcal {V}(\mathbf {p}^*)]\), we find
Because \(\nabla f(\mathbf {p}^*)=\mathbf {1}\), we have \((\partial /\partial p^*_i)\mathbb {E}_{\mathbf {p}}[\mathcal {V}(\mathbf {p}^*)]=\lambda \) for all \(i\leqslant 3\). So in particular, expanding the partial derivatives in \(p^*_1\) and \(p^*_3\) and dividing both by 2, we have
and hence
Suppose that \(\mathcal {V}\) is proper, so that \(\mathbb {E}_{\mathbf {p}}[\mathcal {V}(\mathbf {p}^*)]\) reaches its minimum if \(p_1 = p^*_1\), \(p_2 = p^*_2\), and \(p_3 = p^*_3\). Then there must be values for the \(w_i\) such that
However, factoring the left-hand side yields
This equals 0 iff either (i) \(p_1=p_3\) or (ii) \(w_1=w_4\), where the latter follows from the fact that the condition that the right-hand factor equals 0 can be rewritten as \(w_1(1-p_1-p_3)=w_4 p_2\), in conjunction with the fact that the \(p_i\) sum to 1. Because, as said, for \(\mathcal {V}\) to be proper, it must hold for all \(\mathbf {p}\) that \(\mathrm{arg\,min}_{\mathbf {p}^*}\!\mathbb {E}_{\mathbf {p}}[\mathcal {V}(\mathbf {p}^*)]=\mathbf {p}\), we may pick a \(\mathbf {p}\) such that \(p_1\ne p_3\), thereby violating (i). As for (ii), note that whichever precise values the \(w_i\) assume, \(w_1\) must be smaller than 1/3 (given that it is assigned to the supposed truth) and \(w_4\) must be greater than 1/3 (given that it is assigned to the two hypotheses supposed false). Consequently, on the supposition that \(\mathcal {V}\) is proper, we can minimize \(\mathbb {E}_{\mathbf {p}}[\mathcal {V}(\mathbf {p}^*)]\) subject to the given constraint iff the truthlikeness weights assigned by the rule do not reflect truthlikeness in a minimally adequate sense. By assumption, the weights do reflect truthlikeness in a minimally adequate sense. Given that we made no further assumptions about \(\mathcal {V}\), it follows that every VS rule is improper if it assigns truthlikeness weights in a minimally adequate fashion. \(\square \)
Remark
The above proof proceeds by constructing a specific counterexample involving three hypotheses that are assumed to stand in specific relations of truthlikeness to each other. To see that this assumption does not undermine the generality of the proof, we note that the said relations are perfectly possible according to all modern measures of truthlikeness (see page 5 for references). As a matter of fact, one can think of our earlier example concerning the possible grades (A, B, or C) a given student may receive as instantiating exactly the relations of truthlikeness that are assumed to hold in the counterexample. It is also to be noted, however, that not all known measures of truthlikeness will do for the purposes of the proof. Most famously, Tichý (1974) discovered that on Popper’s (1963) measure all false theories are equally far from the truth, contrary to what Popper had hoped to achieve with his measure.
Appendix B
In this appendix we prove
Theorem 2
Let S be the standard unit \((n-1)\)-simplex, let \(\mathbf {p}\) and \(\mathbf {p}^*\!\) range over vectors in S, and let \(m:S\rightarrow S\) be defined as follows:
with \(\delta _{ij}\) the Kronecker delta, and with \(w_{ij}>0\) for all i, j, and \(\sum _{i=1}^n\sum _{j=1}^n w_{ij} = 1\). Then there is a \(\mathbf {p}^+\!\in S\) such that (i) \(m(\mathbf {p}^+)=\mathbf {p}^+\!\), (ii) \(\mathbf {p}^+\) is unique, and (iii) \(\mathbf {p}^+\) depends only on the \(w_{ij}\).
Proof
Clause (i) follows from Brouwer’s (1911) fixed-point theorem, which (in one version) states that every continuous function from a simplex to itself has a fixed point. It does not follow from Brouwer’s theorem that the fixed point is unique.
To prove clause (ii), then, one first verifies that the function that is being minimized at each step on the way to the fixed point has the Hessian
This is a diagonal matrix, so its eigenvalues are the diagonal elements, which, given the constraints on the \(p_i\) and \(w_{ij}\), can be seen to be all necessarily positive. Therefore, the Hessian is positive definite everywhere, and given that a simplex is a convex set, it follows that the function that is minimized is strictly convex, and hence the minimum it reaches is unique. So, at each step toward the fixed point, a unique minimum is reached. As a result, the minimum reached at the fixed point is unique as well.
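The linear-algebra fact relied on here — that a diagonal matrix has its diagonal entries as eigenvalues, so positive diagonal entries suffice for positive definiteness — can be checked numerically; the diagonal entries below are arbitrary positive stand-ins for the Hessian's actual entries:

```python
def is_positive_definite_diagonal(diag):
    """For a diagonal matrix the eigenvalues are exactly the diagonal
    entries, so positive definiteness reduces to their positivity."""
    return all(d > 0 for d in diag)

# Quadratic-form check: x^T D x > 0 for any nonzero x when all
# diagonal entries of D are positive.
diag = [0.4, 1.3, 2.7]   # arbitrary positive stand-in entries
x = [1.0, -2.0, 0.5]     # arbitrary nonzero vector
quad_form = sum(d * xi * xi for d, xi in zip(diag, x))
print(is_positive_definite_diagonal(diag), quad_form > 0)
```

A positive-definite Hessian on a convex domain is what licenses the strict-convexity step, and strict convexity in turn gives the uniqueness of the minimum.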
For clause (iii), finally, note that at the fixed point the function that is being minimized is of the form
Because the fixed point \(\mathbf {p}^+\) is a minimum, it holds that \(\nabla m^+(\mathbf {p}^+)=\mathbf {0}\). We obtain a system of n polynomial equations with n variables and with the \(w_{ij}\) as coefficients by setting \((\partial / \partial p_i)m^+(\mathbf {p}^+)=0\), for all \(i\leqslant n\). By the first two clauses, this system has a unique solution, which is expressible strictly in terms of the coefficients. \(\square \)
Douven, I. Scoring in context. Synthese 197, 1565–1580 (2020). https://doi.org/10.1007/s11229-018-1867-8