Mathematics and Statistics in the Social Sciences

Stephan Hartmann† and Jan Sprenger‡

Over the years, mathematics and statistics have become increasingly important in the social sciences.1 A look at history quickly confirms this claim. At the beginning of the 20th century most theories in the social sciences were formulated in qualitative terms while quantitative methods did not play a substantial role in their formulation and establishment. Moreover, many practitioners considered mathematical methods to be inappropriate and simply unsuited to foster our understanding of the social domain. Notably, the famous Methodenstreit also concerned the role of mathematics in the social sciences. Here, mathematics was considered to be the method of the natural sciences from which the social sciences had to be separated during the period of maturation of these disciplines. All this changed by the end of the century. By then, mathematical, and especially statistical, methods were standardly used, and their value in the social sciences had become relatively uncontested. The use of mathematical and statistical methods is now ubiquitous: Almost all social sciences rely on statistical methods to analyze data and form hypotheses, and almost all of them use (to a greater or lesser extent) a range of mathematical methods to help us understand the social world. A further indication of the increasing importance of mathematical and statistical methods in the social sciences is the formation of new subdisciplines, and the establishment of specialized journals and societies.
Indeed, subdisciplines such as Mathematical Psychology and Mathematical Sociology emerged, and corresponding journals such as The Journal of Mathematical Psychology (since 1964), The Journal of Mathematical Sociology (since 1976), Mathematical Social Sciences (since 1980) as well as the online journals Journal of Artificial Societies and Social Simulation (since 1998) and Mathematical Anthropology and Cultural Theory (since 2000) were established. What is more, societies such as the Society for Mathematical Psychology (since 1976) and the Mathematical Sociology Section of the American Sociological Association (since 1996) were founded. Similar developments can be observed in other countries.

† Tilburg Center for Logic and Philosophy of Science, Tilburg University, P.O. Box 90153, 5000 LE Tilburg, The Netherlands, e-mail: s.hartmann@uvt.nl, webpage: www.stephanhartmann.org.
‡ Tilburg Center for Logic and Philosophy of Science, Tilburg University, P.O. Box 90153, 5000 LE Tilburg, The Netherlands, e-mail: j.sprenger@uvt.nl, webpage: www.laeuferpaar.de.
1 In our usage, 'social science' includes disciplines such as anthropology, political science, and sociology, but also economics and parts of linguistics and psychology.

The mathematization of economics set in somewhat earlier (Vazquez 1995; Weintraub 2002). However, the use of mathematical methods in economics started booming only in the second half of the last century (Debreu 1991). Contemporary economics is dominated by the mathematical approach, although a certain style of doing economics has come increasingly under attack in the last decade or so. Recent developments in behavioral economics and experimental economics can also be understood as a reaction against the dominance (and limitations) of an overly mathematical approach to economics. There are similar debates in other social sciences.
It is, however, important to stress that problems of one method (such as axiomatization or the use of set theory) can hardly be taken as a sign of bankruptcy of mathematical methods in the social sciences tout court.

This chapter surveys mathematical and statistical methods used in the social sciences and discusses some of the philosophical questions they raise. It is divided into two parts. Sections 1 and 2 are devoted to mathematical methods, and Sections 3 to 7 to statistical methods. As several other chapters in this handbook provide detailed accounts of various mathematical methods, our remarks about the latter will be rather short and general. Statistical methods, on the other hand, will be discussed in depth.

1. A Plurality of Mathematical Methods

Social scientists use a wide variety of mathematical methods.2 Given the space constraints of the present chapter, it is impossible to list them all, give examples, examine their domain of applicability, and discuss their methodological problems. Instead, we broadly distinguish between three different kinds of methods: (i) methods imported from the formal sciences, (ii) methods imported from the natural sciences, and (iii) social scientific methods sui generis. We review them in turn. Methods imported from the formal sciences include (linear) algebra, calculus (including differential equations), the axiomatic method, logic and set theory, probability theory (including Markov chains), linear programming, topology, graph theory, and complexity theory. All these methods have important applications in the social sciences.3 The axiomatic method nicely illustrates what one can call the mathematician's approach to the social sciences. Here, a set of general principles is formulated, which enables the study of the formal aspects of the system under investigation. The tradition of proving impossibility theorems in social choice theory is a good example of this approach.
In recent years, we have seen the importation of various methods from computer science into the social sciences. There is also a strong trend within computer science to address problems from the social sciences. An example is the recent establishment of the new interdisciplinary field of computational social choice, which is dominated by computer scientists.4 Interestingly, much work in computational social choice uses analytical and logical methods. There is, however, also a strong trend in the social sciences to use powerful numerical and simulation methods to explore complex and (typically) dynamical social phenomena. The reason for this is, of course, the availability of high-powered computers. But not all social scientists follow this trend. Many economists, in particular, are reluctant to employ simulation methods and do not consider them appropriate tools for the study of economic systems.5

2 Throughout this chapter, we use the word 'method' in a rather broad sense, including general methods such as the axiomatic method as well as more specific tools like utility theory. The latter is a method in the sense that it is used to address certain questions that arise in the social sciences.
3 For a lucid exposition of many of these methods, and interesting though somewhat outdated examples from the social sciences, see Luce and Suppes (1968).
4 See http://www.illc.uva.nl/COMSOC/

Methods imported from the natural sciences are becoming increasingly popular in the social sciences. These methods are more specific than the formal methods mentioned above. They involve substantial assumptions that happen – or so it is claimed – to be fulfilled in the social domain. These methods comprise tools for the study of multi-agent systems, the theory of complex systems, non-linear dynamics, methods developed in synergetics (Weidlich 2006) and, more recently, in econophysics (Mantegna and Stanley 1999).
The applicability of these methods follows from the 'observation' that societies are nothing but many-body systems (just as a gas is a many-body system composed of molecules) that exhibit certain features such as the emergence of ordering phenomena. Hence, these features can be accounted for in terms of a statistical description, just like the behavior of gases and other many-body systems which are studied in the natural sciences. Such methods are also used in new interdisciplinary fields such as environmental economics. Besides providing various methods for the study of social phenomena, the natural sciences also inspired a certain way of addressing a problem. Meanwhile, model building is considered to be the core activity in the social sciences.6 The developed models contain idealized assumptions, and their consequences are often obtained with the help of simulations. Due to its striking similarity with physics, we call this approach the physicist's approach to social science, and contrast it with the mathematician's approach to social science, described above. Finally, there are mathematical methods that emerged from problems in the social sciences. These include powerful instruments such as decision theory7, utility theory, game theory8, measurement theory (Krantz et al. 1971), social choice theory (Gaertner 2006), and judgment aggregation (List and Puppe 2009). The latter were invented by social scientists for social scientists, with a specific social-science application in mind. They help to address specific problems that arise in the context of the social sciences and that did not have an analogue in the natural sciences when they were invented. Only later did some of these theories turn out to be useful in the natural sciences as well, or come to be combined with insights from the natural sciences.
Evolutionary game theory is a case in point.9 Other interesting examples include the study of quantum games (Piotrowski and Sladkowski 2003) and the application of decision theory in fundamental physics (Wallace 2010). Many of the methods that emerged from problems in the social sciences are in line with the mathematician's approach, although the physicist's approach is increasingly gaining ground.

5 For a discussion of computer simulations in the social sciences, see Hegselmann et al. (1996). In this context it is interesting to study the influence of the work done at the Santa Fe Institute on mainstream economics. See e.g. Anderson et al. (1988). See also Waldrop (1992).
6 For a more detailed discussion of modeling in the social sciences, see ch. 29 ("Local models versus global theories, and their assessment") of this handbook. For a general review of models in science, see Frigg and Hartmann (2006).
7 See ch. 15 ("Rational choice and its alternatives") of this handbook.
8 See ch. 16 ("Game theory") of this handbook.
9 See ch. 17 ("Evolutionary approaches") of this handbook.

Interestingly, there are other methods that cannot be attached to one specific science. Network theory is a case in point: As networks are studied in almost all sciences, parallel developments took place, and much can be learned by exploring achievements in other fields (Jackson 2008).10 Having listed a large number of methods, the question arises which method is appropriate for a certain problem. This question can only be answered on a case-by-case basis, and it is part of the ingenuity of the scientist to pick the best method. But let us stress the following: While some scientists ask themselves which problems they can address with their favorite method, the starting point should always be a specific problem. Once a problem is chosen, the scientist picks the method that best helps to solve it.
To have some choice, it is important that scientists are acquainted with a variety of different methods. Mathematics and related disciplines provide the scientist with a toolbox (to use a popular metaphor) out of which they have to pick an appropriate tool.

2. Why Mathematize the Social Sciences?

A historically important reason for the mathematization of the social sciences was that mathematics is associated with precision and objectivity. These are (arguably) two requirements any science should satisfy, and so the mathematization of the social sciences was considered a crucial step for the transformation of the social sciences into a real science. Some such view has been defended by many authors. Luce and Suppes (1968), for example, provide a similar argument for the importance of theoretical axiomatization in the social sciences. Here, mathematics is used to precisely formulate a theory. By doing so, the latter's structure becomes transparent, and the relationships that hold between the various variables can be clearly specified or inferred. Above all, mathematics provides clarity, generality, and rigor. There are many ways to represent a theory. For a long time, philosophers have championed the syntactic view, requiring theories to be represented in first-order logic; or the semantic view in its various forms, identifying a theory with the collection of its models (Balzer et al. 1987; Suppes 2000). While such reconstructions may be helpful for devising a consistent version of a theory, it usually suffices for all practical purposes to state a set of equations that constitute the mathematical part of the theory. The pioneers of the mathematization in the social sciences also developed measurement theory (Krantz et al.
1971), which takes as its starting point the idea that science is crucially about measurement.11 Contrary to this tradition, it has been argued that the subject matter of the social sciences does not require the level of precision demanded by the natural sciences, and that the social sciences are, and should rather be, inexact (cf. Hausman 1992). After all, what works in the natural sciences may well not work in the social sciences.

10 See also ch. 18 ("Networks") of this handbook.
11 This view can be traced back to Kelvin's dictum "...when you can measure what you are speaking about and can express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind." See Merton et al. (1984). It gave rise to much controversy in the philosophy of social science, reflecting deeper issues in the philosophy of mind and metaphysics.

While Sir Karl Popper, one of the towering figures in the methodology of social science, did not promote the mathematization of the social sciences in the first place, it is clear that it nevertheless plays an enormous role in his philosophy. Given his focus on prediction and falsifiability (Hands 2008), it makes sense to prefer a theory that is mathematized to a theory that is not. This is due to the fact that it is generally much easier to obtain falsifiable conclusions from clearly stated propositions than from vague and informal claims. It is a mistake, however, to overestimate the role of mathematics in the social sciences. In the end, mathematics provides the social scientist only with tools, and the result of using these tools will crucially depend on explicit or implicit assumptions. This is a variant of the well-known GIGO principle from computer science ("garbage in, garbage out"). All assumptions are informally motivated. Formulating them in the language of mathematics just helps to state them more precisely.
Once the assumptions are formulated mathematically, the machinery of mathematics helps to draw inferences in an automated way. This holds for analytical calculations as well as for numerical studies, including computer simulations (Frigg and Reiss 2010; Hartmann 1996). This brings us to another advantage of mathematical methods in the social sciences. While non-formal theories often remain rather simplistic and highly idealized, formal theories can be made increasingly complicated and realistic, reflecting the messiness of our world. The mathematical machinery then helps to draw inferences which could not be obtained without it (Humphreys 2004). Often, different assumptions of a theory or model pull in opposite directions, and it is not clear which one will be 'stronger' in a specific situation. However, when implemented in a mathematical model, it can be calculated what happens in which part of the parameter space. And so the availability of powerful computers allows the systematic study of more realistic models. There is, however, also a danger associated with this apparent advantage. Given the availability of powerful computers, scientists may be tempted to construct very complex models. But while these models may do well in terms of empirical adequacy, it is not so clear that they also provide understanding. This is often provided by rather simple models (sometimes called 'toy models'), i.e., models that pick only one crucial aspect of a system and help us get a feel for its implications.12

12 For more doubts about some of the uses of simulations in the social sciences, see Humphreys (2004).

There are several other reasons for mathematization in the social sciences:

a. Theory Exploration. Once a theory is represented in mathematical terms, the mathematical machinery can be employed to derive its qualitative and quantitative consequences. This helps to better understand what the theory is all about and what it entails about the world. The deductive consequences of the theory (and additional assumptions that have to be made) can be divided into retrodictions and predictions. For retrodictions, the question arises which additional assumptions have to be made to obtain a certain (already measured) value of a variable.

b. Theory Testing. The predictions of a mathematically formulated theory can then be used to test the theory by confronting its consequences with relevant data. In the end, the theory will be confirmed or disconfirmed, or, to put it in Popperian terms, 'corroborated' or 'falsified'.

c. Heuristics. Once the mathematical structure of a theory is apparent, a look at it may reveal analogies to other phenomena. This may inspire additional investigations, and lead to a better understanding of the class of phenomena under investigation. Also, a numerical study of a theory may suggest new patterns that can be incorporated into the assumptions of another theory.

d. Explanation. While it is controversial what a scientific explanation is, it is clear that – once the theory is mathematically formulated – a phenomenon can be fitted into a larger theoretical framework (as the unification account demands) or a causal story can be read off from it (Kitcher 1989; Strevens 2009; Woodward 2005).

This list suggests the existence of interesting parallels between the use of mathematics in the natural and the social sciences. Indeed, mathematization has similar functions in both kinds of sciences. There are further parallels: In both kinds of sciences, we find a variety of methods ranging from the axiomatic method to the use of computer simulations. Moreover, the models that are constructed range, in both kinds of sciences, from toy models to models that fit large amounts of data (e.g., in econometrics).
The latter is achieved with the help of statistical methods, which we will discuss in the following sections. The similarities (and dissimilarities!) between the use of mathematics in the natural and social sciences are in need of further philosophical exploration. We hope that future research will shed more light on these questions.

3. The Development of Statistical Reasoning

Statistical reasoning is nowadays a central method of the social sciences. First, it is indispensable for evaluating experimental data, e.g., in behavioral economics or experimental psychology. For instance, psychologists might want to find out whether men act, in a certain situation, differently from women, or whether there are causal relationships between violent video games and aggressive behavior. Secondly, the social sciences heavily use statistical models as a modeling tool for analyzing empirical data and predicting future events, especially in econometrics and operational research, but recently also in the mathematical branches of psychology, sociology, and the like. For example, time series and regression models relate a number of input (potential predictor) variables to output (predicted) variables. Sophisticated model comparison procedures try to elicit the structure of the data-generating process, eliminate some variables from the model, select a 'best' model and, finally, fit the parameter values to the data. Still, the conception of statistics as an inferential tool is quite young: Throughout the 19th century, statistics was mainly used as a descriptive tool to summarize data and fit models. While, in inferential statistics, the focus lies on testing scientific hypotheses against each other, or quantifying evidence for or against a certain hypothesis, descriptive statistics focuses on summarizing data and fitting the parameters of a given model to a set of data.
The most famous example is perhaps Gauss' method of least squares, a procedure for fitting a straight line to a data set (x_n, y_n). Other important descriptive statistics are contingency tables, effect sizes, and tendency and dispersion measures. Descriptive statistics were, however, "statistics without probability" (Morgan 1987), or as one might also say, statistics without uncertainty. In the late 19th and early 20th century, science was believed to be concerned with certainty, with the discovery of invariable, universal laws. This left no place for uncertain reasoning. Recall that, at that time, stochastic theories in the natural sciences, such as statistical mechanics, quantum physics, or laws of inheritance, were still quite new, or not yet invented. Furthermore, there was a hope of reducing them to more fundamental, deterministic regularities, e.g., to take the stochastic nature of statistical mechanics as an expression of our imperfect knowledge, uncertainty, and not as the fundamental regularities that govern the motion of molecules. Thus, statistical modeling contradicted the nomothetic ideal (Gigerenzer 1987), inspired by Newtonian and Laplacean physics, of establishing universal laws. Therefore, statistics was considered a mere auxiliary, imperfect device, a mere surrogate for proof by deduction or experiment. For instance, the famous analysis of variance (ANOVA) obtained its justification in the nomothetic view through its role in causal inference and elucidating causal laws. Interestingly, these views were held even in the social sciences, although the latter dealt with a reality that was usually too complex to isolate causal factors in laboratory experiments. Controlling for external impacts and confounders poses special problems to the social sciences, whose domain is not inanimate objects, but humans. The search for deterministic, universal laws in the social sciences might thus seem futile – and this is probably the received view today.
Yet, in the first half of the 20th century, many social scientists thought differently. Statistics was needed to account for measurement errors and omitted causal influences in a model. But it was thought to play a merely provisional role: "statistical devices are to be valued according to their efficacy in enabling us to lay bare the true relationship between the phenomena under consideration. An ideal method would eliminate all of the disturbing factors." (Schultz 1928, 33) Thus, the view of statistics was eliminativist: As soon as it has done the job and elucidated the laws at which we aim, we can dismiss it. In other words, the research project consisted in eliminating probabilistic elements, instead of discovering statistical laws and regularities or modeling physical quantities as probabilistic variables with a certain distribution. This methodological presumption, taken from 19th century physics, continued to haunt social sciences far into the first half of the 20th century. Economics, as the "physics of social sciences", was particularly affected by that conception (Morgan 2002).

In total, there are three main reasons for inferential statistics' recognition as a central method of the social sciences:

1. The advances in mathematical probability, as summarized in the seminal work of Kolmogorov (1933/56).

2. The inferential character of many scientific questions, e.g., about the existence of a causal relationship between variables X and Y. There was a need for techniques of data analysis that ended up with an inference or decision, rather than with a description of a correlation.

3. The groundbreaking works by particular pioneer minds, such as Tinbergen and Haavelmo in economics (Morgan 1987).

The following sections investigate the different ways in which inferential statistics has been spelled out, with a focus on the most prominent school in modern social science: Fisher's method of significance testing.

4. Significance Tests and Statistical Decision Rules

One of the great conceptual inventions of the founding fathers of inferential statistics was the sampling distribution (e.g., Fisher 1935). In the traditional approach (e.g., classical regression), there was no need for the concept of a sample drawn from a larger population. Instead, the modeling process directly linked the observed data to a probabilistic model. In the modern understanding, the actual data are just a sample drawn out of a much larger, hypothetical population about which we want to make an inference. The rationale for this view consists in the idea that scientific results need to be replicable. Therefore, we have to make an inference about the comprehensive population (or the data-generating process, for that matter) instead of making an 'in-sample' inference, whose validity is restricted to the particular data we observed. This idea of a sampling distribution proved crucial for what is known today as frequentist statistics. That approach strongly relies on the idea of the sampling distribution, outlined in the seminal works of Fisher (1925, 1935, 1956) and Neyman and Pearson (1933, 1967), parting ways with the classical accounts of Bayes, Laplace, Venn and others. In frequentist statistics, there is a sharp division between approaches that focus on inductive behavior, such as the Neyman-Pearson school, and those that focus on inductive inference, such as Fisherian statistics. To elucidate the difference, we will present both approaches in a nutshell. Neyman and Pearson (1933) developed a behavioral framework for deciding between two competing hypotheses. For instance, take the hypothesis H0 that a certain learning device does not improve the students' performance, and compare it to the hypothesis H1 that there is such an effect. The outcome of the test is interpreted as a judgment on the hypothesis, or the prescription to take a certain action ("accept/reject H0").
They contrast two hypotheses H0 and H1 and develop testing procedures such that the probability of erroneously rejecting H0 in favor of H1 is bounded at a certain level α, and that the probability of erroneously rejecting H1 in favor of H0 is, given that constraint, as low as possible. In other words, Neyman and Pearson aim at maximizing the power of a test (i.e., the chance of a correct decision for H1) under the condition that the level of the test (the chance of an incorrect decision for H1) is bounded at a real number α. Thus, they developed a more or less symmetric framework for making a decision between competing hypotheses, with the aim of minimizing the chance of a wrong decision. While such testing procedures apply well to issues of quality control in industrial manufacturing and the like, the famous biologist and statistician Ronald A. Fisher (1935, 1956) argued that they are not suitable for use in science. First, a proper behaviorist, or decision-theoretic, approach has to determine costs for faulty decisions (and Neyman and Pearson do this implicitly, by choosing the level α of a test). This involves, however, reference to the purposes to which we want to put our newly acquired knowledge. For Fisher, this is not compatible with the idea of science as pursuit of truth. Statistical inference has to be "convincing to all freely reasoning minds, entirely independent of any intentions that might be furthered by utilizing the knowledge inferred" (Fisher 1956, 103). Second, in science, a judgment on the truth of a hypothesis is usually not made on the basis of a single experiment. Instead, we obtain some provisional result which is refined through further analysis. By their behavioral rationale and by making a 'decision' between two hypotheses, Neyman and Pearson insinuate that the actual data justify a judgment on whether H0 or H1 is true.
Such judgments have, according to Fisher, to be suspended until further experiments confirm the hypothesis, ideally using varying auxiliary assumptions and experimental designs. Third, Neyman and Pearson test a statistical hypothesis against a definite alternative. This leads to some seemingly paradoxical results. Take, for instance, the example of a normal distribution with known variance σ² = 1 where the hypothesis about the mean H0: μ = 0 is tested against the hypothesis H1: μ = 1. If the average of the observations lies, say, around -5, it appears that neither H0 nor H1 should be 'accepted'. Nevertheless, the Neyman-Pearson rationale contends that, in such a situation, we have to accept H0 because the discrepancy to the actual data is less striking than with H1. In such a situation, when H0 offers a poor fit to the data, such a decision is arguably weird. Summing up, Fisher disqualifies Neyman and Pearson's decision-theoretic approach as a mathematical "reinterpretation" of his own significance tests, which is utterly inappropriate for use in the sciences. In fact, he suspects that Neyman and Pearson would not have come up with their approach, had they had "any real familiarity with work in the natural sciences" (Fisher 1956, 76). Therefore, he developed a methodology of his own which proved extremely influential in the natural as well as the social sciences. His first two books, Statistical Methods for Research Workers (1925) and The Design of Experiments (1935), quickly went through many reprints and shaped the applications of statistics in the sciences for decades. The core of his method is the test of a point null hypothesis, or significance test. The objective here is to tell chance effects from real effects. To this end, we check whether a null (default, chance) hypothesis is good enough to fit the data.
For instance, we want to test the effects of a new learning device on students' performance, and we start with the default assumption that the new device yields no improvement. If that hypothesis is apparently incompatible with the data (if the results are 'significant'), we conclude that there is some effect in the treatment. The core of the argument consists in Fisher's Disjunction: "Either an exceptionally rare chance has occurred, or the theory [= the null hypothesis] is not true." (Fisher 1956, 39) In other words, the occurrence of a result that is very unlikely to be a product of mere chance (students using the device scoring much better than the rest) strongly speaks against the null hypothesis that there is no effect. Significant findings under the null suggest that there is more than pure chance involved, that there is some kind of systematic effect going on. As we will see below, this disjunction should be regarded with great caution, and it has been the source of many confusions and misunderstandings.

Figure 1. Left figure: The null hypothesis H0: N(0,1) (full line) is tested at the 5% level against the alternative H1: N(1,1) (dashed line). Right figure: a Fisherian significance test of H0 against an unspecified alternative. The shaded areas represent the set of results where H0 is rejected in favor of H1, respectively where the results speak 'significantly' against H0.

Figure 1 illustrates the difference between Neyman-Pearsonian and Fisherian tests for the case of testing hypotheses on the mean value of a Normal distribution. The probability p(x) := P(T > T(x) | H0) gives the significance level which the observed value x achieves under H0, with respect to a function T that measures distance from the null hypothesis H0. The probability p(x) is also often called the p-value induced by x, and is supposed to give a rough idea of the tenability of the null. The higher the discrepancy, the more significant the results.
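The contrast between the two kinds of test can be made concrete with a small numerical sketch. The Python snippet below is only an illustration under stated assumptions and is not taken from the chapter: it assumes a single observation x, a one-sided rejection region for the Neyman-Pearson test of H0: N(0,1) against H1: N(1,1), and the distance function T(x) = |x| for the Fisherian test (so that both tails count against H0). All function names are hypothetical.

```python
from math import erfc, sqrt
from statistics import NormalDist

# Hypotheses from the running example: H0 = N(0,1), H1 = N(1,1).
H0 = NormalDist(mu=0.0, sigma=1.0)
H1 = NormalDist(mu=1.0, sigma=1.0)


def np_critical_value(alpha: float) -> float:
    """Threshold c with P(X > c | H0) = alpha (assumed one-sided test)."""
    return H0.inv_cdf(1.0 - alpha)


def np_power(alpha: float) -> float:
    """Power of the test: P(X > c | H1), the chance of correctly deciding for H1."""
    return 1.0 - H1.cdf(np_critical_value(alpha))


def np_decision(x: float, alpha: float = 0.05) -> str:
    """Neyman-Pearson rule: one of the two hypotheses is always 'accepted'."""
    return "reject H0" if x > np_critical_value(alpha) else "accept H0"


def fisher_p_value(x: float) -> float:
    """p(x) = P(T > T(x) | H0) with the assumed distance T(x) = |x - 0|.

    For a standard normal null this equals erfc(|x| / sqrt(2)), which stays
    numerically accurate far out in the tails.
    """
    return erfc(abs(x - H0.mean) / sqrt(2.0))
```

Under these assumptions, an observation of x = -5 leads the Neyman-Pearson rule to 'accept H0', while the Fisherian p-value for the same observation is well below one in a million. This mirrors the seemingly paradoxical case discussed above, in which H0 is accepted although it fits the data poorly.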
The rationale underlying Fisher's Disjunction displays a striking similarity to Karl Popper's falsificationist philosophy of science: A hypothesis H0, which should be as precise and ambiguity-free as possible, is tested by checking its observational implications. If our observations contradict H0, we reject it and replace it by another hypothesis. However, this understanding of falsificationism only applies to testing deterministic hypotheses. Observations are never strictly incompatible with probabilistic hypotheses; they are just very unlikely under them. Therefore, Popper (1959, 191) expanded the falsificationist rationale by saying that we regard a hypothesis H0 as false when the observed results are improbable enough. This is exactly the rationale of Fisher's Disjunction. Notably, Fisher formulated these ideas as early as Popper, and independently of him. The methodological similarity between Popper's and Fisher's views becomes even more evident in the following quote: "[...] it should be noted that the null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis." (Fisher 1935, 19) This denial of positive confirmation of the null by non-significant results fits well not only with Popper's view on confirmation and corroboration, but also with a more modern textbook citation: "Although a significant departure [from the null] provides some degree of evidence against a null hypothesis, it is important to realize that a 'nonsignificant' departure does not provide positive evidence in favor of that hypothesis. The situation is rather that we have failed to find strong evidence against the null hypothesis."
(Armitage and Berry 1987, 96) Thus, the symmetry of the Neyman-Pearsonian approach is broken: While Neyman-Pearson tests end up 'accepting' either hypothesis (and building action on the basis of this decision), Fisherian significance tests understand a significant result as strong evidence against the null hypothesis, whereas a non-significant result does not count as evidence for the null. The attentive reader might have noticed that Fisher's Disjunction is actually inconsistent with his own criticism of the Neyman-Pearson approach. Recall that Fisher argued that significant outcomes do not deliver final verdicts on the feasibility of the null hypothesis. Rather, they state provisional evidence against the null. But how is this compatible with the idea of 'disproving the null' by means of significance tests? To reconcile both positions, Fisher has to admit some abuse of language: "[...] if we use the term rejection for our attitude to such a [null] hypothesis, it should be clearly understood that no irreversible decision has been taken; that as rational beings, we are prepared to be convinced by future evidence that [...] in fact a very remarkable and exceptional coincidence had taken place." (Fisher 1959, 35) In light of these ambiguities, it is not surprising that Fisher's writings have been the source of many misunderstandings, and that scientists sometimes use fallacious practices or interpretations while believing that these practices have been authorized by a great statistician. Before describing the problems of significance tests, however, we would like to shed light on the contrast between frequentist statistics, which comprises Fisher's approach as well as the Neyman-Pearson paradigm, and the rivaling school of Bayesian statistics.

5. Fisher versus Bayes

Bayesian inference is a school of statistics with great significance for some theoretical branches of the social sciences, such as decision theory, game theory, and the psychology of human reasoning.
Since the principles of Bayesian inference are explained in the chapter on decision theory, we restrict ourselves to a brief outline of the basic idea. Bayesian statistics is, essentially, a theory of belief revision: Prior beliefs on the credibility of a hypothesis H are represented by mathematical probabilities, modified in the light of incoming evidence E and transformed into posterior beliefs (represented by a conditional probability, P(H|E)). The relevant formula that expresses how these beliefs are changed is Bayes's Theorem:

$$P(H|E) = \frac{P(H)\,P(E|H)}{P(E)} = \frac{P(H)\,P(E|H)}{P(E|H)\,P(H) + P(E|{\sim}H)\,P({\sim}H)} = \left[1 + \frac{1 - P(H)}{P(H)} \cdot \frac{P(E|{\sim}H)}{P(E|H)}\right]^{-1}.$$

Thus, the sampling distributions of E under H and ~H are combined with the prior probability of H in order to arrive at a comprehensive verdict on the credibility of H in the light of evidence E. Modern philosophers of statistics – but also scientists themselves – have stressed the contrast between frequentist and Bayesian inference, depicting them as mutually exclusive schools of statistics (Howson and Urbach 2006; Mayo 1996). The polemics which both Bayesians and frequentists use to mock their respective opponents adds to the image of statistics as a deeply divided discipline where two enemy camps are quarreling about the right foundations of inductive inference. In particular, Bayesians have been eager to point out the limitations and shortcomings of frequentist inference for scientific applications, such as in the seminal paper of Edwards, Lindman and Savage (1963). Notably, this influential methodological contribution appeared not in a statistics journal, but in Psychological Review! On the other hand, frequentist criticisms of Bayesian inference read equally harshly. These heated debates do not do justice to the intentions of the founding fathers, who were often more pragmatic than one might retrospectively be inclined to think. Take the case of Ronald A. Fisher.
Although Fisher is correctly perceived as one of the founding fathers of frequentist inference, it is wrong to see him as an anti-Bayesian. True, Fisher objects to the use of prior probabilities in scientific inference. But it is important to see why and under which circumstances. In principle, he says, there is nothing wrong with using Bayes' formula to revise one's belief. It is just practically impossible to base a sound scientific judgment on prior probabilities. For how shall we defend a specific assignment of prior beliefs vis-à-vis our fellow scientists if they are nothing more than psychological tendencies? Most often, there is no knowledge available on which we could base specific prior beliefs (1935, 6-7; 1956, 17). That said, Fisher speaks very respectfully about Bayes and his framework: Bayesian inference may be appropriate in science if genuine prior knowledge is available (1935, 13), and he admits the rationality of the subjective probability interpretation in spite of his own inclination to view probabilities as relative frequencies. It is therefore important to note that the debate between frequentist (here: Fisherian) and Bayesian statistics is not in the first place a debate about the principles of inductive inference in general, but a debate about which kind of inference is more appropriate for the purposes of science. The following section will cast some doubts on the appropriateness of pure, unaided significance testing in the social sciences.

6. The Pitfalls of Significance Testing

The practice of significance tests has been dominating experiments in the social sciences for more than half a century. Journal editors and referees ask for significance tests and p-values (quantities describing the level of significance), standardizing experimental reports in a wide variety of branches of science (econometrics, experimental psychology, behavioral economics, etc.).
Alternative approaches, e.g., the application of Bayesian or likelihoodist statistics to the evaluation of experiments, have little chance of being published. These publication practices in the last decades are at odds with the existence of a long methodological debate on significance testing in the social sciences (e.g. Rozeboom 1960). In that debate, statisticians and social scientists – mostly mathematically educated psychologists – have repeatedly criticized the misuse of significance tests in the evaluation and interpretation of scientific experiments. Before going into the details of that debate, we briefly list some apparent advantages of significance testing.

a. Objectivity. Significance tests avoid the subjective probabilities of Bayesian statistics. Thereby, the observed levels of significance seem to be an objective standard for evaluating the experiment, e.g., for telling a chance effect from a real one.
b. No Alternative Hypotheses. Significance tests are a means of testing a single, exact hypothesis, without specifying a certain direction of departure (i.e., an alternative hypothesis). Therefore, significance tests detect more kinds of deviation from that hypothesis than Neyman-Pearson tests do.
c. Replicability. Significance tests address the issue of replicability – namely the significance level can be understood as the relative frequency of observing a more extreme result if (i) the null hypothesis were true and (ii) the trial were repeated very often.
d. Practicality. Significance tests are easy to implement, and significance levels are easy to compute.

However, it is not clear whether these advantages of significance tests are really convincing. We discuss some objections.

Fisher's Disjunction revisited.
The original example which Fisher used to motivate his famous disjunction was the hypothesis that the stars are evenly distributed in the sky, i.e., the chance that a star is in a particular area of the sky is proportional to the size of that area. Thus, if there are a lot of stars next to a particular star, such an event is unlikely to happen due to chance. Indeed, clusters of stars are frequently observed. According to Fisher's Disjunction, we may rule out the hypothesis of uniform distribution and conclude that stars tend to cluster. However, Hacking (1965, 81-82) has convincingly argued that such an application of Fisher's Disjunction is fallacious. Under the hypothesis of uniform distribution, every constellation of stars is extremely unlikely, and there are no likely vs. unlikely chances, but only 'exceptionally rare chances'. If Fisher's Disjunction were correct, we would, independent of the outcome, always have to reject the hypothesis of uniform distribution. This amounts to a reductio of significance testing, since, clearly, hypotheses that postulate a uniform distribution are testable, and they often occur in scientific practice. To circumvent Hacking's objection, we might interpret Fisher's Disjunction in a different way. For instance, we could read the 'exceptionally rare chance' as a chance that is exceptionally rare compared to other possible events, instead of 'a probability lower than a fixed value p'. Still, this does not help us in the present problem, because the uniform distribution postulates that all star constellations are equally likely or unlikely. Thus, the notion of a relatively rare chance ceases to apply (Royall 1997, 65-68). One might concede Hacking's objection for this special case and try to rescue significance tests in general by introducing a parameter of interest, μ. This is a standard situation in statistical practice.
For instance, let's take a coin flip model which has 'heads' and 'tails' as possible outcomes, and where the parameter μ denotes the propensity of the coin to come up heads. Under the null hypothesis H0: μ = 0.5, all sequences of heads and tails are equally likely, but still, it is ostensibly meaningful to say that 'HHHHHTTTTT' or 'THTHTHTHTH' provides less evidence against H0 than 'HHHHHHHHHH' does. The technical concept for implementing this intuition consists in calculating the chance of a transformation of the data that is a minimal sufficient statistic with respect to the parameter of interest μ, such as the number of heads or tails. Then we get the desired result that ten heads, but not five heads vs. five tails (in whatever order), constitute a significant finding against H0. Thus, there is no exceptionally rare chance as such – any such chance is relative to the choice of a parameter that determines the way in which the data are exceptional. This line of reasoning fits well with the above example, but it introduces implicit alternative hypotheses. When relativizing unexpectedness to a parameter of interest, we are committing ourselves to a specific class of potential alternative hypotheses – namely those hypotheses that correspond to the other parameter values. When applying Fisher's Disjunction, we do not judge the tenability of H0 'in general', without recourse to a specific parameter of interest or a class of alternatives. We always examine a certain way the data could deviate from the null. Thus, we are not testing the probability model H0 as such, but a particular aspect thereof, such as 'why that value of μ rather than another one?'. The choice of a parameter reveals a class of intended alternatives.13 This has some general morals: What makes an observation evidence against a hypothesis is not its low probability under this hypothesis, but its low probability compared to an alternative hypothesis.
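The coin example above can be sketched numerically. The snippet below is an illustration under stated assumptions, not a method taken from the text: the function names are hypothetical, and the p-value is computed two-sidedly as the total probability, under H0, of all head counts that are no more probable than the observed one.

```python
from math import comb

def sequence_probability(sequence, mu=0.5):
    """Probability of one particular sequence of 'H' and 'T' outcomes."""
    heads = sequence.count("H")
    tails = len(sequence) - heads
    return mu**heads * (1 - mu)**tails

def heads_count_p_value(heads, n, mu0=0.5):
    """Two-sided p-value based on the number of heads in n flips.

    Sums the probabilities, under H0: mu = mu0, of every head count
    that is at most as probable as the observed count.
    """
    probs = [comb(n, k) * mu0**k * (1 - mu0)**(n - k) for k in range(n + 1)]
    observed = probs[heads]
    # Small relative tolerance guards against floating-point ties.
    return sum(p for p in probs if p <= observed * (1 + 1e-9))

for seq in ("HHHHHTTTTT", "THTHTHTHTH", "HHHHHHHHHH"):
    print(seq, sequence_probability(seq),
          heads_count_p_value(seq.count("H"), len(seq)))
```

All three sequences have the same probability under H0, so the raw improbability of the data cannot distinguish them. Only after the data are reduced to the number of heads does 'HHHHHHHHHH' come out as significant (p = 2/1024, about 0.002), while five heads in ten flips yield a p-value of 1. The reduction to head counts is exactly where the implicit class of alternatives enters.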
An improbable event is not evidence against a hypothesis per se, but "[...] what it does show is that if there is any alternative hypothesis which will explain the occurrence of the sample with a more reasonable probability [...] you will be very much more inclined to consider that the original hypothesis is not true." (William S. Gosset ('Student') in private communication to Egon Pearson, quoted in Royall 1997, 68.) Thus, Fisher's Disjunction and the inference from relatively unlikely results to substantial evidence are caught in a dilemma: Either we run into the inconsistencies described above, or the choice of the test statistic reveals implicit alternatives to which the hypothesis is compared. In the latter case, the falsificationist heuristics of Fisher's Disjunction has to be replaced by an account of contrastive testing, and it is unclear to what extent the Fisherian framework of significance testing can claim any advantage vis-à-vis Neyman and Pearson's tests of two competing hypotheses.

The Base Rate Fallacy. Gigerenzer (1993) famously characterized the inner life of a scientist who uses statistical methods by means of an analogy from psychoanalysis: There is a Neyman-Pearsonian Super-Ego, a Fisherian Ego and a Bayesian Id. The Neyman-Pearsonian Super-Ego preserves a couple of unintuitive insights, e.g., that we cannot test a theory without specifying alternatives, and that significance tests only give us the probability of data given a hypothesis instead of an assessment of the hypothesis' credibility. The Bayesian Id is located at the other end of the spectrum, incorporating the researcher's desire for posterior probabilities of a hypothesis, as a measure of its tenability or credibility. The Ego is caught in the conflict between these extremes, and acts as the scientist's guide through reality. It adopts a Fisherian position where both extremes are kept in balance: Significance tests give neither behavioral prescriptions nor posterior probabilities.
Rather, they yield "a rational and well-defined measure of reluctance to the acceptance of the hypotheses they test" (Fisher 1956, 44). However, the Bayesian Id sometimes breaks through. As pointed out by Oakes (1986) and Gigerenzer (1993), most active researchers in the social sciences – even those with statistical education – tend to interpret significance levels (e.g. p = 0.01) as posterior probabilities of the null hypothesis, or at least as overwhelming evidence against the null. Why is this inference wrong? Assume we want to test a certain null hypothesis against a very implausible alternative, e.g., that the person under test has a very rare disease. So, the null denotes absence of the disease. Now, a highly sensitive test that is right about 99.9% of the time indicates the presence of the disease, yielding a very low p-value. Many people would now be tempted to conclude that the person probably has the disease. But since that disease is rare, the posterior probability of the null hypothesis can still be very large. In other words, evidence that speaks to a large degree against the null is not sufficient to support a judgment against the null. It would only do so if the null and the alternative were about equally likely at the outset. Such a failure to recognize the dependency between the base rate of the null hypothesis and the strength of the final evidential judgment is called the base rate fallacy. Although that fallacy is severe and widespread (and similar misinterpretations of significance tests abound, see Gigerenzer 2008), those fallacies might speak more against the practice of significance testing than against significance tests themselves.

13 There is no canonical class of alternatives: we could plausibly suspect that the coin has an in-built mechanism that makes it come up with alternating results, and then, 'THTHTHTHTH' would not be an insignificant finding, but speak to a high degree against the chance hypothesis.
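The disease example can be worked through with Bayes's Theorem. The sketch below is illustrative only: the text gives the test's accuracy (99.9%) but no prevalence, so the base rate of 1 in 100,000 is an assumption made here, as is the reading of '99.9% right' as both sensitivity and specificity. The function name is hypothetical.

```python
def posterior_of_null(prior_null, p_result_given_null, p_result_given_alt):
    """Posterior probability of the null hypothesis after observing a result.

    Direct application of Bayes's Theorem to two exhaustive hypotheses.
    """
    prior_alt = 1.0 - prior_null
    joint_null = prior_null * p_result_given_null
    joint_alt = prior_alt * p_result_given_alt
    return joint_null / (joint_null + joint_alt)

# Assumed base rate (not from the text): 1 person in 100,000 has the disease.
prevalence = 1e-5
# Assumed reading of "right about 99.9% of the time":
false_positive_rate = 0.001   # P(positive | no disease), i.e. under the null
sensitivity = 0.999           # P(positive | disease), i.e. under the alternative

posterior = posterior_of_null(1.0 - prevalence, false_positive_rate, sensitivity)
print(f"P(no disease | positive test) = {posterior:.4f}")
```

Under these assumed numbers the null hypothesis (no disease) retains a posterior probability of roughly 0.99 despite the positive result. With equal prior probabilities for null and alternative, the same test result would push the posterior of the null down to 0.001, which is the case in which the naive reading of the p-value happens to be harmless.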
In any case, they invite misinterpretations, especially because p-values (significance levels) are hard to relate to scientifically meaningful conclusions.14

The Replicability Fallacy. This fallacy is more subtle than the base rate fallacy. It does not interpret p-values as posterior probabilities, but understands a p-value of, say, 0.05 as saying that if the experiment were repeated, a result that was at least as significant as the present observations would occur 95% of the time. Thus, the outcome is believed to have implications for the recurrence of a significant result and for the replicability of the present observations. And replicability is, needless to say, one of the main hallmarks of good experiments. In principle, there is nothing wrong with connecting replicability to significance testing. But a crucial premise is left out – namely that the replication frequency holds only under the assumption that the null hypothesis is true. Since the power of many significance tests is low, implying that nonsignificant results often occur when the null is actually false, the kind of replicability that significance tests ensure is much narrower than desired (Schmidt and Hunter 1997). A solution to this problem that has gained more and more followers in the last decades is to replace significance levels by confidence intervals that address the issue of replicability regardless of whether the null hypothesis is actually true.

The Jeffreys-Lindley Paradox. This problem sheds light on the importance of sample size in statistical testing, and applies to both Fisher's and Neyman and Pearson's framework. For a large enough sample, a point null hypothesis can be rejected at a given significance level, while the posterior probability of the null approaches one (Lindley 1957).

14 See Casella and Berger (1987) and Berger and Sellke (1987) for more detailed discussions of the evidential value of p-values in different testing problems.
Take, for instance, a normal model N(0,1) where we test the value of the mean, H0: μ = 0, against an alternative, H1: μ = 1. Since, under H0, the sampling distribution of the mean of n observations is N(0, 1/n), any slight deviation of the mean from the null hypothesis will suffice to make the result statistically significant once n is large enough. Moreover, if we decide to sample on until we get significant results against the null hypothesis, we will finally get them (Mayo and Kruse 2001). At the same time, the posterior of the null hypothesis also converges to 1 with increasing n, as long as the divergence remains rather small. Thus, for large samples, significance levels do not reliably indicate whether or not a certain effect is present, and can grossly deviate from the hypothesis' posterior credibility. Significance tests may tell us whether there is evidence against a point null hypothesis, but they do not tell us whether that effect is large enough to be of scientific interest.

Statistical versus Practical Significance. Typically, the null hypothesis denotes an idealized hypothesis, such as "there is no difference between the effects of A and B". In practice, no one believes such a hypothesis to be literally true. Rather, everyone expects there to be differences, but perhaps just to a minute degree: "The effects of A and B are always different – in some decimal place – for some A and B. Thus asking 'Are the effects different?' is foolish." (Tukey 1991, 100) However, even experienced scientists often read tables in an article by looking out for asterisks: One asterisk denotes "significant" findings (p < 0.05), two asterisks denote "highly significant" (p < 0.01) findings.
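The large-sample behaviour described for the normal model with H0: μ = 0 and H1: μ = 1 can be checked with a short calculation. The sketch below makes assumptions that go beyond the text: both hypotheses receive prior probability 0.5, and the observed sample mean is held exactly at the two-sided 5% boundary, i.e. at z = 1.96 standard errors from 0, for every sample size n. The function name is hypothetical.

```python
import math

def posterior_h0(n, z=1.96, prior_h0=0.5):
    """Posterior of H0: mu = 0 versus H1: mu = 1 (unit variance).

    The sample mean is fixed at z standard errors above 0, so the
    result is 'just significant' at every n (an assumed scenario).
    """
    xbar = z / math.sqrt(n)
    # Log-likelihood ratio of H0 to H1 for the mean of n observations:
    # -n/2 * xbar^2 + n/2 * (xbar - 1)^2 = n * (1 - 2 * xbar) / 2
    log_lr = n * (1.0 - 2.0 * xbar) / 2.0
    log_odds = math.log(prior_h0 / (1.0 - prior_h0)) + log_lr
    return 1.0 / (1.0 + math.exp(-log_odds))

for n in (4, 16, 100, 10_000):
    print(n, round(posterior_h0(n), 4))
```

With these assumptions, a result that is significant at the 5% level leaves H0 with a posterior of about 0.13 for n = 4, but with a posterior that is practically 1 for n = 100 and beyond: the same significance level carries very different evidential weight depending on the sample size.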
It is almost impossible to resist the psychological drive to forget about the subtle differences between statistical and scientific significance, and many writers exploit that fact: "All psychologists know that statistically significant does not mean plain-English significant, but if one reads the literature, one often discovers that a finding reported in the Results sections studded with asterisks becomes in the Discussion section highly significant or very highly significant, important, big!" (Cohen 1994, 1001) Instead, statistical significance should at best mean that evidence speaks against our idealized hypothesis while we are still unable to give the direction of departure or the size of the observed effect (Kirk 1996). This provisional interpretation is in line with Fisher's own scepticism regarding the interpretation of significance tests, and with Keuzenkamp and Magnus' (1995) observation that significance testing in econometrics rarely leads to the dismissal of an economic theory, and its subsequent replacement. Finally, under the assumption that null hypotheses are, strictly speaking, false, it is noteworthy that significance tests bound the probability of erroneously rejecting the null while putting no constraints on the probability of erroneously accepting the null, which is governed by the power of the test. Considerations of power, sample size and effect size that are fundamental in Neyman and Pearson's approach fall out of the simplified Fisherian picture of significance testing. This is not to say that these tests are worthless: For instance, in econometrics, a series of significance tests can be very useful to detect whether a model of a certain process has been misspecified.
Significance tests look for departures in different directions (autocorrelation, moving average, etc.), and significant results provide us with reasons to believe that our model has been misspecified, and make us think harder about the right form of the model that we want to use in future research (Mayo and Spanos 2004; Spanos 1986). In that spirit, it should be stressed once more that Fisher considered significance tests to be a preliminary, exploratory form of statistical analysis that gives rise to further investigation, not to final decisions on a hypothesis. But reading social science journals, it is not always clear that the practicing researchers are aware of the problem. The penultimate section briefly sketches how this problem was addressed in the last decades.

7. Recent Trends

The criticisms of significance testing have led many authors to conclude that significance tests do not help to address scientifically relevant questions. Using them in spite of their inability to address the relevant questions only invites misuse and confusion (Cohen 1994; Schmidt 1996). Since the problem and its discussion were especially pronounced in experimental psychology, we focus on the reactions in that field. Recognizing that those criticisms were justified, the American Psychological Association (APA) appointed a Task Force on Statistical Inference (TFSI) whose task consisted in investigating controversial methodological issues in inferential statistics, including significance testing and its alternatives (Harlow et al. 1997; Thompson 1999a; Wilkinson et al. 1999). After long deliberation, the Task Force came up with some recommendations that made the APA change its publication guidelines, and affected major journals affiliated to the APA, such as Psychological Review.
The commission stated, for instance, that p-values do not reflect the significance or magnitude of an observed effect, and "encouraged" authors to provide information on effect size, either by means of directly reporting an effect size measure (e.g., Pearson's correlation coefficient r or Cohen's effect size measure d), or power and sample size of the test. However, as predicted by Sedlmeier and Gigerenzer (1989), and observed by a large body of empirical studies on research practice (e.g. Keselman et al. 1998), the admonitions and encouragements of the APA publication manual proved to be futile. First, psychologists were not trained in computing and working with effect sizes. Second, "there is only one force that can effect a change, namely the editors of the major journals" (Sedlmeier and Gigerenzer 1989, 315). Encouragement was likely to be ignored when compared to the compulsory requirements when submitting a manuscript and abiding by formatting guidelines: "To present an 'encouragement' in the context of strict absolute [manuscript] standards [...] is to send the message 'these myriad requirements count, this encouragement doesn't'." (Thompson 1999b, 162) However, the extensive methodological debate finally seems to bear fruit. As pointed out by Vacha-Haase et al. (2000), several editors changed their policy, requiring the inclusion of effect size measures, where unwillingness to comply with that guideline had to be justified in a special note. This development, though far from overturning and eliminating all fallacious practices, shows that sensitivity for the issue has increased, and raises hope for the future. Also, Bayesian methods (and other approaches, such as Royall's (1997) likelihoodism) gain increasing acceptance beyond purely technical journals. Such inferential methods can now, to an increasing extent, also be found in major psychology journals.
Finally, there is an increasing number of journals that address a readership that is interested in mathematical and statistical modeling in the social sciences, as well as in methodological foundations. Although the presentation and interpretation of statistical findings in the social sciences is still wanting, there is some reason for optimism: The problems have been discovered and addressed, and we are now in the phase where a change towards a more reliable methodology is about to be effectuated. As stated by Cohen (1994), this change is slowed down by the conservativeness of many scientists, and their desire for automated inferential mechanisms. But such 'cooking recipes' do, as the drawbacks of significance tests teach us, not exist.

8. Summary

Let us conclude the present chapter. In this contribution, we have surveyed and classified a variety of mathematical methods that are used in the social sciences. We have argued that such techniques, in spite of several methodological objections, can add extra value to social scientific research. Then, we have focused on methodological issues in statistics, i.e., the part of mathematics that is most frequently used in the social sciences, in particular in the design and interpretation of experiments. We have presented the emergence of and rationale behind the ubiquitous significance tests, and explained the pitfalls to which many researchers fall prey when using them. Finally, after comparing significance testing to rivaling schools of statistical inference, we have discussed recent trends in the methodology of the social sciences, argued that there is reason for optimism, and that awareness of methodological problems, as well as interest in mathematical and statistical techniques, is growing.15

15 Thanks to Paul Humphreys and Alex Rosenberg for helpful comments on an earlier draft of this chapter.

References

Anderson, P., K. Arrow, and D. Pines (eds.) (1988): The Economy As An Evolving Complex System.
Redwood City: Addison-Wesley.
Armitage, P. and G. Berry (1987): Statistical Methods in Medical Research. Second Edition. New York: Springer.
Arrow, K. et al. (eds.) (1960): Mathematical Methods in the Social Sciences. Stanford: Stanford University Press.
Backhouse, R. (1995): A History of Modern Economic Analysis. Oxford: Blackwell.
Balzer, W., C.U. Moulines and J. Sneed (1987): An Architectonic for Science: The Structuralist Program. Dordrecht: Reidel.
Balzer, W. and B. Hamminga (eds.) (1989): Philosophy of Economics. Dordrecht: Kluwer.
Beed, C. and O. Kane (1991): What Is the Critique of the Mathematization of Economics? Kyklos 44 (4): 581–612.
Berger, J.O., and T. Sellke (1987): Testing a Point Null Hypothesis: The Irreconcilability of P Values and Evidence (with Discussion). Journal of the American Statistical Association 82: 112–122.
Bermúdez, J.L. (2009): Decision Theory and Rationality. Oxford: Oxford University Press.
Brems, H. (1975): Marshall on Economics. Journal of Law and Economics 18: 583–585.
Casella, G., and R. L. Berger (1987): Reconciling Bayesian and Frequentist Evidence in the One-Sided Testing Problem. Journal of the American Statistical Association 82: 106–111.
Cohen, J. (1994): The Earth is Round (p < .05). American Psychologist 49: 997–1001.
Debreu, G. (1991): The Mathematization of Economic Theory. American Economic Review 81(1): 1–7.
Edwards, W., H. Lindman and L.J. Savage (1963): Bayesian Statistical Inference for Psychological Research. Psychological Review 70: 450–499.
Fisher, R.A. (1925): Statistical Methods for Research Workers. Edinburgh: Oliver and Boyd.
Fisher, R.A. (1935): The Design of Experiments. Edinburgh: Oliver and Boyd.
Fisher, R.A. (1956): Statistical Methods and Scientific Inference. New York: Hafner.
Fogel, R.W. (1975): The Limits of Quantitative Methods in History. American Historical Review 80(2): 329–50.
Frigg, R. and S. Hartmann: Models in Science.
In: The Stanford Encyclopedia of Philosophy (Spring 2006 Edition).
Frigg, R. and J. Reiss (2009): The Philosophy of Simulation: Hot New Issues or Same Old Stew? Synthese 169 (3): 593–613.
Gaertner, W. (2006): A Primer in Social Choice Theory. New York: Oxford University Press.
Gallegati, M., S. Keen, T. Lux, and P. Ormerod (2006): Worrying Trends in Econophysics. Physica A 370: 1–6.
Gilbert, N. and K. Troitzsch (2005): Simulation for the Social Scientist. New York: McGraw-Hill.
Gigerenzer, G. (1987): Probabilistic Thinking and the Fight against Subjectivity, in L. Krüger et al. (eds.): The Probabilistic Revolution, vol. 2: Ideas in the Sciences. Cambridge/MA: MIT Press, 11–33.
Gigerenzer, G. (1993): The Superego, the Ego, and the Id in Statistical Reasoning. In G. Keren and C. Lewis (Eds.), A Handbook for Data Analysis in the Behavioral Sciences: Methodological Issues. Hillsdale, NJ: Erlbaum, 311–339.
Gigerenzer, G. (2008): Rationality for Mortals: How Humans Cope with Uncertainty. Oxford: Oxford University Press.
Goodman, S. (1999): Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy. Annals of Internal Medicine 130: 995–1004.
Granger, C. (1999): Empirical Modelling in Economics: Specification and Evaluation. Cambridge: Cambridge University Press.
Gordon, S. (1991): The History and Philosophy of Social Science. London: Routledge.
Hacking, I. (1965): Logic of Statistical Inference. Cambridge: Cambridge University Press.
Harlow, L., S. Mulaik, and J. Steiger (eds.) (1997): What If there Were No Significance Tests? Mahwah/NJ: Erlbaum.
Hands, W. (2008): Popper and Lakatos in Economic Methodology. In: D. Hausman (ed.), The Philosophy of Economics: An Anthology. Cambridge: Cambridge University Press, 188–203.
Hartmann, S. (1996): The World as a Process: Simulations in the Natural and Social Sciences, in: R. Hegselmann et al. (eds.), Modelling and Simulation in the Social Sciences from the Philosophy of Science Point of View.
Dordrecht: Kluwer, 77–100.
Hausman, D. (1992): The Inexact and Separate Science of Economics. Cambridge: Cambridge University Press.
Hausman, D. (ed.) (2008): The Philosophy of Economics: An Anthology. Cambridge: Cambridge University Press.
Hegselmann, R. et al. (eds.) (1996): Modelling and Simulation in the Social Sciences from the Philosophy of Science Point of View. Dordrecht: Kluwer.
Howson, C., and P. Urbach (2006): Scientific Reasoning: The Bayesian Approach. Third Edition. La Salle: Open Court.
Humphreys, P. (2004): Extending Ourselves: Computational Science, Empiricism, and Scientific Method. Oxford: Oxford University Press.
Jackson, M. (2008): Social and Economic Networks. Princeton: Princeton University Press.
Keselman, H.J., et al. (1998): Statistical Practices of Educational Researchers: An Analysis of their ANOVA, MANOVA and ANCOVA Analyses. Review of Educational Research 68: 350–386.
Keuzenkamp, H., and J. Magnus (1995): On Tests and Significance in Econometrics. Journal of Econometrics 67: 5–24.
Kirk, R. (1996): Practical Significance: A Concept whose Time Has Come. Educational and Psychological Measurement 56: 746–759.
Kolmogorov, A. N. (1933/56): Foundations of the Theory of Probability. New York: Chelsea. Original work published in German in 1933.
Krantz, D.H., R.D. Luce, P. Suppes, and A. Tversky (1971): Foundations of Measurement. Vol. I. Additive and Polynomial Representations. New York: Academic Press.
Krüger, L., G. Gigerenzer, and M. Morgan (eds.) (1987): The Probabilistic Revolution, Vol. 2: Ideas in the Sciences. Cambridge/MA: MIT Press.
Lindley, D. (1957): A Statistical Paradox. Biometrika 44: 187–192.
List, C. and C. Puppe (2009): Judgment Aggregation: A Survey. In: P. Anand, C. Puppe and P. Pattanaik (eds.), Oxford Handbook of Rational and Social Choice. Oxford: Oxford University Press, 457–482.
Luce, R.D. and P. Suppes (1968): Mathematics. In: International Encyclopedia of the Social Sciences, Vol. 10.
New York: Macmillan and Free Press, 65–76.
Luce, R.D., D.H. Krantz, P. Suppes, and A. Tversky (1990): Foundations of Measurement. Vol. III. Representation, Axiomatization, and Invariance. New York: Academic Press.
Mantegna, R. and H. Stanley (1999): An Introduction to Econophysics: Correlations and Complexity in Finance. Cambridge: Cambridge University Press.
Marchi, S. de (2005): Computational and Mathematical Modeling in the Social Sciences. Cambridge: Cambridge University Press.
Mayo, D. (1996): Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press.
Mayo, D. and M. Kruse (2001): Principles of Inference and Their Consequences. In: D. Corfield and J. Williamson (eds.), Foundations of Bayesianism. Dordrecht: Kluwer, 381–403.
Mayo, D. and A. Spanos (2004): Methodology in Practice: Statistical Misspecification Testing. Philosophy of Science 71: 1007–1025.
McCauley, J. (2004): Dynamics of Markets: Econophysics and Finance. Cambridge: Cambridge University Press.
McCloskey, D. (2000): Other Things Equal: How To Be Scientific in Economics. Eastern Economic Journal 26: 241–246.
Merton, R., D. Sills, and S. Stigler (1984): The Kelvin Dictum and Social Science: An Excursion into the History of an Idea. Journal of the History of the Behavioral Sciences 20: 319–331.
Mirowski, P. (1990): More Heat than Light: Economics as Social Physics, Physics as Nature's Economics. Cambridge: Cambridge University Press.
Mirowski, P. (2001): Machine Dreams: Economics Becomes a Cyborg Science. Cambridge: Cambridge University Press.
Morgan, M. (1987): Statistics without Probability and Haavelmo's Revolution in Econometrics. In: L. Krüger et al. (eds.), The Probabilistic Revolution, Vol. 2: Ideas in the Sciences. Cambridge/MA: MIT Press, 171–197.
Morgan, M. (2002): The History of Econometric Ideas. Cambridge: Cambridge University Press.
Morrow, J. (1994): Game Theory for Political Scientists. Princeton: Princeton University Press.
Neyman, J. and E.
Pearson (1933): On the Problem of the Most Efficient Tests of Statistical Hypotheses. Philosophical Transactions of the Royal Society A 231: 289–337.
Neyman, J. and E. Pearson (1967): Joint Statistical Papers. Cambridge: Cambridge University Press.
Oakes, M. (1986): Statistical Inference. A Commentary for the Social and Behavioral Sciences. New York: Wiley.
Piotrowski, E. and J. Sladkowski (2003): An Invitation to Quantum Game Theory. International Journal of Theoretical Physics 42: 1089–1099.
Popper, K. R. (1959): The Logic of Scientific Discovery. London: Routledge.
Rosenberg, A. (1976): Microeconomic Laws: A Philosophical Analysis. Pittsburgh: University of Pittsburgh Press.
Rosenberg, A. (1992): Economics – Mathematical Politics or Science of Diminishing Returns. Chicago: University of Chicago Press.
Royall, R. (1997): Statistical Evidence – A Likelihood Paradigm. London: Chapman & Hall.
Rozeboom, W.W. (1960): The Fallacy of the Null Hypothesis Significance Test. Psychological Bulletin 57: 416–428.
Samuelson, P. (1952): Economic Theory and Mathematics – An Appraisal. American Economic Review 42: 56–69.
Schmidt, F.L. (1996): Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers. Psychological Methods 1: 115–129.
Schmidt, F.L., and J.E. Hunter (1997): Eight Common but False Objections to the Discontinuation of Significance Testing in the Analysis of Research Data. In: Harlow et al. (eds.), What If There Were No Significance Tests?, 37–64.
Schultz, H. (1928): Statistical Laws of Demand and Supply with Special Application to Sugar. Chicago: University of Chicago Press.
Sedlmeier, P. and G. Gigerenzer (1989): Do Studies of Statistical Power Have an Effect on the Power of Studies? Psychological Bulletin 105: 309–316.
Sen, A. (1999): The Possibility of Social Choice. American Economic Review 89(3): 349–378.
Senn, P. (1971): Social Science and Its Methods. Boston: Holbrook Press.
Shapiro, S.
(2000): Thinking about Mathematics: The Philosophy of Mathematics. Oxford: Oxford University Press.
Simon, H. (1996): Models of My Life. Cambridge: MIT Press.
Spanos, A. (1986): Statistical Foundations of Econometric Modelling. Cambridge: Cambridge University Press.
Sugden, R. (2008): Credible Worlds: The Status of Theoretical Models in Economics. In: D. Hausman (ed.), The Philosophy of Economics: An Anthology. Cambridge: Cambridge University Press, 476–509.
Suppes, P. (1967): What is a Scientific Theory? In: S. Morgenbesser (ed.), Philosophy of Science Today. New York: Basic Books, 55–67.
Suppes, P., D.H. Krantz, R.D. Luce, and A. Tversky (1989): Foundations of Measurement. Vol. II. Geometrical, Threshold and Probabilistic Representations. New York: Academic Press.
Suppes, P. (2001): Representation and Invariance of Scientific Structures. Chicago: University of Chicago Press.
Sutter, D. and R. Pjesky (2007): Where Would Adam Smith Publish Today? The Near Absence of Math-free Research in Top Journals. Econ Journal Watch 4: 230–240.
Thompson, B. (1999a): If Statistical Significance Tests Are Broken/Misused, What Practice Should Supplement or Replace Them? Theory & Psychology 9: 167–183.
Thompson, B. (1999b): Journal Editorial Policies Regarding Statistical Significance Tests: Heat Is to Fire as P Is to Importance. Educational Psychology Review 11: 151–169.
Tukey, J.W. (1991): The Philosophy of Multiple Comparisons. Statistical Science 6: 100–116.
Tversky, A. and D. Kahneman (1983): Extensional versus Intuitive Reasoning: The Conjunction Fallacy in Probability Judgment. Psychological Review 90: 293–315.
Vacha-Haase, T. et al. (2000): Reporting Practices and APA Editorial Policies Regarding Statistical Significance and Effect Size. Theory & Psychology 10: 413–425.
Vazquez, A. (1995): Marshall and the Mathematization of Economics. Journal of the History of Economic Thought 17: 247–265.
Voit, J. (2000): The Statistical Mechanics of Financial Markets.
Berlin: Springer.
Waldrop, M. (1992): Complexity: The Emerging Science at the Edge of Order and Chaos. New York: Simon & Schuster.
Wallace, D. (2010): A Formal Proof of the Born Rule from Decision-Theoretic Assumptions. To appear in: S. Saunders et al. (eds.), Many Worlds? Everett, Quantum Theory, and Reality. Oxford: Oxford University Press.
Weidlich, W. (2006): Sociodynamics: A Systemic Approach to Mathematical Modelling in the Social Sciences. New York: Dover Publications.
Weintraub, R. (2002): How Economics Became a Mathematical Science. Durham: Duke University Press.
Weintraub, R. (2008): Mathematics and Economics. In: S. Durlauf and L. Blume (eds.), The New Palgrave Dictionary of Economics. New York: Macmillan.
Wigner, E.P. (1967): Symmetries and Reflections. Bloomington: University of Indiana Press.
Wilkinson, L., and Task Force on Statistical Inference (1999): Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist 54: 594–604.