1 Introductory comments

Probability theory featured prominently in Reichenbach’s thinking and work throughout his whole career, both as a conceptual tool and as a subject of philosophical investigation: Already in his doctoral thesis (1915), published in English translation in 2008 (Reichenbach (1915); see Padovani (2011) for a compact review of this translation), probability takes center stage in the form of a “principle of lawful distribution”. This principle states, roughly, that the empirical relative frequencies of occurrences of events converge to a true limit, to be understood as probability. The principle has a transcendental status in Reichenbach’s theory of knowledge, similar to that of the principle of causality in Kant’s epistemology. The principle is transcendental because it is not empirically testable – rather, it forms the basis of empirical science (Glymour and Eberhardt (2016)).

In his subsequent works, Reichenbach both applied probability theory in the analysis of specific philosophical problems and investigated the foundations of probability theory itself. An example of the former is Reichenbach’s concept of common cause used in his The Direction of Time Reichenbach (1956). This notion has had a lasting impact on the analysis of causality and has been the conceptual predecessor of the Causal Markov Condition in the modern theory of Bayes nets (see Hitchcock and Rédei (2020) for a review, and Hofer-Szabó et al. (2013), Wronski (2014) for detailed analyses of Reichenbach’s notion and the related principle of the common cause).

Reichenbach’s foundational work on probability theory culminated in the substantial, almost 500-page monograph published in 1949, Reichenbach (1949), which is a re-worked version of the one published in German in 1935, Reichenbach (1935). In his 1949 work, Reichenbach both gave a formal axiomatization of probability theory and attempted to provide a foundation for it in the sense of the frequency view of probabilities. Both ideas rely on his earlier work; in particular, the axiomatization in Reichenbach (1949) is based on the paper published in 1932, Reichenbach (1932).

The general assessment (Glymour and Eberhardt (2016), Eberhardt and Glymour (2009)) of Reichenbach’s axiomatization is that it now has only historical significance because Kolmogorov’s axiomatization, published in 1933 (Kolmogorov (1956)), overshadowed it and became mainstream. This verdict is based in Eberhardt and Glymour (2009) on a detailed critical analysis of Reichenbach’s axiomatization and of his related concept of probability logic. We agree with this assessment. In Sect. 2 we recall some general features of Reichenbach’s axiomatization and provide some further critical comments. On the positive side, in Sect. 2 we also show that Reichenbach’s analysis contains an important idea that in principle opens up the road to an axiomatization in the sense of Kolmogorov: the idea of regarding as mathematical probability theory what is isomorphic in different interpretations. But this avenue remains unexplored in Reichenbach’s work, which we claim is mainly due to ambiguities in the Reichenbachian axiomatic system. In Sects. 3 and 4 we explore the role of isomorphism from the perspective of the foundations of probability theory. While it is not clear what the notion of isomorphism in the Reichenbachian axiomatization would be, there are very natural notions of isomorphism in the Kolmogorovian axiomatization. Kolmogorov himself did not make use of them in his foundational book, but they became standard. In Sect. 3 we recall these notions of isomorphism and formulate what we call the Maxim of Probabilism: the idea that a concept, reasoning, or argument is probabilistic only if it is invariant with respect to the isomorphisms of the mathematical structures that are models of the axioms. In Sects. 4 and 5 we illustrate the usefulness of the Maxim of Probabilism by using it to clarify some neuralgic points in connection with conditioning on probability zero events, in particular in the context of the Borel–Kolmogorov Paradox.

2 Comments on Reichenbach’s axiomatization of probability theory

Reichenbach distinguished three approaches to axiomatization of probability theory: One that aims at an

[\(\ldots \)] interpreted form of axiomatic construction [\(\ldots \)] which regards probability, from the very beginning, as a frequency, and derives from this interpretation, by the possible inclusion of additional postulates, the rules of the theory. Reichenbach (1949)[p. 121]

The second

[\(\ldots \)] formal conception introduces the concept of probability by the method of implicit definitions, and uses no properties of the concept other than those expressed in a set of formal relations placed as axioms in the beginning of the theory, leaving open various possibilities for its interpretation. Reichenbach (1949)[p. 121]

The third approach

[\(\ldots \)] connects the treatment of probability with the methods of symbolic logic. [\(\ldots \)] constructing probability as a relation between statements, which includes logical implication as a special case. Reichenbach (1949)[p. 122]

Reichenbach classifies both Kolmogorov’s approach and his own 1932 axiomatization as belonging to the second group. While this is certainly true of the Kolmogorov approach, it is not unambiguously true of the 1932 axiomatization: although a formal axiomatization in the sense of the second approach is given there, Reichenbach also introduces a “coordinating definition \(\alpha \)” [p. 591] that relates the formal probabilistic formulas to limits of relative frequencies. On this basis he distinguishes two notions of mathematical probability:

We call the resulting notion of probability, i.e. the concept that is determined by the axiom system including the coordinating definition \(\alpha \), the contentual mathematical concept of probability, in contrast to the formal mathematical concept of probability determined exclusively by the axioms, i.e. without assigning content to it. Reichenbach (1932)[p. 592] (emphasis in original)

Thus, although a purely formal axiomatization describing the formal mathematical concept of probability is indeed part of Reichenbach’s treatment, his analysis is coupled to a frequency view even when it comes to the mathematical specification of the concept. In the 1932 axiomatization there is thus no sharp separation, within the mathematical sphere, of the concept of probability from the frequency view – in contrast to what Reichenbach claims about his own axiomatization. This ambiguity, we claim below, prevents Reichenbach from developing an idea that potentially leads to an axiomatization based on measure theory.

Furthermore, in Reichenbach (1932), Reichenbach sees the need for a further axiom to be added to the mathematical axioms: the Axiom of Induction (Reichenbach (1932)[p. 614]). This axiom is precisely the principle of lawful distribution that appeared in his 1915 dissertation – as Reichenbach explicitly acknowledges, citing his PhD dissertation (Reichenbach (1932)[p. 614]; see especially footnote 24 in Reichenbach (1949)). This axiom has a status that is conceptually different from that of the axioms specifying mathematical probability because it is not part of mathematics: it postulates the applicability of probability theory in the sense of the frequency view.

In coupling probability theory to the frequency interpretation, Reichenbach follows a well-established tradition, of which he is fully aware:

In order to develop the frequency interpretation, we define probability as the limit of a frequency within an infinite sequence. The definition follows a path that was pointed out by S.D. Poisson in 1837. In 1854 it was used by George Boole, and in recent times it was brought to fore by Richard von Mises, who defended it successfully against critical objections. Reichenbach (1949)[p. 68] (emphasis in original)

The key difference between Reichenbach’s frequentism and von Mises’ concept of probability as limit of relative frequency is that Reichenbach abandons von Mises’ requirement of randomness of the infinite sequence in which relative frequencies are supposed to be calculated. For von Mises it is not enough that the limits of frequencies in the infinite sequence exist: The infinite sequence, the “ensemble” (von Mises calls it “Kollektiv”), must also be disorderly, “random” (Mises (1919), Mises (1928)[p. 23]). Von Mises specified the content of randomness of an ensemble by requiring invariance of the limits of relative frequencies in the ensemble with respect to place selections: selecting an infinite sub-ensemble of the original ensemble by a rule, the limits of relative frequencies in the sub-ensemble should be equal to the limits of the relative frequencies in the original ensemble (Mises (1928)[p. 23]). According to von Mises, this invariance should hold for any place selection determined by a rule that does not involve the random event whose frequency one calculates. Given this concept of randomness, the problem of its consistency arises: do random ensembles exist at all? Mises (1928)[pp. 88–89] recalls the reasoning that consistency cannot be proved in the strict sense of mathematical proof: an infinite sequence can only be specified by a mathematical rule, which can in principle be used to select a sub-ensemble in which the frequencies differ from those in the original sequence. But he rejects the position that one should restrict the class of place selections to a class for which consistency of the corresponding restricted randomness concept is provable, saying that for any conceivable sub-class “... it will be possible to indicate place selections” that are not in the class, and, consequently

It is not possible to build a theory of probability on the assumption that the limiting values of the relative frequencies should remain unchanged only for a certain group of place selections, predetermined once and for all. Mises (1928)[p. 90]
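Von Mises’ invariance requirement can be illustrated with a minimal numerical sketch (ours, not from the text): a strictly periodic 0–1 sequence has a well-defined limiting relative frequency, yet a simple place selection changes that frequency, so the sequence fails von Mises’ randomness condition.

```python
# A periodic 0-1 sequence has limiting relative frequency 1/2, but it is not
# "random" in von Mises' sense: the place selection "take every second term"
# changes the limiting frequency, violating invariance under place selections.
# Illustrative finite approximation of the infinite-sequence concepts.
N = 100_000
seq = [i % 2 for i in range(N)]     # 0, 1, 0, 1, ...

def rel_freq(s):
    return sum(s) / len(s)

full = rel_freq(seq)                # relative frequency of 1 in the sequence
selected = rel_freq(seq[1::2])      # place selection: every second term (all 1s)
```

Here the selected sub-ensemble has relative frequency 1, not 1/2, which is exactly why von Mises would deny that the periodic sequence is a Kollektiv, while Reichenbach’s theory admits it as a probability sequence of a highly ordered type.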

Ultimately, in the chapter “Consistency of the randomness axiom” in Mises (1928), von Mises seems to be content with what one could call “pragmatic consistency” of randomness, and which is based on results (due to Copeland and Wald) stating that for any countable set of place selections there exist random ensembles:

[\(\ldots \)] from what we know so far, it is certain that the probability calculus, founded on the notion of the collective, will not lead to logical inconsistencies, in any application of the theory known today. Mises (1928)[p. 91] (our emphasis)

In \(\S \)31 of Reichenbach (1949) Reichenbach reviews the main results on the consistency of randomness (especially the works by Copeland, Wald and Ville), but draws a conclusion from them that is different from von Mises’ pragmatic consistency:

The significance of the problem of the definition of random sequence should not be overestimated, however. Within the general calculus of probability, random sequences merely represent a special type [...] In actual applications, all kinds of probability sequences are encountered. Some show the features of randomness; others represent intermediate types between strictly ordered and random sequences. [...] It would constitute a rather narrow conception of probability if the name of probability sequences would be reserved for random sequences. Reichenbach (1949)[pp. 150–151]

Reichenbach intends his theory to be flexible enough to accommodate such “intermediate types” of infinite sequences, i.e. sequences which embody different degrees of order, not just the random ones:

An essential feature of my theory of order is that it deals with all possible forms of probability sequences and is not restricted to sequences of one type of order [...]. In this respect my probability theory differs from others—in particular from that developed by R. von Mises. Such theories regard randomness as an essential characteristic of the very concept of probability; and they contend that the meaning of probability cannot be exhaustively formulated without reference to randomness. Reichenbach (1949)[p. 132]

Concerning the classification of his 1949 attempt, Reichenbach writes: “My own presentation undertakes to unite the axiomatic method with the construction of logico-mathematical calculus\(\ldots \)” Reichenbach (1949)[p. 122]. Indeed, Reichenbach’s axiomatization is a mixture of formal axiomatization in the sense of symbolic logic and informal axiomatization in the sense of semi-formal mathematics – as axiomatization is done, e.g., when groups are defined by the group axioms. The problem is that in Reichenbach’s treatment the syntax is not completely specified and no formal semantics is given (Eberhardt and Glymour (2009)[pp. 371–373]); and, viewed as axiomatization in the semi-formal sense of mathematics,

[\(\ldots \)] these axioms are not sufficient to provide an axiomatization of probability, since they do not ensure that the space the probabilities are applied to is closed under complementation and countable union, i.e. that it forms a sigma-field. Eberhardt and Glymour (2009)[p. 371].

Hence, because of the lack of an explicit semantics clearly separated from the syntax, for a logician Reichenbach’s axiomatization was too much informal mathematics; for a practicing mathematician, the formal logic involved in the axiomatization separated it too much from mainstream mathematics to be useful; and for a physicist interested in applying probability theory, it was too much logic, mathematics and philosophy altogether. For philosophers the axiomatization offered a target for philosophical criticism, which it did indeed receive (see section 5 in Eberhardt and Glymour’s paper Eberhardt and Glymour (2009) for a review of the main philosophical criticisms, and Peijnenburg and Atkinson (2011) for a defense of Reichenbach against a specific objection raised by C.I. Lewis).

But at a certain point Reichenbach comes very close to the idea of identifying mathematical probability theory with measure theory in the spirit of Kolmogorov: In Chapter 6 of Reichenbach (1949), Reichenbach discusses an “admissible interpretation” Reichenbach (1949)[p. 203] of the purely mathematical part of the axioms that is different from the frequency view: the geometrical interpretation. Reichenbach demonstrates in this chapter that (with one exception – see below) his axioms are satisfied by subsets of the two-dimensional plane with probability identified with the normalized area measure. This idea is present already in the 1932 paper Reichenbach (1932)[\(\S \) 5], but it is more systematically developed and more explicitly stated in Reichenbach (1949):

The possibility of a geometrical representation of probabilities results from the considerations given in \(\S \) 40. By showing that both the frequency interpretation and the geometrical interpretation satisfy the axioms of the formal system of probability, that is, are interpretations of this system, we have demonstrated the isomorphism, or structural identity, of the two interpretations. Every operation carried out in terms of probability formulas entails analogous operations in the frequency interpretation and the geometrical interpretation. Any derived probability relation is, therefore symbolized in the geometrical interpretation by those geometrical relations that have been specified above for the geometrical interpretation of the probability concept. Reichenbach (1949)[pp. 207–208] (emphasis in original)

Reichenbach even sees that the isomorphism holds if the two dimensional plane is replaced by a higher dimensional Euclidean space with its Lebesgue measure: “The foregoing considerations can easily be generalized for an attribute space of more than two dimensions.” Reichenbach (1949)[p. 208]

So in his 1949 monograph Reichenbach is just one intellectual step away from saying that what is common in the frequency and geometrical interpretations is the measure-theoretic structure and that the axioms should express exactly this. But this step is not taken, and we see several reasons for this. One is that, as Reichenbach himself emphasizes (Reichenbach (1949)[p. 205]), one group of axioms (called “group v”, the “Axioms of the theory of order” Reichenbach (1949)[p. 137]) is not satisfied by the two-dimensional Lebesgue measure. This group is precisely the one that connects probability to the frequency view: “\(\ldots \) the axioms v, like the previous axioms, are valid for all probability sequences, for they could be derived from the frequency interpretation” Reichenbach (1949)[p. 139]. The two axioms that form this group express that if probabilities are limits of relative frequencies in infinite sequences then the sequences possess randomness in a limited sense of place selection. Reichenbach is aware that the presence of these axioms distinguishes his axiomatization from those – including Kolmogorov’s – that “... omit the development of the theory of the order of the probability sequences.” Reichenbach (1949)[p. 121] Reichenbach clearly regards this as a virtue of his axiomatization. But the not entirely sharp conceptual separation of the frequency interpretation from the purely formal axiomatization becomes an obstacle to drawing the consequences of the isomorphism he recognized.

Another difficulty standing in the way of drawing the consequences of the described isomorphism concerns the unavoidable measure (hence probability) zero sets in the geometrical interpretation: Reichenbach thinks that handling these would need an additional axiom in probability theory (Reichenbach (1949)[p. 207]).

More generally, taking the step of isolating measure theory as the isomorphism-invariant structure presupposes being aware of the development of abstract measure theory, especially of the possibility of moving from the theory of Lebesgue’s measure towards abstract measures. Kolmogorov explicitly mentions this in the introduction of his book as a prerequisite for the conceptual move:

[\(\ldots \)] if probability theory was to be based on the above analogies [involving Lebesgue measure and integral] it still was necessary to make the theories of measure and integration independent of the geometric elements which were in the foreground with Lebesgue. Kolmogorov (1956)[p. v]

Doob (1996) mentions the following crucial steps in the creation of abstract measure theory that were needed for the Kolmogorovian axiomatization:

  • Lebesgue’s extension of volume in \(\mathrm{I\!R}^n\) to the Borel sets in \(\mathrm{I\!R}^n\) (1902).

  • Radon’s definition of a general measure on the Borel sets in \(\mathrm{I\!R}^n\) (1913).

  • Fréchet’s realization that one needs only a \(\sigma \)-algebra of subsets of a set for a meaningful measure theory with a \(\sigma \)-additive measure (1915).

It is perhaps understandable that a mathematician like Kolmogorov was more familiar with these developments in measure theory than the physicist-philosopher Reichenbach. Thus the full ramifications of the isomorphism Reichenbach saw remained unexplored by him. But the idea of tying what counts as probability theory to what is isomorphism-invariant is a deep one. The next section makes this idea explicit in the context of the Kolmogorovian axiomatization.

3 Isomorphism of probability measure spaces and the Maxim of Probabilism

In the Kolmogorovian specification, mathematical probability theory is a probability measure space \((X,{{\mathcal {S}}},p)\), where \({{\mathcal {S}}}\) is a Boolean \(\sigma \)-algebra of subsets of the set X (with respect to the standard set theoretical operations \(\cap ,\cup \) and complement \(A^{\bot }\)), and the probability p is a countably additive map from \({{\mathcal {S}}}\) into [0, 1] with \(p(X)=1\). Accepting this measure-theoretic specification of probability theory leads naturally to both the notion of isomorphism of probability measure spaces and the methodological ramification we call below the Maxim of Probabilism.
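The Kolmogorovian triplet can be made concrete on a small finite example. The following sketch (ours, with illustrative names such as `weights` and `powerset`) builds a four-point space whose \(\sigma \)-algebra is the full power set and whose measure is given by point masses; the Kolmogorov requirements (normalization, non-negativity, additivity) can then be checked directly.

```python
from itertools import combinations

# A toy four-point probability space (X, S, p): S is the full power set of X
# (trivially a Boolean sigma-algebra on a finite set), and p(A) is the sum of
# the point masses of the elements of A.
X = frozenset({1, 2, 3, 4})
weights = {1: 0.1, 2: 0.2, 3: 0.3, 4: 0.4}

def powerset(points):
    pts = list(points)
    return [frozenset(c) for r in range(len(pts) + 1)
            for c in combinations(pts, r)]

S = powerset(X)

def p(A):
    # The measure of an event is the sum of its point masses.
    return sum(weights[x] for x in A)
```

On a finite space countable additivity reduces to finite additivity, so this toy model satisfies all the Kolmogorov axioms.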

Since a probability measure space consists of three components, the set of elementary events, a Boolean algebra of general events and a probability measure, the notion of isomorphism is supposed to respect all three components. Moreover, when it comes to defining isomorphism of probability measure spaces, the possible presence of probability zero events has to be taken into account; accordingly, there are two (inequivalent) notions of isomorphism: (i) strict isomorphism (also called point isomorphism) and (ii) isomorphism up to probability zero (also called isomorphism mod0, Bogachev (2007); the terminology almost isomorphism is also used). We define first the isomorphism of measurable spaces:

Definition 3.1

Given two measurable spaces \((X,{{\mathcal {S}}})\) and \((Y,{{\mathcal {Z}}})\), a bijection \(f:X\rightarrow Y\) is called an isomorphism between \((X,{{\mathcal {S}}})\) and \((Y,{{\mathcal {Z}}})\), if both f and its inverse \(f^{-1}\) are measurable (establishing a Boolean-algebra isomorphism between \({{\mathcal {S}}}\) and \({{\mathcal {Z}}}\)). In this case \((X,{{\mathcal {S}}})\) and \((Y,{{\mathcal {Z}}})\) are called isomorphic via f.

Definition 3.2

Two probability measure spaces \((X,{{\mathcal {S}}},p)\) and \((Y,{{\mathcal {Z}}},q)\) are called measure-theoretically strictly isomorphic (Bogachev (2007)[p. 275]), if the measurable spaces \((X,{{\mathcal {S}}})\) and \((Y,{{\mathcal {Z}}})\) are isomorphic via some bijection \(f:X\rightarrow Y\), and the isomorphism between the Boolean \(\sigma \)-algebras \({{\mathcal {S}}}\) and \({{\mathcal {Z}}}\) determined by f preserves the probability measures p and q:

$$\begin{aligned} q(B) = p(f^{-1}[B])\qquad \text{ for } \text{ all } \quad B\in {{\mathcal {Z}}}\end{aligned}$$
(1)
$$\begin{aligned} p(A) = q(f[A]) \qquad \text{ for } \text{ all } \quad A\in {{\mathcal {S}}}\end{aligned}$$
(2)
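Conditions (1)–(2) can be verified mechanically on finite spaces. The following sketch (ours; the names `pX`, `qY`, `f` are illustrative) checks them for two three-point spaces with full power sets as \(\sigma \)-algebras, where the bijection is a mere relabeling of the points.

```python
from itertools import combinations

# Two finite probability spaces whose sigma-algebras are the full power sets;
# f is a relabeling bijection X -> Y that matches the point masses.
pX = {'a': 0.5, 'b': 0.3, 'c': 0.2}
qY = {1: 0.5, 2: 0.3, 3: 0.2}
f = {'a': 1, 'b': 2, 'c': 3}
f_inv = {v: k for k, v in f.items()}

def p(A): return sum(pX[x] for x in A)
def q(B): return sum(qY[y] for y in B)

def powerset(points):
    pts = list(points)
    return [frozenset(c) for r in range(len(pts) + 1)
            for c in combinations(pts, r)]

# Condition (1): q(B) = p(f^{-1}[B]) for every event B in Z.
condition_1 = all(abs(q(B) - p({f_inv[y] for y in B})) < 1e-12
                  for B in powerset(qY))
# Condition (2): p(A) = q(f[A]) for every event A in S.
condition_2 = all(abs(p(A) - q({f[x] for x in A})) < 1e-12
                  for A in powerset(pX))
```

Since f is a bijection, each condition in fact implies the other; checking both merely mirrors the symmetric formulation of the definition.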

For the definition of isomorphism mod0 of probability spaces, we need the following simple notion of reduction of probability spaces \((X,{{\mathcal {S}}},p)\): Let \(M\in {{\mathcal {S}}}\) be such that \(p(M)=1\). Let \({{\mathcal {S}}}_M\) be defined by

$$\begin{aligned} {{\mathcal {S}}}_M\doteq \{A\cap M : A\in {{\mathcal {S}}}\} \end{aligned}$$
(3)

then \({{\mathcal {S}}}_M\) is a Boolean \(\sigma \)-algebra, and restricting p to \({{\mathcal {S}}}_M\) yields a probability measure \(p_M\):

$$\begin{aligned} p_M(A\cap M)\doteq p(A\cap M) \qquad A\in {{\mathcal {S}}}\end{aligned}$$
(4)

\((M,{{\mathcal {S}}}_M,p_M)\) is then a probability measure space.
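On a finite space the reduction of Eqs. (3)–(4) can be carried out explicitly. In the sketch below (ours; the point `x0` and the other names are illustrative) one point carries probability zero, so removing it yields a full-measure set M and a reduced space that loses nothing of probabilistic relevance.

```python
from itertools import combinations

# Reduction (M, S_M, p_M) of a finite space (X, S, p): the point x0 has
# probability zero, hence M = {x1, x2} satisfies p(M) = 1.
weights = {'x0': 0.0, 'x1': 0.25, 'x2': 0.75}
X = frozenset(weights)
M = frozenset({'x1', 'x2'})

def p(A): return sum(weights[x] for x in A)

def powerset(points):
    pts = list(points)
    return [frozenset(c) for r in range(len(pts) + 1)
            for c in combinations(pts, r)]

S = powerset(X)
S_M = {A & M for A in S}       # Eq. (3): traces of events on M

def p_M(AM): return p(AM)      # Eq. (4): p_M agrees with p on the traces
```

Note that here \(p_M(A\cap M)=p(A)\) for every \(A\in {{\mathcal {S}}}\), precisely because the discarded complement of M has probability zero.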

Definition 3.3

Two probability measure spaces \((X,{{\mathcal {S}}},p)\) and \((Y,{{\mathcal {Z}}},q)\) are called measure-theoretically isomorphic mod0, if there are sets \(M\in {{\mathcal {S}}}\) and \(N\in {{\mathcal {Z}}}\) with \(p(M)=q(N)=1\) such that \((M,{{\mathcal {S}}}_M,p_M)\) and \((N,{{\mathcal {Z}}}_N,q_N)\) are strictly isomorphic. We call the strict isomorphism between \((M,{{\mathcal {S}}}_M,p_M)\) and \((N,{{\mathcal {Z}}}_N,q_N)\) a mod0 isomorphism between \((X,{{\mathcal {S}}},p)\) and \((Y,{{\mathcal {Z}}},q)\).
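The inequivalence of the two notions of isomorphism is visible already on finite spaces. The sketch below (ours, with illustrative names) exhibits two spaces that cannot be strictly isomorphic, because no bijection between their underlying sets exists, yet are isomorphic mod0, since the surplus point carries probability zero.

```python
# X has 3 points, Y has 2, so no bijection X -> Y exists and the spaces are
# not strictly isomorphic; but the extra point of X has probability zero, so
# the reduced full-measure spaces are strictly isomorphic: the spaces are
# isomorphic mod0 in the sense of Definition 3.3.
pX = {'a': 0.6, 'b': 0.4, 'null': 0.0}
qY = {'A': 0.6, 'B': 0.4}

M = {'a', 'b'}                 # p(M) = 1
N = {'A', 'B'}                 # q(N) = 1
f = {'a': 'A', 'b': 'B'}       # strict isomorphism between the reduced spaces

def p(E): return sum(pX[x] for x in E)
def q(E): return sum(qY[y] for y in E)

no_strict_iso = len(pX) != len(qY)   # cardinalities differ: no bijection
mod0_iso = (abs(p(M) - 1) < 1e-12 and abs(q(N) - 1) < 1e-12
            and all(abs(p({x}) - q({f[x]})) < 1e-12 for x in M))
```

On finite spaces matching all point masses on M and N suffices, since every event is a union of singletons.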

Embracing the Kolmogorovian specification of probability theory as a triplet \((X,{{\mathcal {S}}},p)\) leads naturally to what we call the Maxim of Probabilism: a concept/claim/property/reasoning/argument is probabilistic only if it is invariant with respect to measure-theoretic isomorphisms between probability measure spaces.

To be more precise, one can distinguish two senses of the adjective “probabilistic” in connection with the Maxim of Probabilism, a weak and a strong one, depending on the notion of isomorphism involved: A concept/claim/property/reasoning/argument is

  • weakly probabilistic only if it is invariant with respect to strict measure-theoretic isomorphisms between probability measure spaces;

  • strongly probabilistic only if it is invariant with respect to mod0 isomorphisms.

The Maxim of Probabilism provides a necessary condition for what counts as probabilistic. In applications of probability theory the set of elementary random events is frequently modeled by a set X on which structures are defined in addition to the \(\sigma \)-field \({{\mathcal {S}}}\) (for instance: metric, topological, or order structures). As a consequence, reasonings in the context of \((X,{{\mathcal {S}}},p)\) might involve features of these – from a measure-theoretic viewpoint “surplus” – structures, and thus probabilistic reasonings get intertwined with considerations that are not in fact probabilistic. This melding of probabilistic and non-probabilistic elements is potentially misleading because it might obscure where precisely the probabilistic content lies. This can, in turn, lead to misguided questions and puzzles. The Maxim of Probabilism can in such situations be used to disambiguate the probabilistic and non-probabilistic components of reasonings and concepts: the Maxim tells us that an argument or a concept formulated in the context of a probability measure space \((X,{{\mathcal {S}}},p)\) cannot be regarded even as weakly probabilistic if it cannot also be formulated in every probability measure space to which \((X,{{\mathcal {S}}},p)\) is related by a strict isomorphism.

We will illustrate the usefulness of the Maxim of Probabilism in the next two sections on the example of clarifying certain conceptual perplexities concerning conditioning on probability zero events; that illustration involves a violation of the Maxim of Probabilism. Below we give two examples of concepts that are invariant with respect to isomorphisms; these examples illustrate how concepts can satisfy the Maxim of Probabilism. The two notions are the correlation function and pure measure-theoretic non-atomicity of probability spaces. In both examples it is assumed that \((X,{{\mathcal {S}}},p)\) and \((Y,{{\mathcal {Z}}},q)\) are probability spaces and f is a mod0 isomorphism between \((X,{{\mathcal {S}}},p)\) and \((Y,{{\mathcal {Z}}},q)\), i.e. f is a strict isomorphism between \((M,{{\mathcal {S}}}_M,p_M)\) and \((N,{{\mathcal {Z}}}_N,q_N)\).

Example: correlation function Each probability space \((X,{{\mathcal {S}}},p)\) determines a real-valued function \({\text {Corr}}_{(X,{{\mathcal {S}}},p)}:{{\mathcal {S}}}\times {{\mathcal {S}}}\rightarrow \mathbb {R}\) defined by

$$\begin{aligned} {\text {Corr}}_{(X,{{\mathcal {S}}},p)}(A, B) \doteq p(A\cap B)-p(A)\cdot p(B) \qquad A,B\in {{\mathcal {S}}}\end{aligned}$$

So we also have

$$\begin{aligned} {\text {Corr}}_{(Y,{{\mathcal {Z}}},q)}(A, B) \doteq q(A\cap B)-q(A)\cdot q(B) \qquad A,B\in {{\mathcal {Z}}}\end{aligned}$$

Let \((C_X,C_Y)\) be a pair of events with \(C_X\in {{\mathcal {S}}}\) and \(C_Y\in {{\mathcal {Z}}}\). Call this pair f-related if

$$\begin{aligned} f[C_X\cap M] = C_Y\cap N \end{aligned}$$
(5)

Then, since \(M^{{\bot }}\) and \(N^{{\bot }}\) are p-measure (respectively q-measure) zero sets and f is a strict isomorphism between \((M,{{\mathcal {S}}}_M,p_M)\) and \((N,{{\mathcal {Z}}}_N,q_N)\), we have

$$\begin{aligned} p(C_X) = p(C_X\cap M)\end{aligned}$$
(6)
$$\begin{aligned} p(C_X\cap M) = p_M(C_X\cap M)=q_N(f[C_X\cap M])=q_N(C_Y\cap N)\end{aligned}$$
(7)
$$\begin{aligned} q_N(C_Y\cap N) = q(C_Y) \end{aligned}$$
(8)

If \((A_X,A_Y)\) and \((B_X,B_Y)\) are both f-related, then applying (6)–(8) to \(C_X=A_X,B_X, (A_X\cap B_X)\) and \(C_Y=A_Y,B_Y, (A_Y\cap B_Y)\) we obtain

$$\begin{aligned} {\text {Corr}}_{(X,{{\mathcal {S}}},p)}(A_X, B_X) = {\text {Corr}}_{(Y,{{\mathcal {Z}}},q)}(A_Y, B_Y) \end{aligned}$$
(9)

Equation (9) means that the notion of a correlation function is invariant under mod0 isomorphisms; hence it satisfies the necessary condition to be strongly probabilistic in the spirit of the Maxim of Probabilism. Note that if \(f:X\rightarrow Y\) is a strict isomorphism between \((X, {{\mathcal {S}}},p)\) and \((Y, {{\mathcal {Z}}}, q)\), then (9) is simply

$$\begin{aligned} {\text {Corr}}_{(X,{{\mathcal {S}}},p)}(A, B) = {\text {Corr}}_{(Y,{{\mathcal {Z}}},q)}(f[A], f[B]) \end{aligned}$$
(10)

for all \(A,B\in {{\mathcal {S}}}\). The content of (10) is that taking the same definition of a correlation function in strictly isomorphic spaces yields the same function, and this means in particular that the notion of a correlation function is invariant under strict isomorphisms.
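The invariance expressed by (10) can be checked numerically on a finite example. In the sketch below (ours; the names and point masses are illustrative), q is defined as the image measure of p under a bijection f, which makes f a strict isomorphism, and the correlation of a pair of events agrees with the correlation of their images.

```python
# Invariance of Corr(A,B) = p(A ∩ B) - p(A)p(B) under a strict isomorphism:
# q is the image measure of p under the relabeling bijection f, so by Eq. (10)
# the correlation of (A, B) in (X,S,p) equals that of (f[A], f[B]) in (Y,Z,q).
pX = {'h': 0.4, 't': 0.1, 'u': 0.5}
f = {'h': 0, 't': 1, 'u': 2}
qY = {f[x]: w for x, w in pX.items()}

def corr(masses, A, B):
    m = lambda E: sum(masses[x] for x in E)
    return m(A & B) - m(A) * m(B)

def image(A):
    return {f[x] for x in A}

A, B = {'h', 't'}, {'t', 'u'}
lhs = corr(pX, A, B)                  # correlation in (X, S, p)
rhs = corr(qY, image(A), image(B))    # correlation of the images in (Y, Z, q)
```

Here \(p(A\cap B)=0.1\), \(p(A)=0.5\) and \(p(B)=0.6\), so both sides equal \(-0.2\): the events are negatively correlated, and equally so in both spaces.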

Example: measure-theoretic non-atomicity By definition, \((X,{{\mathcal {S}}},p)\) is measure-theoretically purely non-atomic if for any \(A\in {{\mathcal {S}}}\) with \(p(A)>0\) there is a \(B\in {{\mathcal {S}}}\) with \(B\subset A\) such that \(p(A)> p(B) > 0\). We show that if \((X,{{\mathcal {S}}},p)\) is measure-theoretically purely non-atomic, and \((X,{{\mathcal {S}}},p)\) and \((Y,{{\mathcal {Z}}},q)\) are isomorphic mod0, then \((Y,{{\mathcal {Z}}},q)\) is also measure-theoretically purely non-atomic. The proof relies on the fact that for any event \(A\in {{\mathcal {S}}}\), \(p(A) = p(A\cap M) = p_{M}(A\cap M)\) and similarly, for \(B\in {{\mathcal {Z}}}\) we have \(q(B) = q_{N}(B\cap N)\). It follows that \((X, {{\mathcal {S}}}, p)\) is purely non-atomic if and only if \((M, {{\mathcal {S}}}_M, p_M)\) is purely non-atomic. Suppose \((X,{{\mathcal {S}}},p)\) (and thus \((M, {{\mathcal {S}}}_M, p_M)\)) is purely non-atomic. The calculation below shows that the isomorphism f between \((M, {{\mathcal {S}}}_M, p_M)\) and \((N, {{\mathcal {Z}}}_N, q_N)\) preserves non-atomicity, and therefore \((Y, {{\mathcal {Z}}}, q)\) is purely non-atomic as well: Take an \(A\in {{\mathcal {Z}}}\) with \(q(A)>0\). Then \(q_N(A\cap N)>0\), hence \(p_M(f^{-1}[A\cap N]) > 0\). Using non-atomicity of \((M, {{\mathcal {S}}}_M, p_M)\), there is \(B\cap M\in {{\mathcal {S}}}_M\) such that \(B\cap M\subset f^{-1}[A\cap N]\) and \(p_M(f^{-1}[A\cap N])> p_M(B\cap M) > 0\). Now, f being an isomorphism ensures \(f[B\cap M]\subset A\cap N\) and \(q_N(A\cap N)> q_N(f[B\cap M])>0\), which completes the proof.

Measure-theoretically purely non-atomic spaces are not rare: the Lebesgue measure on [0, 1] defines a purely non-atomic probability measure space. Moreover, we have

Proposition 3.4

(Walters (1982)[p. 55]) Every purely non-atomic probability measure space \((X,{{\mathcal {S}}},p)\), where X is a complete metric space and \({{\mathcal {S}}}\) is the Borel \(\sigma \)-algebra, is isomorphic mod0 to the probability space given by [0, 1] with the Lebesgue measure.

It is noteworthy that purely non-atomic probability spaces also have philosophically relevant features: they are common cause complete in the sense that they contain a common cause of every correlation \({\text {Corr}}_{(X,{{\mathcal {S}}},p)}(A, B)>0\), see Gyenis and Rédei (2011), Marczyk and Wronski (2015), Gyenis and Rédei (2014), Hofer-Szabó et al. (2013). Thus common cause completeness also satisfies the necessary condition to be strongly probabilistic.

4 Conditioning and the Maxim of Probabilism

The general concept of conditioning in the measure-theoretic formalism is based on the notion of conditional expectation, which was introduced into probability theory by Kolmogorov in Kolmogorov (1956) together with his axiomatization. Given \((X,{{\mathcal {S}}},p)\), and a \(\sigma \)-subalgebra \({{\mathcal {A}}}\) of \({{\mathcal {S}}}\), a map

$$\begin{aligned} \mathscr {E}(\cdot \mid {{\mathcal {A}}}) :{{\mathcal {L}}}^1(X,{{\mathcal {S}}},p) \rightarrow {{\mathcal {L}}}^1(X,{{\mathcal {S}}},p) \end{aligned}$$
(11)

is an \({{\mathcal {A}}}\)-conditional expectation on the set of integrable real-valued random variables \({{\mathcal {L}}}^1(X,{{\mathcal {S}}},p)\) if (i) for all \(f\in {{\mathcal {L}}}^1(X,{{\mathcal {S}}},p)\), the function \(\mathscr {E}(f \mid {{\mathcal {A}}})\) is \({{\mathcal {A}}}\)-measurable; and (ii) it preserves the integral: \(\int _Z \mathscr {E}(f \mid {{\mathcal {A}}}) d p= \int _Z f\; dp\) for all \(Z\in {{\mathcal {A}}}\). It is important that the conditional expectation exists as a consequence of the Radon–Nikodym theorem, but it is unique only up to sets of p-probability zero. Conditional expectations that differ only on a set of p-probability zero are called versions. \(\mathscr {P}(\cdot \mid {{\mathcal {A}}})\) denotes the restriction of \(\mathscr {E}(\cdot \mid {{\mathcal {A}}})\) to (the characteristic functions of) \({{\mathcal {S}}}\). Conditional probabilities of random events \(B\in {{\mathcal {S}}}\), as real numbers, are defined in this framework of conditioning in the following manner:

Let \(q_{{{\mathcal {A}}}}\) be a probability measure on the Boolean sub-\(\sigma \)-algebra \({{\mathcal {A}}}\) that is absolutely continuous with respect to the restriction of p to \({{\mathcal {A}}}\). Then \(q_{{{\mathcal {A}}}}\) yields a density function g (the Radon–Nikodym derivative), and the conditional probability \(q(B\mid {{\mathcal {A}}})\) of \(B\in {{\mathcal {S}}}\) on condition that the probabilities of events in \({{\mathcal {A}}}\) are given by the probability measure \(q_{{{\mathcal {A}}}}\) is, by definition

$$\begin{aligned} q(B\mid {{\mathcal {A}}})\doteq \int _X g \ \mathscr {P}(\chi _B\mid {{\mathcal {A}}})\ dp \end{aligned}$$
(12)

where \(\chi _B\) is the characteristic (indicator) function of B. It can be shown (see e.g. Gyenis and Rédei (2017)) that formula (12) reduces to the Jeffrey rule, if \({{\mathcal {A}}}\) is generated by a countable (measurable) partition of X, and that the formula (12) yields Bayes’ rule, if \({{\mathcal {A}}}\) is generated by one single event A on which \(q_{{{\mathcal {A}}}}\) takes value 1, provided \(q_{{{\mathcal {A}}}}\) is absolutely continuous with respect to p. Thus conditionalization using the notion of conditional expectation is a general form of Bayesian conditionalization.
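On a finite space this reduction can be verified by direct computation. The sketch below (weights, cells, and events are invented for illustration) evaluates formula (12) with \({{\mathcal {A}}}\) generated by a two-cell partition and checks that it agrees with Jeffrey's rule \(q(B) = \sum _i q_{{{\mathcal {A}}}}(A_i)\, p(B\mid A_i)\):

```python
# Jeffrey's rule recovered from formula (12) on a six-point space.
p = [0.1, 0.2, 0.1, 0.25, 0.15, 0.2]   # prior probabilities of the points
cells = [{0, 1, 2}, {3, 4, 5}]         # partition generating A
qA = [0.7, 0.3]                        # new probabilities of the two cells
B = {1, 3, 4}                          # event whose new probability we want

p_cell = [sum(p[x] for x in c) for c in cells]
cell_of = {x: i for i, c in enumerate(cells) for x in c}

# Radon-Nikodym density g = dq_A/dp: constant q_A(cell)/p(cell) on each cell.
g = [qA[cell_of[x]] / p_cell[cell_of[x]] for x in range(6)]

# P(chi_B | A): the cell-wise average of the indicator of B,
# i.e. p(B & cell)/p(cell), constant on each cell.
PB = [sum(p[y] for y in B & cells[cell_of[x]]) / p_cell[cell_of[x]]
      for x in range(6)]

# Formula (12): q(B|A) = integral over X of g * P(chi_B|A) dp.
q_B = sum(g[x] * PB[x] * p[x] for x in range(6))

# Jeffrey's rule directly: sum_i q_A(cell_i) * p(B | cell_i).
jeffrey = sum(qA[i] * sum(p[y] for y in B & c) / p_cell[i]
              for i, c in enumerate(cells))
assert abs(q_B - jeffrey) < 1e-12
```

Taking the partition to consist of a single cell \(A\) with \(q_{{{\mathcal {A}}}}(A)=1\) collapses the same computation to Bayes' rule \(q(B)=p(B\cap A)/p(A)\).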

One also finds in the literature, however, a somewhat controversial interpretation of this kind of conditioning: the value of the function \(\mathscr {P}(\chi _B\mid {{\mathcal {A}}})\) on \(x\in X\) is sometimes viewed as the “conditional probability of B on condition \(\{x\}\)”:

$$\begin{aligned} \underbrace{\mathscr {P}(\chi _B\mid {{\mathcal {A}}})(x)}_{\text{ conditional } \text{ probability } \text{ of } B \text{ on } \text{ condition } \{x\}} \qquad B\in {{\mathcal {S}}}\end{aligned}$$
(13)

Since it can happen that the p-probability of \(\{x\}\) is zero, \(p(\{x\})=0\), formula (13) would then yield a conditional probability of B on the probability zero event \(\{x\}\), which is regarded as a major virtue of this “Kolmogorovian conditioning”.

It has been recognized in the mainstream literature on probability theory that this concept of conditional probability with respect to probability zero conditioning events is not unproblematic (Rao (2005)[p. 62]; Rosenthal (2006)[pp. 153, 156]; “Difficulties and Curiosities” in Billingsley (1995)[pp. 437–439]). All the problems are related to the fact that, since \(\mathscr {P}(\cdot \mid {{\mathcal {A}}})\) is the restriction of a version of the \({{\mathcal {A}}}\)-conditional expectation, \(\mathscr {P}(\cdot \mid {{\mathcal {A}}})\) is itself only a version: Different versions yield different values for \(\mathscr {P}(\chi _B\mid {{\mathcal {A}}})(x)\); in fact, if \(p(\{x\})=0\), then for any real number r (in particular any real number r in [0, 1]) there is a version that yields r as the value of the “conditional probability of B on condition \(\{x\}\)”. Which values are then the “real” conditional probabilities on condition \(\{x\}\)?
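The arbitrariness of versions at probability zero points is easy to exhibit. In the following Python sketch (the function, the point, and the value r are chosen arbitrarily for illustration), \(X=[0,1]\) carries a Riemann-sum approximation of the Lebesgue measure and \({{\mathcal {A}}}\) is the trivial \(\sigma \)-algebra, so \(\mathscr {E}(f\mid {{\mathcal {A}}})\) is the constant function with value \(\int f\, dp\); altering a version at a single point \(x_0\) leaves the integral-preservation condition intact because \(\{x_0\}\) is a p-null set:

```python
# X = [0,1] with (approximate) Lebesgue measure; A = {emptyset, X}.
# Then E(f|A) is the constant function with value integral of f.
f = lambda x: x * x
n = 100_000
grid = [(i + 1) / n for i in range(n)]   # right-endpoint Riemann grid

integral_f = sum(f(x) for x in grid) / n  # approximately 1/3

version1 = lambda x: integral_f                    # the "natural" version
x0, r = 2 ** -0.5, 42.0                            # null point, arbitrary value
version2 = lambda x: r if x == x0 else integral_f  # differs only on {x0}

# Both versions preserve the integral over X (the only nontrivial Z in A):
# the grid never hits the irrational point x0, and in the genuine Lebesgue
# integral the difference on the null set {x0} contributes nothing.
i1 = sum(version1(x) for x in grid) / n
i2 = sum(version2(x) for x in grid) / n
assert i1 == i2
assert abs(i1 - integral_f) < 1e-9

# Yet at x0 the two versions disagree: version2 assigns the arbitrary
# value r = 42 as the "conditional expectation on condition {x0}".
assert version1(x0) != version2(x0)
```

Since r was arbitrary, any real number can be realized as the value of some version at \(x_0\), which is exactly the ambiguity described above.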

Another problem is that for a fixed \(x\in X\) the map

$$\begin{aligned} {{\mathcal {S}}}\ni A\mapsto \mathscr {P}(\chi _A\mid {{\mathcal {A}}})(x)\in \mathrm{I\!R}\end{aligned}$$
(14)

is not a countably additive map on \({{\mathcal {S}}}\) in general (Rao (2005)[p. 47], Billingsley (1995)[pp. 438–439]). So for a fixed x the map (14) is not a probability measure on \({{\mathcal {S}}}\); hence the values given by equation (13) are not probabilities—if one takes seriously the Kolmogorovian specification of what probabilities are: They are given by a probability measure (which is countably additive). Saying that “\(\ldots \) conditional probabilities behave ‘essentially’ like ordinary probabilities” Rosenthal (2006)[p. 156] is just acknowledging that they are not probabilities.

One might want to say: The “real” conditional probabilities are the ones given by a version for which the map in Eq. (14) is countably additive. A conditional probability \(\mathscr {P}(\cdot \mid {{\mathcal {A}}})\) is called regular if the map in (14) is countably additive for p-almost all x Rao (2005)[p. 46]. The problem is that such a regular version might not exist: there are probability spaces that admit no regular conditional probability Billingsley (1995)[pp. 438–439; 443].

These difficulties are well known, and the reactions to them are mixed. Billingsley disregards the difficulty, saying “\(\ldots \) it does not matter that conditional probabilities may not, in fact, be measures.” Billingsley (1995)[p. 439]. One reason he regards this dismissal as justified is that one can show that \(\mathscr {P}(\cdot \mid {{\mathcal {A}}})(x)\) is additive for an x that has positive p-measure Billingsley (1995)[p. 439]. But this is not helpful if one wishes to maintain that

The whole point of this Sect. [on conditional expectations] is the systematic development of a notion of conditional probability that covers conditioning with respect to events of probability 0. This is accomplished by conditioning with respect to collections of events – that is, with respect to \(\sigma \)-fields\(\ldots \) Billingsley (1995)[p. 432]

Rosenthal’s assessment Rosenthal (2006)[p. 153] of this “accomplishment” amounts to acknowledging that the goal has not been achieved.

Rao’s reaction:

All these studies show that conditioning in the general case is not simple, and the occasional counterexamples served only to deepen the mystery of the subject. Rao (2005)[p. 62]

The alleged mystery involved in conditioning via conditional expectations disappears naturally, however, if one keeps in mind the Maxim of Probabilism: that a concept is genuinely probabilistic only if it is invariant with respect to measure-theoretic isomorphisms:

Assume that \((X,{{\mathcal {S}}},p)\) and \((Y,{{\mathcal {Z}}},q)\) are strictly isomorphic via a strict isomorphism \(f:X\rightarrow Y\). Let \({{\mathcal {A}}}\) be a sub-\(\sigma \)-field of \({{\mathcal {S}}}\). Then \({{\mathcal {A}}}\) is taken by f into a sub-\(\sigma \)-field \(f[{{\mathcal {A}}}]\doteq \{f[A] : A\in {{\mathcal {A}}}\}\). Consider versions \(\mathscr {E}(\cdot \mid {{\mathcal {A}}})\) and \(\mathscr {E}(\cdot \mid f[{{\mathcal {A}}}])\) of the conditional expectations

$$\begin{aligned} \mathscr {E}(\cdot \mid {{\mathcal {A}}})&:&{{\mathcal {L}}}^1(X,{{\mathcal {S}}}, p)\rightarrow {{\mathcal {L}}}^1(X,{{\mathcal {S}}}, p) \end{aligned}$$
(15)
$$\begin{aligned} \mathscr {E}(\cdot \mid f[{{\mathcal {A}}}])&:&{{\mathcal {L}}}^1(Y,{{\mathcal {Z}}}, q)\rightarrow {{\mathcal {L}}}^1(Y,{{\mathcal {Z}}}, q) \end{aligned}$$
(16)

Assume that there is \(x_0\in X\) such that \(p(\{x_0\})=q(\{f(x_0)\})=0\). Then for a \(g\in {{\mathcal {L}}}^1(X,{{\mathcal {S}}},p)\) either

$$\begin{aligned} \mathscr {E}(g\mid {{\mathcal {A}}})(x_0)\not =\mathscr {E}(g\circ f^{-1}\mid f[{{\mathcal {A}}}])(f(x_0)) \end{aligned}$$

or, if

$$\begin{aligned} \mathscr {E}(g\mid {{\mathcal {A}}})(x_0)=\mathscr {E}(g\circ f^{-1}\mid f[{{\mathcal {A}}}])(f(x_0)) \end{aligned}$$

then we can take another version \(\mathscr {E}'(\cdot \mid {{\mathcal {A}}})\) such that

$$\begin{aligned} \mathscr {E}'(g\mid {{\mathcal {A}}})(x_0)\doteq \mathscr {E}(g\mid {{\mathcal {A}}})(x_0)+r \qquad \text{ for } \text{ some } r\not =0 \end{aligned}$$

and then

$$\begin{aligned} \mathscr {E}'(g\mid {{\mathcal {A}}})(x_0)\not =\mathscr {E}(g\circ f^{-1}\mid f[{{\mathcal {A}}}])(f(x_0)) \end{aligned}$$

This means that the concept of a particular version of the conditional expectation (or rather the value of a version at a given point) is not invariant with respect to strict isomorphisms.

But definition (12) does yield a unique probability value: whichever version of \(\mathscr {P}(\cdot \mid {{\mathcal {A}}})\) one takes in (12), the conditional probability defined by (12) is the same, since the p-integral is insensitive to differences on p-measure zero sets. This can be expressed formally by stating that the conditional expectation is unique if considered as a map on the space of equivalence classes of integrable random variables, where the equivalence relation is “equal except on a p-probability zero set”. Consequently, the conditional probability values provided by the (unique) conditional expectation lack the ambiguity involved in versions. And this can also be expressed in terms of the Maxim of Probabilism: Using the fact that a mod0 isomorphism generates an isomorphism of the spaces of equivalence classes of integrable random variables, one can show (Gyenis and Rédei (2020)) that the conditional expectation is invariant with respect to mod0 isomorphisms – and so are the conditional probabilities defined by it in the manner of (12).
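The insensitivity of (12) to the choice of version can be spelled out in one line: if \(\mathscr {P}(\cdot \mid {{\mathcal {A}}})\) and \(\mathscr {P}'(\cdot \mid {{\mathcal {A}}})\) are two versions, they differ only on a set N with \(p(N)=0\); hence

$$\begin{aligned} \int _X g\ \mathscr {P}(\chi _B\mid {{\mathcal {A}}})\ dp - \int _X g\ \mathscr {P}'(\chi _B\mid {{\mathcal {A}}})\ dp = \int _N g\,\big (\mathscr {P}(\chi _B\mid {{\mathcal {A}}})-\mathscr {P}'(\chi _B\mid {{\mathcal {A}}})\big )\ dp = 0 \end{aligned}$$

because the last integral is taken over a p-null set. So \(q(B\mid {{\mathcal {A}}})\) in (12) is the same real number for every version.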

So the situation is the following:

(i) The concept of conditional expectation viewed on the space of equivalence classes of functions satisfies the necessary condition for a concept to be strongly probabilistic.

(ii) The notion of a version of a conditional expectation does not satisfy the necessary condition for a concept to be even weakly probabilistic.

So the “mystery” Rao mentions is explained by the fact that specific versions of conditional expectations are not determined probabilistically: they are not purely probabilistic. There is no “canonical version” of a conditional expectation in general; choosing a particular version can only be motivated by considerations that involve non-probabilistic elements. This consequence of the Maxim of Probabilism also helps in understanding certain features of the Borel-Kolmogorov Paradox. We discuss this in the next section on the basis of Gyenis et al. (2017).

5 The Maxim of Probabilism and the Borel-Kolmogorov Paradox

The Borel-Kolmogorov Paradox arises from the question “What is the conditional probability on a great circle of a sphere in 3 dimensions if one assumes the uniform probability measure on the sphere?” One might have the intuition that the conditional probability in question is determined and is the uniform probability. But the usual definition of conditional probability by the ratio formula (on which Bayes’ rule is based) does not yield any conditional distribution on the great circle, because any great circle has probability zero in the uniform measure on the sphere. This tension between the intuition and the definition of conditional probability by the ratio formula is the Borel-Kolmogorov Paradox. It has been extensively discussed both in probability theory proper (see Kolmogorov (1956)[pp. 50–51], Billingsley (1995)[p. 441], Bungert and Wacker (2020), de Finetti (1972)[p. 203], Proschan and Presnell (1998), Rao (1988), Rao (2005)[p. 65], Seidenfeld et al. (2001)) and in the literature on philosophy of probability (see Borel (1909)[pp. 100–104], Easwaran (2008), Hájek (2003), Jaynes (2003)[p. 470], Howson (2014), Myrvold (2015), Gyenis et al. (2017), Rescorla (2015), Seidenfeld (2001)).

Kolmogorov (1956)[pp. 50–51] argued that the paradox is resolved if one conditionalizes using the concept of conditional expectation: He specified a \(\sigma \)-field \({{\mathcal {A}}}\) on the sphere containing the great circle and calculated a version of the corresponding \({{\mathcal {A}}}\)-conditional expectation determined by the \(\sigma \)-field \({{\mathcal {A}}}\) and by the uniform probability on the sphere. This version yields a conditional probability on the great circle, but this conditional probability is not the uniform probability on the great circle. Although this is counterintuitive, it is a consequence of how Kolmogorov chose the \(\sigma \)-field \({{\mathcal {A}}}\); and one can show (Gyenis et al. (2017)) that choosing a different \(\sigma \)-field \({{\mathcal {B}}}\) one obtains a version of the corresponding \({{\mathcal {B}}}\)-conditional expectation that does yield the uniform conditional probability on the great circle. It was argued in Gyenis et al. (2017) that obtaining both a non-uniform and the uniform conditional probability on the great circle is not a contradiction because the respective two \(\sigma \)-fields \({{\mathcal {A}}}\) and \({{\mathcal {B}}}\) are not isomorphic, hence they represent different conditioning conditions. More importantly: both the uniform and the non-uniform conditional probability on the great circle are given by specific versions of the respective \({{\mathcal {A}}}\)- and \({{\mathcal {B}}}\)-conditional expectations. And since versions of the conditional expectations are not determined probabilistically (as shown in Sect. 4), neither the Kolmogorovian non-uniform conditional probability, nor the uniform conditional probability on the great circle are determined probabilistically by the facts that (i) one has the uniform probability on the sphere and (ii) one fixes as conditioning \(\sigma \)-fields \({{\mathcal {A}}}\) or \({{\mathcal {B}}}\).
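The two conditional probabilities on a great circle can be written down explicitly in spherical coordinates. The sketch below follows the standard presentation of Kolmogorov's computation (it is not a reproduction of the authors' own calculations): in latitude–longitude coordinates the uniform surface measure has density proportional to \(\cos \theta \); conditioning on the \(\sigma \)-field generated by longitude yields the non-uniform density \(\frac{1}{2}\cos \theta \) on a meridian (half of a great circle of fixed longitude), while conditioning on the \(\sigma \)-field generated by latitude yields the uniform density on the equator. The code only verifies numerically that both are normalized probability densities:

```python
import math

# Uniform measure on the sphere in latitude/longitude coordinates:
# surface element proportional to cos(theta) d(theta) d(phi),
# theta in [-pi/2, pi/2] (latitude), phi in [-pi, pi] (longitude).

def meridian_density(theta):
    """Conditional density on a meridian obtained by conditioning on the
    sigma-field generated by longitude (Kolmogorov's choice): NOT uniform."""
    return 0.5 * math.cos(theta)

def equator_density(phi):
    """Conditional density on the equator obtained by conditioning on the
    sigma-field generated by latitude: uniform in phi."""
    return 1.0 / (2.0 * math.pi)

# Midpoint-rule check that each density integrates to 1 over its circle.
n = 200_000
dt = math.pi / n
total_meridian = sum(meridian_density(-math.pi / 2 + (i + 0.5) * dt) * dt
                     for i in range(n))
dphi = 2 * math.pi / n
total_equator = sum(equator_density(-math.pi + (i + 0.5) * dphi) * dphi
                    for i in range(n))
assert abs(total_meridian - 1.0) < 1e-6
assert abs(total_equator - 1.0) < 1e-6
```

Both densities are legitimate answers; which one is obtained depends on the chosen conditioning \(\sigma \)-field and on the version of the conditional expectation, not on the uniform measure alone.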

This indeterminateness of the conditional probability on the great circle, even in the framework of conditioning via conditional expectations, is concealed by the deceptive determinateness of the versions of the \({{\mathcal {A}}}\)- and \({{\mathcal {B}}}\)-conditional expectations. This seeming determinateness of the versions is due to the fact that the sphere is a two-dimensional surface, which allows integrating two-place functions on the sphere with respect to one variable. It is this particular structure of the probability space on the sphere that leads to a natural selection of a version of the \({{\mathcal {A}}}\)-conditional expectation featuring in Kolmogorov’s resolution – and also of the \({{\mathcal {B}}}\)-conditional expectation yielding the uniform conditional probability on the great circle (see Gyenis et al. (2017) for details). But linear dimension is not a property that is invariant with respect to measure-theoretic isomorphisms mod0: The probability measure space consisting of the two-dimensional sphere with the uniform probability on its Lebesgue measurable sets is a purely non-atomic probability space, with the sphere being a complete metric space; hence Proposition 3.4 applies, and so the sphere with its uniform probability is isomorphic mod0 to the unit interval with the Lebesgue measure on it. In this latter probability space there is no natural selection of a version of the conditional expectation that corresponds to the version of the \({{\mathcal {A}}}\)-conditional expectation Kolmogorov chose, nor is there a natural choice of a version of the conditional expectation that corresponds to the version of the \({{\mathcal {B}}}\)-conditional expectation that yields the uniform conditional probability on the great circle.
The Maxim of Probabilism tells us then that selecting either the version in the Kolmogorov resolution or in the resolution yielding the uniform conditional probability on the great circle does not satisfy the necessary condition to be even weakly probabilistic: the selections involve non-probabilistic features of the situation.

The upshot is that the Maxim of Probabilism tells us that the conditional probability on any given great circle is probabilistically genuinely undetermined by the assumption of the uniform probability on the sphere. Tacit, non-probabilistic reasoning (e.g. symmetry considerations (Gyenis et al. (2017))) plays a role in creating the intuition that the conditional probability on a great circle is determined probabilistically by the uniform probability on the sphere. But the group-theoretic structure of the sphere on which the symmetry considerations are based is also not invariant with respect to measure-theoretic isomorphisms.

6 Concluding comments

The Maxim of Probabilism only gives a necessary condition to be satisfied by a concept in order to qualify as probabilistic. Why not sufficient as well? If one views mathematical concepts as originating in the attempts to describe natural and social phenomena (as for instance von Neumann saw mathematics (Rédei (2020))), then no necessary and sufficient conditions are feasible that relate a mathematical structure exclusively to a specific circle of phenomena, because typically there is a large variety of phenomena whose main features are described by the same mathematical structure. This is so with (bounded) measure theory as well: many diverse phenomena can be described mathematically in terms of bounded measure theory in addition to those that can be regarded as probabilistic in an intuitive sense.

Viewed from this empiricist perspective, Reichenbach’s attempt at axiomatizing probability theory aims at specifying a mathematical structure that is richer than measure theory, embodying extra content: the frequency interpretation. This leads to the difficulty that a finite frequency interpretation is too constraining, so one has to allow that probabilities are limits of relative frequencies in infinite ensembles – but this latter view does not have a direct empirical basis. So Reichenbach creates one artificially by formulating the transcendental (non-empirical) principle of lawful distribution (Axiom of Induction). By assigning a major function to this principle in the foundations of probability theory, Reichenbach moves away from empiricism; on the other hand, by insisting on expressing the frequency content in the mathematical axioms of probability theory, he tried to remain very close to an empiricist position. We regard this tension as the fundamental reason for the difficulties in his foundational work on probability, which nevertheless remains a rich source of inspiration – as we hope the idea of utilizing the notion of isomorphism in the foundations of probability shows.