Falsifiable ⟹ Learnable

David Balduzzi, Victoria University of Wellington

The paper demonstrates that falsifiability is fundamental to learning. We prove the following theorem for statistical learning and sequential prediction: If a theory is falsifiable then it is learnable, i.e. it admits a strategy that predicts optimally. An analogous result is shown for universal induction.

A theory that explains everything, [predicts] nothing. – attributed to Karl Popper.

0. INTRODUCTION

To what extent are theory-based predictions justified by prior observations? The question is known as the problem of induction and is fundamental to scientific inference.

We address the problem of induction from the perspective of learning theory. That is, we consider which theories, and under what assumptions, can be applied to make optimal predictions. Our main result is that the more hypotheses a theory falsifies, suitably quantified, the closer the predictive performance of the best strategy (based on the theory) will be to the theory's post hoc explanatory performance on observed data.

0.0. Non-technical overview (or, Learning theory for the working scientist)

Learning theorists have characterized the generalization performance of algorithms in a wide range of scenarios. Although none of these scenarios adequately captures the practice of scientific inference, they form a family of minimal models of prediction. An intuitive understanding of the main results of learning theory therefore belongs in every scientist's conceptual toolkit. Unfortunately, the results are phrased in opaque terminology that depends on specialized concepts such as Rademacher complexity, shattering coefficients and VC-dimensions.

This paper presents basic results from learning theory in terminology that is meaningful to the broader scientific community. The results cover three scenarios. In each scenario, Forecaster uses a theory (or theories) to predict Nature's next move(s) based on Nature's previous moves.

S2.
Statistical learning (SLT). Forecaster aims to predict events sampled from an unknown probability distribution based on a finite sample [Vapnik 1995; Boucheron et al. 2000; Bousquet et al. 2004].

S3. Sequential prediction (SEQ). Forecaster aims to predict events generated by an adversarial Nature that adapts to Forecaster's previous moves [Cesa-Bianchi and Lugosi 2006; Abernethy et al. 2009; Rakhlin et al. 2014].

S4. Universal induction (UNI). Forecaster aims to predict elements drawn from an arbitrarily chosen computable sequence [Solomonoff 1964; Hutter 2011].

The paper develops the following account.

A. The risk.

- The risk of a theory is how accurately it explains a sequence of events. A theory explains a sequence of events perfectly if it contains a predictor that correctly labels every element. In general, the accuracy of an explanation is the fraction of the sequence that its best predictor explains correctly.

- The risk of a strategy is how accurately it predicts a sequence of events. A strategy picks a predictor based on previously observed events, which it then applies to future events. The strategy's predictive accuracy is the fraction of future events that it labels correctly.

B. Learnability.

- The predictive risk (or regret) on a sequence is the difference between a strategy's predictive accuracy and the theory's explanatory accuracy:

$$\{\text{predictive risk}\} = \{\text{how well strategy predicts}\} - \{\text{how well theory explains}\}$$

The predictive risk measures the strategy's effectiveness. It is not an absolute measure. Effectiveness is relative to a baseline: how well the theory explains the sequence in hindsight. Thus, the predictive risk quantifies the cost of not knowing what Nature will do next, independently of the cost of not having a good model of Nature.
- A strategy is optimal if its predictive risk is asymptotically negligible on any sequence:

$$\{\text{strategy optimal}\} \quad\text{if}\quad \Big[\lim_{n\to\infty} \{\text{predictive risk}\} = 0\Big]$$

The definition of optimal is subtle. An optimal strategy does not necessarily predict accurately. Rather, it predicts about as accurately as the theory explains.

- A theory is learnable if it admits an optimal strategy:

$$\{\text{theory learnable}\} \quad\text{if}\quad \exists\, \{\text{optimal strategy}\}$$

In other words, a theory is learnable if it admits a strategy that predicts future events as well as the theory explains them after the fact.

C. Falsifiability.

- The falsifiability of a theory is the fraction of effective hypotheses about a sequence that it cannot explain. Effective hypotheses are hypotheses about finite sequences. The set of effective hypotheses is necessarily finite. We measure falsifiability in two ways, soft and hard:

$$F := 2 \sum_{\varepsilon \in I} \{\text{fraction of effective hypotheses falsified}\} \cdot \{\text{on fraction } \varepsilon \text{ of data}\}$$

$$G := \frac{\{\text{log-\# of effective hypotheses that theory falsifies}\}}{\log \{\text{\# of effective hypotheses}\}}$$

The two notions are, respectively, the expectation of a risk-induced distribution on errors and the risk's Bayesian information gain, see section 2.3. They are closely related to the statistical and sequential Rademacher complexities and covering numbers, and Kolmogorov complexity.

- A theory is falsifiable if the fraction of effective hypotheses that it falsifies tends to one asymptotically:

$$\{\text{theory falsifiable}\} \quad\text{if}\quad \Big[\lim_{n\to\infty} \{\text{falsifiability}\} = 1\Big]$$

The number of effective hypotheses grows exponentially with sequence length, so the requirement is quite weak. For example, a theory is falsifiable if the number of hypotheses it explains grows polynomially.

D. Falsifiable ⟹ Learnable (SLT, SEQ).

- Main theorem (qualitative).
If a theory is falsifiable, then it is learnable:

$$\{\text{falsifiable}\} \implies \{\text{learnable}\}$$

Alternatively, if a theory is falsifiable then it admits a strategy that predicts optimally, that is, a strategy that predicts any sequence as well, asymptotically, as the theory would have explained the sequence in hindsight.

- Main theorem (quantitative).

$$\{\text{predictive risk}\} \le 1 - \{\text{falsifiability}\}$$

The quantitative version of the main theorem provides guarantees, across all sequences of some finite length n, on the expected performance of a theory's best strategy in terms of the falsifiability of the theory. The qualitative version is a corollary of the quantitative.

E. Falsifiable ⟹ Learnable (UNI).

Universal induction differs significantly from the other two scenarios. We reformulate Solomonoff induction to show that Forecaster constructs a nested sequence of theories in response to observations, from which predictors are drawn uniformly at random. Falsifiability is defined as above in this setting, but it admits a different interpretation:

$$\{\text{falsifiability}\} = \{\text{log-\# hypotheses Forecaster eliminates whilst adapting theory}\}$$

Importantly, Forecaster eliminates hypotheses prior to, and separately from, making predictions.

- Main theorem (quantitative).

$$\{\text{predictive risk}\} \le \{\text{falsifiability}\}$$

In short, the number of hypotheses eliminated (or falsified) by Forecaster whilst adapting its theory controls its predictive performance.

0.1. Outline of the paper and summary of the main contributions

The paper is organized as follows. Section 1 introduces two basic tools: the induced distribution and the Bayesian information gain. When a function has a finite domain, a natural prior on the domain is the uniform distribution, in which case the induced distribution and information gain can be interpreted as different ways of counting elements in pre-images.
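The counting interpretation just mentioned can be illustrated on a toy example. The sketch below is not taken from the paper: the function names, the domain `range(8)` and the example functions are hypothetical, and logarithms are assumed to be base 2. It pushes the uniform prior on a finite domain through a function and reads off the information gain as minus the log of the resulting probability.

```python
from collections import Counter
from math import log2


def induced_distribution(f, domain):
    """Push the uniform prior on a finite domain through f.

    Returns a dict mapping each y in the image of f to |f^{-1}(y)| / |X|.
    """
    counts = Counter(f(x) for x in domain)
    size = len(domain)
    return {y: count / size for y, count in counts.items()}


def information_gain(f, domain, y):
    """Information gain (in bits) of observing output y under a uniform prior.

    This is -log2 of the induced probability of y; it is undefined
    (here: an error) when y lies outside the image of f.
    """
    dist = induced_distribution(f, domain)
    if y not in dist:
        raise ValueError("gain is undefined outside the image of f")
    return -log2(dist[y])


if __name__ == "__main__":
    domain = range(8)                  # hypothetical finite domain
    parity = lambda x: x % 2           # hypothetical function X -> {0, 1}
    print(induced_distribution(parity, domain))
    print(information_gain(parity, domain, 0))
    # A constant function carries no information about its input.
    print(information_gain(lambda x: 0, domain, 0))
```

Both quantities come from the same pre-image counts: the induced distribution normalizes them, and the gain takes their negative logarithm.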
The next three sections consider statistical learning, sequential prediction and universal induction in turn. The sections are variations on a basic template.

The risk is the fundamental object in all three cases, Definition A in sections x.1 for x = 2, 3, 4. The risk is a function from sequences of events to errors that can be computed with respect to strategies or theories. In the first case, the risk quantifies predictive performance of the strategy; in the second, it quantifies explanatory performance of the theory in hindsight. The predictive risk is the (minimax) difference between predictive and explanatory performance, Definition B in sections x.2.

An event is an ordered pair: a process acting on an input. The key step in the paper is to reformulate the risk as a function from hypothetical processes to errors, by fixing the input sequence. The risk is then a function with a finite domain.

We propose two notions of falsifiability [1], Definition C in sections x.3. The first, soft falsifiability, is the expected error under the risk-induced distribution on errors. Intuitively, it is a weighted sum of how many potential hypotheses are falsified over different fractions of the data. The second, hard falsifiability, is the risk's Bayesian information gain. Intuitively, it is the "log-fraction" of falsified hypotheses.

[1] Only hard falsifiability is relevant to universal prediction.

The main result is that soft and hard falsifiability control the predictive risk in all three scenarios, Theorems D & E in sections x.4. Specifically, we show that falsifiability is equivalent to, or is an upper or lower bound on, the relevant measures of capacity: the statistical and sequential Rademacher complexities and covering numbers, and Kolmogorov complexity. The bounds on predictive risk then follow from standard results in learning theory [Boucheron et al. 2000; Bousquet et al. 2004; Hutter 2011; Rakhlin et al. 2014]. Proofs are collected in sections x.5.
The conclusion discusses the results' implications for Popper's account of scientific inference and the problem of induction, section 5.

The main contributions are:

- Relating the formal models of prediction developed by learning theorists to how working scientists think about scientific inference.
- Deriving falsifiability, and so the fundamental measures of capacity and complexity, as natural properties of the optimization problem at hand (the risk, Remark 2).
- Unifying basic notions from information theory, learning theory, and algorithmic complexity under the rubric of falsifiability.

The simplicity of the definitions and resulting theorems, along with the fact that they apply across diverse settings, suggests that falsifiability may be a more natural, flexible concept than capacity.

0.2. Related work

Connections between falsifiability and statistical learning theory were pointed out in [Vapnik 1995; Harman and Kulkarni 2007; Corfield et al. 2009]. However, these works only considered VC dimension, which does not relate to falsifiability as directly as the measures introduced here. Moreover, they only considered the setting of statistical learning.

Preliminary versions of this work were presented in [Balduzzi 2011; 2013].

0.3. Notation

We have endeavored to use similar notation for the three settings. Consequently, we have been forced to overload certain symbols. In particular, superscripts can refer to both Cartesian products, e.g. $X^n = \prod_{t=1}^{n} X$, and disjoint unions, e.g. $Y^\bullet = \bigcup_{n=1}^{\infty} Y^n$.

    indicator function          $\mathbb{I}$      unit interval $[0,1]$          $I$
    0/1 loss                    $\ell$            set of distributions on $X$    $\Delta X$
    expectation                 $\mathbb{E}$      probability distribution       $P$ or $Q$
    risk                        $R$               Bayesian information gain      Gain
    predictive risk (regret)    $V$               Rademacher complexity          Radem
    soft falsifiability         $F$               covering number                Cover
    hard falsifiability         $G$               VC-dimension                   vc
    set of hypotheses           $H$               Littlestone dimension          ldim
    theory                      $O$               Turing machine                 $T$

We restrict to binary classification in this paper.

1.
THE BAYESIAN INFORMATION GAIN AND THE INDUCED DISTRIBUTION

This section presents Bayesian information gain and the induced distribution. They will be used to quantify falsifiability in sections x.3.

Suppose that $X$ is a finite set, and that we are given a conditional distribution $P_m(y|x)$ and a prior $P_X$ on $X$. The conditional distribution models a noisy channel $m$ connecting $X$ to $Y$.

Definition 1 (Bayesian information gain; induced distribution). The Bayesian information gain when $m$ outputs $y$ is

$$\mathrm{Gain}(m, y, P_X) := D\big[ P_m(X|y) \,\big\|\, P_X(X) \big],$$

where

$$D[P \| Q] := \sum_{x \in X} P(x) \log \frac{P(x)}{Q(x)}$$

is the Kullback-Leibler divergence. The posterior $P_m(x|y)$ is computed via Bayes' rule

$$P_m(x|y) = \frac{P_m(y|x) \cdot P_X(x)}{P_m(y)}, \quad\text{where}\quad P_m(y) = \sum_{x \in X} P_X(x)\, P_m(y|x)$$

is the $m$-induced distribution on $Y$.

The Bayesian information gain quantifies how much observing $y$ reduces uncertainty about $X$. We remark that

Proposition 1. The mutual information communicated across $m$ is the expected information gain

$$I_m(X, Y) = \mathbb{E}_{y \sim P_m(Y)}\, \mathrm{Gain}(m, y, P_X),$$

where the expectation is with respect to the $m$-induced distribution on $Y$.

Remark 1 (uniform priors on finite sets). Unless otherwise specified, finite sets are given the uniform prior: $P_{\mathrm{unif}}(x) = \frac{1}{|X|}$. We write $\mathrm{Gain}(m, y)$ as a shorthand for $\mathrm{Gain}(m, y, P_{\mathrm{unif}})$.

Given a function $f : X \to Y$, define the corresponding conditional distribution

$$P_f(y|x) = \begin{cases} 1 & \text{if } y = f(x) \\ 0 & \text{else.} \end{cases}$$

Lemma 2. Given a function $f : X \to Y$, the $f$-induced distribution on $Y$ is

$$P_f(y) = \begin{cases} \dfrac{|f^{-1}(y)|}{|X|} & \text{if } y \in \mathrm{im}(f) \\ 0 & \text{else.} \end{cases}$$

The Bayesian information gain is

$$\mathrm{Gain}(f, y) = \begin{cases} -\log P_f(y) & \text{if } y \in \mathrm{im}(f) \\ \text{undefined} & \text{else.} \end{cases}$$

Lemma 3. The information gain is zero, $\mathrm{Gain}(f, y) = 0$, if and only if $f(x) = y$ for all $x \in X$.

2. STATISTICAL LEARNING

Statistical learning is concerned with inductive inference under the assumption that observations are drawn independently from an unknown, but fixed, probability distribution. This section introduces falsifiability in detail.
The later sections on sequential prediction and universal induction rely in part on the presentation developed here.

2.0. Setup

Let $X$ be an arbitrary set and $Y = \{0, 1\}$. Let $Z = X \times Y$. A datum $z = (x, y)$ in $Z$ consists of an input $x$ and an outcome or label $y$. A process is a map $\sigma : X \to Y$ from inputs to outcomes. The hypothesis space $H := Y^X = \{\sigma : X \to Y\}$ is the set of all processes. Finally, an event $(x, \sigma)$ is an element of $X \times H$.

A theory is a set of hypotheses, $O \subset H$. Elements of the theory are referred to as predictors. Of course, by definition a predictor is also a hypothesis. Let $\ell : O \times X \times Y \to I$ denote the 0/1 loss:

$$\ell(f, x, y) = \mathbb{I}[f(x) \neq y] = \begin{cases} 0 & \text{if } f(x) = y \\ 1 & \text{else.} \end{cases}$$

Predictor $f$ explains [2] datum $(x, y)$ if $\ell(f, x, y) = 0$. If not, then $(x, y)$ falsifies $f$.

[2] Clearly, we are using 'explain' in a very weak, technical sense.

2.1. The risk (SLT)

We assume throughout this section that the sample $\vec{x}$ contains $n$ distinct points. Let $X^\bullet = \bigcup_{t=1}^{\infty} X^t$ denote the set of finite sequences of elements of $X$. We typically refer to sequences $\vec{x} = (x_1, \ldots, x_n)$ rather than sets $\{x_1, \ldots, x_n\}$ to keep notation and terminology consistent across sections.

Definition A (risk, SLT). The risk of theory $O$ on sequences of events is

$$R^{\mathrm{SLT}}_O : H \times X^\bullet \to I : (\sigma, \vec{x}) \mapsto \inf_{f \in O} \frac{1}{n} \sum_{t=1}^{n} \ell\big(f, x_t, \sigma(x_t)\big),$$

where $n = \mathrm{len}(\vec{x})$. The risk on distributions on data is

$$R^{\mathrm{SLT}}_O : \Delta Z \to I : P_Z \mapsto \inf_{f \in O} \mathbb{E}_{z \sim P_Z}\, \ell(f, z).$$

The risk quantifies the fraction of events that the best predictor in $O$ labels incorrectly, that is, the fraction of events that the theory cannot explain:

$$R_O : \{\text{sequence of events}\} \mapsto \{\text{fraction of sequence that } O \text{ cannot explain}\}.$$

The risk is zero if and only if there is a predictor in $O$ that explains the entire sequence of events perfectly.

The set of hypotheses is not finite in general. However, since datasets are always finite, it turns out that the effective set of hypotheses is finite.

Definition 2 (effective hypotheses). Given a sequence $\vec{x} = (x_1, \ldots, x_n)$ of inputs, we say that two hypotheses $\sigma_1$ and $\sigma_2$ in $H$ are equivalent, $\sigma_1 \sim \sigma_2$, if and only if $\sigma_1(x_t) = \sigma_2(x_t)$ for all $t \in \{1, \ldots, n\}$. We refer to an equivalence class $[\sigma] = \{\tau \in H \mid \sigma \sim \tau\}$ of hypotheses as an effective hypothesis and let $H_{\mathrm{ef}} = \{[\sigma] \mid \sigma \in H\}$ denote the set of effective hypotheses.

Since $\vec{x}$ contains $n$ elements, it follows that there is a finite number ($2^n$) of effective hypotheses. Two hypotheses in the same equivalence class are indistinguishable on the observed data, and thus indistinguishable to the risk.

Given a sequence of $n$ inputs $\vec{x}$, the risk can be written as a function taking effective hypotheses about $\vec{x}$ to errors:

$$R^{\mathrm{SLT}}_{O,\vec{x}} : H_{\mathrm{ef}} \to I : [\sigma] \mapsto \inf_{f \in O} \frac{1}{n} \sum_{t=1}^{n} \ell\big(f, x_t, \sigma(x_t)\big). \tag{A}$$

Formulated in this way, the risk quantifies how well theory $O$ explains the action of a hypothetical process $\sigma$ on input sequence $\vec{x}$. More precisely, the risk $\varepsilon = R_{O,\vec{x}}(\sigma)$ is the fraction of the inputs that the best predictor $f$ in $O$ misclassifies when labels are generated by $\sigma$.

2.2. Learnability (SLT)

A theory is learnable if it admits a strategy whose predictions match the theory's best post hoc explanation.

A strategy specifies the predictor that Forecaster will deploy in future as a function of previous events. Formally, a strategy is a function taking a finite dataset $\vec{z} = (z_1, \ldots, z_n) \in Z^n$ to a predictor in $O$. Let $\Psi_n = \{Z^n \to O\}$ denote the set of strategies on datasets of size $n$.

Example 1 (empirical risk minimization). A basic strategy is empirical risk minimization (ERM), which outputs the predictor that minimizes the training error:

$$\psi_{\mathrm{ERM}} : Z^n \to O : (z_1, \ldots, z_n) \mapsto \operatorname*{arg\,inf}_{f \in O} \frac{1}{n} \sum_{t=1}^{n} \ell(f, z_t).$$

Following [Abernethy et al. 2009], we formulate learnability via a game played between Forecaster and Nature. Forecaster picks a strategy $\psi \in \Psi_n$. Nature observes Forecaster's strategy and responds by choosing a distribution $P_Z \in \Delta Z$ on events.
The value of the game is the generalization error of Forecaster's strategy on Nature's probability distribution: the difference between the predictive errors Forecaster's strategy accumulates and the explanatory errors of the theory's best predictor, judged after observing the distribution. Formally, the value of the game is the difference between the risk $R_{\{\psi(\vec{z})\}}(P_Z)$ of the strategy $\psi(\vec{z})$ and the risk $R_O(P_Z)$ of the entire theory $O$.

Forecaster aims to minimize the value; Nature aims for the opposite. The minimax value is thus

$$V^{\mathrm{SLT}}_n(O) := \underbrace{\inf_{\psi \in \Psi_n} \sup_{P_Z \in \Delta Z} \Big[ \mathbb{E}_{\vec{z} \sim P_Z} \mathbb{E}_{z' \sim P_Z}\, \ell\big(\psi(\vec{z}), z'\big) - \inf_{f \in O} \mathbb{E}_{z' \sim P_Z}\, \ell(f, z') \Big]}_{\text{expected worst-case generalization error of Forecaster's best strategy}}$$

More concisely,

Definition B (predictive risk, learnability; SLT). The minimax value of the game, or the predictive risk of theory $O$ on datasets of size $n$, is

$$V^{\mathrm{SLT}}_n(O) = \underbrace{\inf_{\psi \in \Psi_n}}_{\text{Forecaster's best strategy}} \; \overbrace{\sup_{P_Z \in \Delta Z}}^{\text{Nature's worst distribution}} \; \underbrace{\Big[ \mathbb{E}_{\vec{z} \sim P_Z} R^{\mathrm{SLT}}_{\psi(\vec{z})}(P_Z) - R^{\mathrm{SLT}}_O(P_Z) \Big]}_{\text{strategy's generalization error on } P_Z}, \tag{B}$$

that is, the generalization error of Forecaster's best strategy when exposed to Nature's worst (for Forecaster) sequence of events. Theory $O$ is learnable if $\lim_{n \to \infty} V_n(O) = 0$.

The predictive risk is the cost to Forecaster of not knowing what Nature will do next. It is measured against a baseline: Forecaster's best explanation of the entire sequence. The predictive risk thus separates the costs incurred due to predicting from the costs incurred due to having a theory that does not fit reality perfectly. If theory $O$ is learnable then, for large $n$, the cumulative cost to Forecaster of not knowing what Nature will do next is negligible.

Importantly, the predictive risk says nothing about the absolute performance of Forecaster's strategy. A theory may have low predictive risk and still predict a particular sequence of events badly since the baseline – the cost of using a theory that does not fit reality – is subtracted.

2.3.
Falsifiability (SLT)

A theory is falsifiable to the extent that there are hypotheses that it cannot explain. We quantify falsifiability in two ways.

Definition C (falsifiability, SLT). Let $Q_{O,\vec{x}}$ denote the $R^{\mathrm{SLT}}_{O,\vec{x}}$-induced distribution on $I$. The soft falsifiability of $O$ on $\vec{x}$ is the expected error

$$F^{\mathrm{SLT}}_n(O|\vec{x}) := 2\, \mathbb{E}_{\varepsilon \sim Q_{O,\vec{x}}} [\varepsilon] \quad\text{and}\quad F^{\mathrm{SLT}}_n(O) := \inf_{\vec{x} \in X^n} F^{\mathrm{SLT}}_n(O|\vec{x}). \tag{C-s}$$

The hard falsifiability of $O$ on $\vec{x}$ is

$$G^{\mathrm{SLT}}_n(O|\vec{x}) := \frac{1}{n}\, \mathrm{Gain}\big(R^{\mathrm{SLT}}_{O,\vec{x}}, 0\big) \quad\text{and}\quad G^{\mathrm{SLT}}_n(O) := \inf_{\vec{x} \in X^n} G^{\mathrm{SLT}}_n(O|\vec{x}). \tag{C-h}$$

A theory is falsifiable if $\lim_{n \to \infty} F_n(O) = 1$ or $\lim_{n \to \infty} G_n(O) = 1$.

Remark 2 (falsifiability depends on the risk). Falsifiability is a property of the risk $R_{O,\vec{x}} : H_{\mathrm{ef}} \to I$. It depends directly on the optimization problem underlying the learning scenario. In contrast, capacity measures are typically presented as properties of the theory $O$ in such a way that their relation to the optimization problem (specifically, finding the predictor in $O$ that minimizes the error) is indirect.

Taking the infimum over all possible datasets implies that $F^{\mathrm{SLT}}_n(O)$ and $G^{\mathrm{SLT}}_n(O)$ measure worst-case falsifiability: the falsifiability of $O$ on the least falsifiable input sequence.

Soft falsifiability is closely related to Rademacher complexity, see Section 2.5. Similarly, hard falsifiability is closely related to the covering number, and so to the shattering coefficient and VC-dimension. The coefficients $2$ and $\frac{1}{n}$ in Definition C are chosen so that the following holds.

Lemma 4. Soft and hard falsifiability take values in the interval $I = [0, 1]$.
(1) Theory $O$ shatters $\{x_1, \ldots, x_n\}$ if and only if $F^{\mathrm{SLT}}_n(O|\vec{x}) = G^{\mathrm{SLT}}_n(O|\vec{x}) = 0$.
(2) Theory $O$ contains a single predictor if and only if $F^{\mathrm{SLT}}_n(O) = G^{\mathrm{SLT}}_n(O) = 1$ for all $n$.

Proof. Straightforward.

To interpret soft falsifiability, recall that the risk, (A), is a function that takes an effective hypothesis $\sigma$ about $\vec{x}$ to the fraction $\varepsilon$ of the sequence that theory $O$ cannot explain (i.e.
falsifies)

$$R^{\mathrm{SLT}}_{O,\vec{x}} : H_{\mathrm{ef}} \to I : \sigma \mapsto \varepsilon.$$

The pre-image $R^{-1}_{O,\vec{x}}(\varepsilon) \subset H$ is the subset of hypotheses that, when applied to input sequence $\vec{x}$, cannot be explained by theory $O$ on fraction $\varepsilon$ of $\vec{x}$. Thus, the risk-induced probability of $\varepsilon \in I$ is the fraction of potential hypotheses that, if true, cause $O$ to falsify $\varepsilon$ of the data:

$$Q(\varepsilon) = \frac{|R^{-1}_{O,\vec{x}}(\varepsilon)|}{|H_{\mathrm{ef}}|}. \tag{1}$$

Finally, soft falsifiability is the weighted sum:

$$F^{\mathrm{SLT}}(O|\vec{x}) = 2 \sum_{\varepsilon \in I} \left( \frac{|R^{-1}_{O,\vec{x}}(\varepsilon)|}{|H_{\mathrm{ef}}|} \cdot \varepsilon \right) = 2 \sum_{\varepsilon \in I} \{\text{fraction of effective hypotheses falsified}\} \cdot \{\text{on fraction } \varepsilon \text{ of data}\}.$$

To interpret hard falsifiability, apply Lemma 2 to obtain

$$\mathrm{Gain}(R_{O,\vec{x}}, 0) = -\log Q(0) = \overbrace{\log |H_{\mathrm{ef}}|}^{\text{total \# effective hypotheses}} - \overbrace{\log \big| R^{-1}_{O,\vec{x}}(0) \big|}^{\text{\# hypotheses } O \text{ explains perfectly}} = \{\text{log-\# of effective hypotheses that } O \text{ falsifies}\}.$$

If the inputs in $\vec{x}$ are distinct, then the number of effective hypotheses is $2^n$, so

$$G^{\mathrm{SLT}}_n(O|\vec{x}) = \frac{\{\text{log-\# of effective hypotheses that } O \text{ falsifies}\}}{\log \{\text{\# of effective hypotheses}\}}$$

can be interpreted as the "logarithmic fraction" of effective hypotheses that $O$ falsifies.

2.4. Falsifiable ⟹ Learnable (SLT)

The main result is that falsifiability controls predictive risk:

Theorem D (main theorem, SLT).

$$V^{\mathrm{SLT}}_n(O) \le 1 - F^{\mathrm{SLT}}_n(O) \le d \sqrt{1 - G^{\mathrm{SLT}}_n(O)}, \tag{D}$$

where $d = \sqrt{8}$.

Surprisingly, the assumption that Nature is i.i.d. is not essential to the result: an almost identical theorem holds for sequential prediction, see section 3.

Proof. By Proposition 6, soft falsifiability of a theory is essentially equivalent to its Rademacher complexity

$$F^{\mathrm{SLT}}(O|\vec{x}) = 1 - 2\, \mathrm{Radem}^{\mathrm{SLT}}\big(\ell(O)|\vec{x}\big).$$

Similarly, by Proposition 7, hard falsifiability recovers the covering number

$$G^{\mathrm{SLT}}(O|\vec{x}) = 1 - \frac{\log \mathrm{Cover}^{\mathrm{SLT}}(O|\vec{x})}{n}.$$

The result then follows by Theorem 8, which recalls two standard generalization bounds taken from [Rakhlin et al. 2014].

Remark 3 (vacuous bounds).
Two ways in which Theorem D can be vacuous are
(1) If a theory is completely unfalsifiable, $F_n(O) = 0$, then Theorem D provides no guarantees on its predictive performance no matter how well it explains empirical data.
(2) If a theory is maximally falsifiable, $F_n(O) = 1$, then it has zero predictive risk, no matter how badly it explains empirical data.

Corollary D' (falsifiability implies learnability, SLT). A theory is learnable if it is falsifiable:

$$\lim_{n \to \infty} V_n(O) = 0 \quad\text{if}\quad \lim_{n \to \infty} F_n(O) = 1 \;\text{ or }\; \lim_{n \to \infty} G_n(O) = 1.$$

A much stronger version of Theorem D can also be shown.

Theorem D" (data-dependent bounds, SLT). Let

$$V^{\mathrm{SLT}}_n(O|\vec{z}, P) := \overbrace{\underbrace{R^{\mathrm{SLT}}_{\psi_{\mathrm{ERM}}(\vec{z})}(P)}_{\text{expected test error}} - \underbrace{R^{\mathrm{SLT}}_{\psi_{\mathrm{ERM}}(\vec{z})}(\vec{z})}_{\text{training error}}}^{\text{expected generalization error}}$$

be the expected generalization error of a predictor chosen using ERM. Suppose that $\vec{z}$ is a sequence of $n$ events drawn from probability distribution $P$ on $Z$. Let $\vec{x}$ refer to the same sequence, with labels stripped out. Then, for all $\delta > 0$, with probability at least $1 - \delta$,

(1) the expected generalization error is upper bounded by

$$V^{\mathrm{SLT}}_n(O|\vec{z}, P) \le 1 - F^{\mathrm{SLT}}(O|\vec{x}) + c \sqrt{\frac{1 - \log \delta}{n}}, \tag{D"-s}$$

where $c = \sqrt{\frac{2}{\log e}}$.

(2) Furthermore,

$$V^{\mathrm{SLT}}_n(O|\vec{z}, P) \le d_1 \sqrt{1 - G^{\mathrm{SLT}}(O|\vec{x})} + d_2 \sqrt{\frac{1 - \log \delta}{n}}, \tag{D"-h}$$

where $d_1 = \sqrt{\frac{6}{\log e}}$ and $d_2 = \sqrt{\frac{1}{\log e}}$.

Proof. Propositions 6 and 7 connect soft and hard falsifiability to the Rademacher complexity and covering number. The result then follows from Theorem 9, which collects two theorems from [Boucheron et al. 2000] and [Bousquet et al. 2004].

Theorem D" is a true inductive bound, which requires the i.i.d. assumption. It implies that the difference between the observed training error and expected test error depends on how many hypotheses about the training sequence $\vec{x}$ are falsified by theory $O$.

In short, if strategy $\psi_{\mathrm{ERM}}$ performs well on the training data, and theory $O$ falsifies many hypotheses about the training data, then the predictor chosen by $\psi_{\mathrm{ERM}}$ will perform well in future, with high probability.

2.5.
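Proofs (SLT)

Our first two results relate soft falsifiability to Rademacher complexity [Koltchinskii 2001].

Definition 3 (Rademacher complexity). Define a Rademacher variable $\zeta$ to be a random variable taking values in $\Omega = \{\pm 1\}$ with equal probability. Let $\vec{\zeta} = (\zeta_1, \ldots, \zeta_n)$ be Rademacher variables. The Rademacher complexity of theory $O$ on unlabeled inputs $\vec{x} = (x_1, \ldots, x_n)$ is

$$\mathrm{Radem}^{\mathrm{SLT}}(O|\vec{x}) := \mathbb{E}_{\vec{\zeta}} \left[ \sup_{f \in O} \frac{1}{n} \sum_{t=1}^{n} \zeta_t \cdot f(x_t) \right].$$

The Rademacher complexity of a theory with respect to a loss function is

$$\mathrm{Radem}^{\mathrm{SLT}}\big(\ell(O)|\vec{z}\big) := \mathbb{E}_{\vec{\zeta}} \left[ \sup_{f \in O} \frac{1}{n} \sum_{t=1}^{n} \zeta_t \cdot \ell\big(f, (x_t, y_t)\big) \right].$$

Lemma 5.

$$\mathbb{E}_{\vec{\zeta}}\, R_O\big(\vec{x}, \vec{\zeta} \cdot \vec{y}\big) = \frac{1}{2} - \mathrm{Radem}^{\mathrm{SLT}}\big(\ell(O) \,\big|\, \vec{z}\big) = \frac{1}{2} - \frac{1}{2}\, \mathrm{Radem}^{\mathrm{SLT}}\big(O \,\big|\, \vec{z}\big).$$

Proof. For the first equality, observe that

$$\zeta \cdot \big(1 - 2\ell(f, z)\big) = \begin{cases} +1 & \text{if } f(x) = \zeta \cdot y \\ -1 & \text{else,} \end{cases}$$

which implies

$$\frac{1}{2} - \zeta \cdot \Big( \frac{1}{2} - \ell(f, z) \Big) = \ell\big(f, (x, \zeta \cdot y)\big).$$

It follows from $\inf_{f \in O} [-\psi(f)] = -\sup_{f \in O} \psi(f)$, together with $\mathbb{E}\, \zeta_t = 0$ and the symmetry of $\vec{\zeta}$, that

$$\mathbb{E}_{\vec{\zeta}}\, R_O\big(\vec{x}, \vec{\zeta} \cdot \vec{y}\big) = \mathbb{E}_{\vec{\zeta}} \inf_{f \in O} \frac{1}{n} \sum_{t=1}^{n} \ell\big(f, (x_t, \zeta_t \cdot y_t)\big) = \mathbb{E}_{\vec{\zeta}} \inf_{f \in O} \frac{1}{n} \sum_{t=1}^{n} \Big[ \frac{1}{2} - \zeta_t \Big( \frac{1}{2} - \ell(f, z_t) \Big) \Big]$$

$$= \frac{1}{2} - \mathbb{E}_{\vec{\zeta}} \sup_{f \in O} \frac{1}{n} \sum_{t=1}^{n} \zeta_t \cdot \ell(f, z_t) = \frac{1}{2} - \mathrm{Radem}^{\mathrm{SLT}}\big(\ell(O) \,|\, \vec{z}\big).$$

The second equality follows similarly.

A corollary of Lemma 5 is that Rademacher complexity is independent of the labels $\vec{y}$. We therefore drop the labels from the notation and write $\mathrm{Radem}^{\mathrm{SLT}}(O|\vec{x})$ and $\mathrm{Radem}^{\mathrm{SLT}}(\ell(O)|\vec{x})$ below.

Proposition 6 (Rademacher complexity from soft falsifiability, SLT).

$$\frac{1}{2} F^{\mathrm{SLT}}(O \,|\, \vec{x}) = \frac{1}{2} - \mathrm{Radem}^{\mathrm{SLT}}\big(\ell(O) \,|\, \vec{x}\big) = \frac{1}{2} - \frac{1}{2}\, \mathrm{Radem}^{\mathrm{SLT}}(O \,|\, \vec{x}).$$

Proof. Recall that $F^{\mathrm{SLT}}(O|\vec{x}) := 2\, \mathbb{E}_{\varepsilon \sim Q}[\varepsilon]$ where $Q$ is the $R^{\mathrm{SLT}}_{O,\vec{x}}$-induced distribution on $I$. The induced distribution is

$$Q(\varepsilon) = \begin{cases} \dfrac{|R^{-1}_{O,\vec{x}}(\varepsilon)|}{|H_{\mathrm{ef}}|} & \text{if } \varepsilon \in R_{O,\vec{x}}(Y^X) \\ 0 & \text{else.} \end{cases}$$

By Lemma 5 it suffices to show that $\mathbb{E}_{\vec{\zeta}}\, R_O(\vec{x}, \vec{\zeta} \cdot \vec{y}) = \mathbb{E}_{\varepsilon \sim Q}[\varepsilon]$. Observe that

$$\mathbb{E}_{\vec{\zeta}}\, R_O\big(\vec{x}, \vec{\zeta} \cdot \vec{y}\big) = \sum_{[\sigma] \in H_{\mathrm{ef}}} \frac{R_O(\vec{x}, \sigma \circ \vec{x})}{|H_{\mathrm{ef}}|} = \sum_{\varepsilon \in \mathrm{im}(R_{O,\vec{x}})} \varepsilon \cdot \frac{|R^{-1}_{O,\vec{x}}(\varepsilon)|}{|H_{\mathrm{ef}}|} = \mathbb{E}_{\varepsilon \sim Q}[\varepsilon],$$

as required.

Next, we relate hard falsifiability to the covering number.

Definition 4 (covering number, SLT).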
Given unlabeled data $\vec{x} = (x_1, \ldots, x_n) \in X^n$ and a theory $O \subset Y^X$, let $q$ denote the map

$$q_{\vec{x}} : O \to \mathbb{R}^n : f \mapsto \big( f(x_1), \ldots, f(x_n) \big)$$

taking predictors to labels. The covering number of $O$ on $\vec{x}$ is

$$\mathrm{Cover}^{\mathrm{SLT}}(O|\vec{x}) := |q_{\vec{x}}(O)|,$$

the number of distinct labellings produced by the predictors in $O$ applied to $x_1, \ldots, x_n$.

The shattering coefficient and VC-dimension are discussed in Section 3.6, see Definition 8. The covering number coincides with hard falsifiability:

Proposition 7 (covering number from hard falsifiability, SLT). The hard falsifiability of theory $O$ on $\vec{x}$ is

$$G^{\mathrm{SLT}}(O|\vec{x}) = 1 - \frac{1}{n} \log \mathrm{Cover}^{\mathrm{SLT}}(O|\vec{x}).$$

Proof. By definition, $\mathrm{Gain}(R_{O,\vec{x}}, 0) = -\log \frac{|R^{-1}_{O,\vec{x}}(0)|}{|H_{\mathrm{ef}}|}$. Since the sample contains $n$ distinct points and $|Y| = 2$, it follows that $\log |H_{\mathrm{ef}}| = n$. It is easy to check that $|q_{\vec{x}}(O)| = |R^{-1}_{O,\vec{x}}(0)|$.

Theorem 8 (Data-independent bounds in expectation). Let

$$\mathrm{Radem}^{\mathrm{SLT}}_n\big(\ell(O)\big) := \sup_{P \in \Delta Z} \mathbb{E}_{\vec{z} \sim P}\, \mathrm{Radem}^{\mathrm{SLT}}\big(\ell(O) \,\big|\, \vec{z}\big),$$

where $\mathrm{len}(\vec{z}) = n$. Then

$$V^{\mathrm{SLT}}_n(O) \le 2\, \mathrm{Radem}^{\mathrm{SLT}}_n\big(\ell(O)\big) \le 2 \sqrt{\frac{2 \log \mathrm{Cover}^{\mathrm{SLT}}_n(O)}{n}}.$$

(The logarithm inside the square root is required for agreement with Theorem D and Proposition 7.)

Proof. [Rakhlin and Sridharan 2014].

Theorem 9 (Data-dependent bounds with high probability). For all $\delta > 0$, the following bounds hold with probability at least $1 - \delta$,

(1) The predictive risk is upper bounded by

$$V^{\mathrm{SLT}}_n(O|\vec{z}) \le 2\, \mathrm{Radem}^{\mathrm{SLT}}\big(\ell(O) \,\big|\, \vec{x}\big) + c \sqrt{\frac{1 - \log \delta}{n}},$$

where $c = \sqrt{\frac{2}{\log e}}$.

(2) Furthermore,

$$V^{\mathrm{SLT}}_n(O|\vec{z}) \le d_1 \sqrt{\frac{\log \mathrm{Cover}^{\mathrm{SLT}}(O|\vec{x})}{n}} + d_2 \sqrt{\frac{1 - \log \delta}{n}},$$

where $d_1 = \sqrt{\frac{6}{\log e}}$ and $d_2 = \sqrt{\frac{1}{\log e}}$.

Proof. [Bousquet et al. 2004] and [Boucheron et al. 2000].

3. SEQUENTIAL PREDICTION

Sequential prediction is concerned with predicting a finite sequence of binary observations, without any assumptions on how the observations are generated. The i.i.d. assumption of statistical learning is replaced by an adversary that observes Forecaster's previous moves and responds maliciously.

We build on the presentation in section 2.
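As a reminder of the statistical-learning quantities that this section builds on, the following brute-force sketch computes soft and hard falsifiability for a toy theory and checks them against the Rademacher complexity of the loss class (Proposition 6) and the covering number (Proposition 7). It is an illustration under stated assumptions rather than the paper's own procedure: a theory is represented, hypothetically, by the list of label tuples its predictors produce on a fixed sample of n distinct inputs, logarithms are base 2, and all function names are invented for this example. Enumeration is exponential in n, so it is only meant for very small samples.

```python
from fractions import Fraction
from itertools import product
from math import log2


def risk(theory, labels):
    """Fraction of the sample that the best predictor in the theory mislabels."""
    n = len(labels)
    return min(
        Fraction(sum(p != y for p, y in zip(f, labels)), n) for f in theory
    )


def soft_falsifiability(theory, n):
    """Twice the mean risk over all 2^n effective hypotheses (labelings)."""
    labelings = list(product((0, 1), repeat=n))
    return 2 * sum(risk(theory, s) for s in labelings) / len(labelings)


def hard_falsifiability(theory, n):
    """(1/n) * Gain(risk, 0): minus log2 of the fraction of labelings explained."""
    labelings = list(product((0, 1), repeat=n))
    explained = sum(1 for s in labelings if risk(theory, s) == 0)
    return -log2(explained / len(labelings)) / n


def covering_number(theory):
    """Number of distinct labelings the theory produces on the sample."""
    return len(set(theory))


def rademacher_of_loss(theory, labels):
    """Rademacher complexity of the 0/1 loss class, by enumerating all signs."""
    n = len(labels)
    signs = list(product((-1, 1), repeat=n))
    total = sum(
        max(
            Fraction(sum(z * (p != y) for z, p, y in zip(zeta, f, labels)), n)
            for f in theory
        )
        for zeta in signs
    )
    return total / len(signs)


if __name__ == "__main__":
    n = 4
    # Hypothetical theory: threshold predictors on 4 ordered points.
    thresholds = [tuple(int(i >= k) for i in range(n)) for k in range(n + 1)]
    F = soft_falsifiability(thresholds, n)
    G = hard_falsifiability(thresholds, n)
    print("soft falsifiability:", F)
    print("hard falsifiability:", G)
    # Proposition 6: F = 1 - 2 * Radem(loss class), for any choice of labels.
    print(F == 1 - 2 * rademacher_of_loss(thresholds, (0, 1, 1, 0)))
    # Proposition 7: G = 1 - log2(Cover) / n.
    print(abs(G - (1 - log2(covering_number(thresholds)) / n)) < 1e-12)
```

Exact rational arithmetic is used for the soft quantities so that the identity in Proposition 6 can be checked with equality rather than a tolerance.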
The key technical difference between statistical learning and sequential prediction is the introduction of trees, which requires us to distinguish between two notions of risk: soft and hard.

Remarkably, the main theorem has an almost identical form in both sequential prediction and statistical learning. However, the stronger data-dependent form, Theorem D", no longer holds, see discussion in section 5.

3.0. Setup

We introduce some useful notation from [Rakhlin et al. 2014].

Definition 5 (trees; paths). Let $\Omega = \{-1, +1\}$. A $Z$-valued tree of depth $n$ is an $n$-tuple $\mathbf{z} = (\mathbf{z}_1, \ldots, \mathbf{z}_n)$ of functions $\mathbf{z}_t : \Omega^{t-1} \to Z$. Trees are denoted with boldface.

A path is an element $\vec{\omega} = (\omega_1, \ldots, \omega_n) \in \Omega^n$. Combining a path $\vec{\omega}$ with a tree $\mathbf{z}$ yields a sequence $\mathbf{z}(\vec{\omega}) = (\mathbf{z}_1, \mathbf{z}_2(\omega_1), \ldots, \mathbf{z}_n(\omega_{1:n-1}))$ of elements in $Z$.

It will be convenient to use the shorthand $\mathbf{X}_t := X^{\Omega^t} = \{\mathbf{x}_t : \Omega^t \to X\}$. Let $\mathbf{X}^\bullet = \bigcup_{t=1}^{\infty} \mathbf{X}_t$ denote the set of all $X$-valued trees.

3.1. The risk (SEQ)

We assume throughout this section that $\mathbf{x}$ contains a path with $n$ distinct points.

Definition A (risk, SEQ). Let $H = Y^X = \{\sigma : X \to Y\}$ denote the set of hypotheses on $X$. The risk for sequential prediction is

$$R^{\mathrm{SEQ}}_O : H \times (\Omega \times \mathbf{X})^\bullet \to I : (\sigma, \vec{\omega}, \mathbf{x}) \mapsto \inf_{f \in O} \frac{1}{n} \sum_{t=1}^{n} \ell\Big( f, \mathbf{x}_t(\omega_{1:t-1}), \sigma\big( \mathbf{x}_t(\omega_{1:t-1}) \big) \Big),$$

where $n = \mathrm{len}(\vec{\omega}) = \mathrm{len}(\mathbf{x})$.

The risk for sequential prediction differs from statistical learning in that the inputs are trees, not elements, and the choice of path in $\Omega^n$ is an additional degree of freedom. There are two obvious ways to deal with paths:

(1) Incorporate paths into the input by defining $\tilde{\mathbf{X}}_n := \Omega^n \times \mathbf{X}_n$. Given an $X$-valued tree $\mathbf{x} = (\mathbf{x}_1, \ldots, \mathbf{x}_n)$ and a path $\vec{\omega} \in \Omega^n$, we say that two hypotheses $\sigma$ and $\tau$ in $H$ are equivalent, $\sigma \sim \tau$, iff

$$\sigma\big( \mathbf{x}_t(\omega_{1:t-1}) \big) = \tau\big( \mathbf{x}_t(\omega_{1:t-1}) \big) \quad \forall t \in \{1, \ldots, n\}.$$

Define the soft risk,

$$R^{\mathrm{SEQ}}_{O,(\vec{\omega},\mathbf{x})} : H_{\mathrm{ef}} \to I : \sigma \mapsto \inf_{f \in O} \left[ \frac{1}{n} \sum_{t=1}^{n} \ell\Big( f, \mathbf{x}_t(\omega_{1:t-1}), \sigma\big( \mathbf{x}_t(\omega_{1:t-1}) \big) \Big) \right]. \tag{A-s}$$

(2) Incorporate paths into the hypotheses by defining $\tilde{H} := H \times \Omega^n$.
Similarly, two hypotheses $(\sigma, \vec{\omega})$ and $(\tau, \vec{\rho})$ in $\tilde{H} = H \times \Omega^n$ are equivalent, $(\sigma, \vec{\omega}) \sim (\tau, \vec{\rho})$, iff

$$\sigma\big( \mathbf{x}_t(\omega_{1:t-1}) \big) = \tau\big( \mathbf{x}_t(\rho_{1:t-1}) \big) \quad \forall t \in \{1, \ldots, n\}.$$

Let $\tilde{O} = O \times \Omega^n$ and define the hard risk,

$$R^{\mathrm{SEQ}}_{\tilde{O},\mathbf{x}} : H_{\mathrm{ef}} \to I : (\sigma, \vec{\rho}) \mapsto \inf_{(f, \vec{\omega}) \in \tilde{O}} \left[ \frac{1}{n} \sum_{t=1}^{n} \ell\Big( f, \mathbf{x}_t(\omega_{1:t-1}), \sigma\big( \mathbf{x}_t(\rho_{1:t-1}) \big) \Big) \right]. \tag{A-h}$$

3.2. Learnability (SEQ)

Consider the following game played between Forecaster and Nature over $n$ rounds [Abernethy et al. 2009; Rakhlin et al. 2014]. In the first round, Forecaster chooses a probability distribution $P_1 \in \Delta O$ on the set of predictors. Nature observes Forecaster's choice, and picks $z_1 \in Z$. A predictor $f_1$ is then sampled at random from $P_1$, applied to $z_1$ and the loss $\ell(f_1, z_1)$ is computed. The game continues for $n$ rounds, where both Forecaster and Nature observe the moves played in previous rounds.

The value of the game is Forecaster's regret: the difference between Forecaster's cumulative loss and the loss Forecaster would have accumulated, had it played the best move in hindsight. Forecaster's goal is to minimize its regret; Nature aims for the opposite:

$$V^{\mathrm{SEQ}}_n(O) = \inf_{P_1 \in \Delta O} \sup_{z_1 \in Z} \mathbb{E}_{f_1 \sim P_1} \cdots \inf_{P_n \in \Delta O} \sup_{z_n \in Z} \mathbb{E}_{f_n \sim P_n} \underbrace{\frac{1}{n} \left[ \sum_{t=1}^{n} \ell(f_t, z_t) - \inf_{f \in O} \sum_{t=1}^{n} \ell(f, z_t) \right]}_{\text{Forecaster's regret}}$$

Forecaster's move at time $t$ depends on the prior moves by Forecaster and Nature. Forecaster's strategy at time $t$ can be expressed as a function $\psi_t : Z^{t-1} \to O$. Let $\Psi_t = \{\psi_t : Z^{t-1} \to O\}$ denote the strategies available to Forecaster at time $t$, and let $\Psi = \prod_{t=1}^{n} \Psi_t$ denote the strategies available to Forecaster over an $n$-round game.

Similarly, Nature's strategy at time $t$ is an element of $\Xi_t = \{O^{t-1} \times \Delta O \to Z\}$. Let $\Xi = \prod_{t=1}^{n} \Xi_t$ denote the $n$-round strategies available to Nature. We can write the minimax value more compactly as

$$V^{\mathrm{SEQ}}_n(O) = \inf_{P \in \Delta \Psi} \sup_{\vec{\xi} \in \Xi} \mathbb{E}_{\vec{\psi} \sim P}\, \frac{1}{n} \left[ \sum_{t=1}^{n} \ell\big( \psi_t(\xi_{1:t-1}), \xi_t(\psi_{1:t-1}, P_t) \big) - \inf_{f \in O} \sum_{t=1}^{n} \ell\big( f, \xi_t(\psi_{1:t-1}, P_t) \big) \right],$$

where the sup and inf are understood to unravel recursively as above.
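To make the regret inside the minimax expression concrete, the sketch below plays out one fixed sequence against a deterministic strategy. It does not compute the minimax value itself. The strategy used, follow-the-leader (ERM on the history, with ties broken by list order), and all function names are choices made for this illustration and are not taken from the paper.

```python
def zero_one_loss(f, z):
    """0/1 loss of predictor f on datum z = (x, y)."""
    x, y = z
    return int(f(x) != y)


def follow_the_leader(theory, history):
    """Pick the predictor with the fewest mistakes so far (first one on ties)."""
    return min(theory, key=lambda f: sum(zero_one_loss(f, z) for z in history))


def regret(strategy, theory, data):
    """Normalized regret: (cumulative loss - best loss in hindsight) / n."""
    n = len(data)
    cumulative = sum(
        zero_one_loss(strategy(theory, data[:t]), data[t]) for t in range(n)
    )
    hindsight = min(sum(zero_one_loss(f, z) for z in data) for f in theory)
    return (cumulative - hindsight) / n


if __name__ == "__main__":
    always_zero = lambda x: 0
    always_one = lambda x: 1
    theory = [always_zero, always_one]   # hypothetical two-predictor theory
    # Labels alternate 1, 0, 1, 0: each round contradicts the current leader.
    data = [(t, (t + 1) % 2) for t in range(4)]
    print(regret(follow_the_leader, theory, data))
```

On this sequence the deterministic strategy errs in every round while each constant predictor errs in only half of them, so the normalized regret is 1/2. Sequences of this kind are one motivation for letting Forecaster play distributions over predictors rather than a single predictor, as in the game above.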
Finally, substituting in the risk yields

Definition B (predictive risk, SEQ). The minimax value of an $n$-round game, or predictive risk of theory $\mathcal{O}$, is
\[ V^{SEQ}_n(\mathcal{O}) = \inf_{P\in\Delta\Psi} \sup_{\vec{\xi}\in\Xi} \mathbb{E}_{\vec{\psi}\sim P} \left[ R^{SEQ}_{\vec{\psi}}(\vec{\xi}) - R^{SEQ}_{\mathcal{O}}(\vec{\xi}) \right]. \tag{B} \]
Theory $\mathcal{O}$ is learnable if $\lim_{n\to\infty} V^{SEQ}_n(\mathcal{O}) = 0$.

The first term, $R_{\vec{\psi}}(\vec{\xi})$, is the cumulative loss incurred by the best $\mathcal{O}$-based strategy played out on Nature's sequence of moves $\vec{\xi}$. The comparator term, $R_{\mathcal{O}}(\vec{\xi})$, is the performance of the best predictor in $\mathcal{O}$, taken in hindsight.

3.3. Falsifiability (SEQ)

We use the soft and hard risk to define soft and hard falsifiability:

Definition C (falsifiability, SEQ). Let $Q_{\mathcal{O},(\vec{\omega},\vec{x})}$ be the $R^{SEQ}_{\mathcal{O},(\vec{\omega},\vec{x})}$-induced distribution on $I$. The soft falsifiability of theory $\mathcal{O}$ on $\vec{x}$ is the expected error of the soft risk
\[ F^{SEQ}_n(\mathcal{O}|\vec{x}) := 2\, \mathbb{E}_{\vec{\omega}\sim P_{\mathrm{unif}}(\Omega^n)}\, \mathbb{E}_{\varepsilon\sim Q_{\mathcal{O},(\vec{\omega},\vec{x})}} [\varepsilon] \quad\text{and}\quad F^{SEQ}_n(\mathcal{O}) := \inf_{\vec{x}\in X} F^{SEQ}_n(\mathcal{O}|\vec{x}). \tag{C-s} \]
The hard falsifiability of theory $\mathcal{O}$ on $\vec{x}$ is the information gain from the hard risk
\[ G^{SEQ}_n(\mathcal{O}|\vec{x}) := \frac{1}{n} \mathrm{Gain}\big(R^{SEQ}_{\mathcal{O}\times\Omega^n,\vec{x}}, 0\big) \quad\text{and}\quad G^{SEQ}_n(\mathcal{O}) := \inf_{\vec{x}\in X} G^{SEQ}_n(\mathcal{O}|\vec{x}). \tag{C-h} \]
A theory is falsifiable if $\lim_{n\to\infty} F_n(\mathcal{O}) = 1$ or $\lim_{n\to\infty} G_n(\mathcal{O}) = 1$.

Hard falsifiability is closely related to the sequential covering number introduced in [Rakhlin et al. 2014]. However, the definition is more intuitive and, importantly, it also leads to combinatorial bounds such as the Littlestone dimension; see Section 3.6 for details.

3.4. Falsifiable =⇒ Learnable (SEQ)

Finally, we obtain the main theorem for sequential prediction, which is an exact analog of the corresponding theorem for statistical learning:

Theorem D (main theorem, SEQ).
\[ V^{SEQ}_n(\mathcal{O}) \le 1 - F^{SEQ}_n(\mathcal{O}) \le c \sqrt{1 - G^{SEQ}_n(\mathcal{O})} \tag{D} \]
where $c = \sqrt{8}$.

An important point is that hard falsifiability provides a non-vacuous upper bound for the zero-covering number; see Section 3.6.

Proof. By Proposition 10, soft falsifiability is equivalent to the sequential Rademacher complexity:
\[ F^{SEQ}(\mathcal{O}|\vec{x}) = 1 - 2\, \mathrm{Radem}^{SEQ}\big(\ell(\mathcal{O})|\vec{x}\big). \]
The first inequality then follows from Theorem 11, taken from [Rakhlin et al. 2014]. By Lemma 12 and Proposition 13, hard falsifiability can be used to upper bound the sequential zero-covering number:
\[ \frac{\log \mathrm{Cover}^{SEQ}(\mathcal{O}|\vec{x})}{n} \le 1 - G^{SEQ}(\mathcal{O}|\vec{x}). \]
The second inequality then follows from Theorem 14, also taken from [Rakhlin et al. 2014].

Corollary D' (falsifiability implies learnability, SEQ). A theory is learnable if it is falsifiable:
\[ \lim_{n\to\infty} V_n(\mathcal{O}) = 0 \quad\text{if}\quad \lim_{n\to\infty} F_n(\mathcal{O}) = 1 \ \text{ or } \ \lim_{n\to\infty} G_n(\mathcal{O}) = 1. \]

3.5. Proofs (SEQ)

This section proves the falsification bounds in Theorem D for sequential prediction.

Definition 6 (Sequential Rademacher complexity).
\[ \mathrm{Radem}^{SEQ}(\mathcal{O}|\vec{x}) := \mathbb{E}_{\vec{\zeta}} \left[ \sup_{f\in\mathcal{O}} \frac{1}{n} \sum_{t=1}^{n} \zeta_t f\big(\mathbf{x}_t(\zeta_{1:t-1})\big) \right] \]

Proposition 10 (Rademacher complexity from induced distribution, SEQ). Let $Q_{\vec{\omega}} := P_{R^{SEQ}_{\mathcal{O},(\vec{\omega},\vec{x})}}$ be the distribution on errors in $I$ induced by the soft risk $R^{SEQ}_{\mathcal{O},(\vec{\omega},\vec{x})} : \mathcal{H} \to I$. Then,
\[ \mathrm{Radem}^{SEQ}\big(\ell(\mathcal{O}), \vec{x}\big) = \frac{1}{2} - \mathbb{E}_{\vec{\omega}\sim P_{\mathrm{unif}}(\Omega^n)}\, \mathbb{E}_{\varepsilon\sim Q_{\vec{\omega}}} [\varepsilon]. \]

Proof. As for Proposition 6.

Theorem 11. The predictive risk of sequential prediction is bounded by
\[ V^{SEQ}_n(\mathcal{O}) \le 2 \sup_{\vec{x}\in X} \mathrm{Radem}^{SEQ}\big(\ell(\mathcal{O}), \vec{x}\big), \]
where the sup is over trees of length $n$.

Proof. [Rakhlin et al. 2014].

Next, we upper bound the covering number of a tree-process. The following definition is given in [Rakhlin et al. 2014].

Definition 7 (covering number, SEQ). A zero-cover of $\mathcal{O}$ on an $X$-valued tree $\vec{x}$ is a set $V$ of $Y$-valued trees such that
\[ \forall f \in \mathcal{O},\ \forall (\omega_1,\dots,\omega_n) \in \Omega^n,\ \exists \vec{v} \in V \ \text{s.t.}\ f\big(\mathbf{x}_t(\omega_{1:t-1})\big) = \mathbf{v}_t(\omega_{1:t-1}) \quad \forall t \in \{1,\dots,n\}. \]
The covering number of $\mathcal{O}$ on $\vec{x}$ is $\mathrm{Cover}^{SEQ}(\mathcal{O}, \vec{x}) = \min\{|V| : V \text{ is a zero-cover}\}$.

The sequential covering number is awkward for our purposes since, unlike the statistical covering number in Definition 4, it is not defined as the cardinality of the image of a function. We therefore need the following

Lemma 12 (upper bound for sequential covering number). Let
\[ q_{\vec{x}} : \tilde{\mathcal{O}} \to \mathbb{R}^n : (f, \vec{\omega}) \mapsto \Big( f(\mathbf{x}_1), f\big(\mathbf{x}_2(\omega_1)\big), \dots, f\big(\mathbf{x}_n(\omega_{1:n-1})\big) \Big). \]
The covering number is upper bounded by $\mathrm{Cover}^{SEQ}(\mathcal{O}, \vec{x}) \le |q_{\vec{x}}(\tilde{\mathcal{O}})|$.

Proof. We prove the lemma by constructing a zero-cover $V_q$ of $\mathcal{O}$ on $\vec{x}$ with $|q_{\vec{x}}(\tilde{\mathcal{O}})|$ elements. Suppose the image $q_{\vec{x}}(\mathcal{O} \times \Omega^n)$ has $N$ elements, $q^1, \dots, q^N$. Define $\mathbf{v}^j_t(\omega_{1:t-1}) := q^j_t$. That is, $\mathbf{v}^j_t(\omega_{1:t-1})$ is the $t$-th element of $q^j$ for all paths in $\Omega^n$. Then, by construction, $V_q = \{\vec{v}^1, \dots, \vec{v}^N\}$ is a zero-cover of $\vec{x}$ containing $N$ elements, and we are done.

Proposition 13. $\mathrm{Gain}\big(R^{SEQ}_{\tilde{\mathcal{O}},\vec{x}}, 0\big) = n - \log |q_{\vec{x}}(\tilde{\mathcal{O}})|$.

Proof. As for Proposition 7.

Theorem 14. Let $\vec{x}$ be an $X$-valued tree of length $n$. Then,
\[ \mathrm{Radem}^{SEQ}(\mathcal{O}, \vec{x}) \le \sqrt{\frac{2 \log \mathrm{Cover}^{SEQ}(\mathcal{O}, \vec{x})}{n}} \]

Proof. [Rakhlin et al. 2014].

It follows from Lemma 12, Proposition 13 and Theorem 14 that hard falsifiability can be used to upper bound the predictive risk for sequential prediction.

3.6. A sequential-to-statistical reduction

Definition 7, of the sequential covering number, is fairly intricate and fragile. For example, slightly changing the definition by reordering the quantifiers gives a quantity that grows much too fast and yields vacuous generalization bounds [Rakhlin and Sridharan 2014]. A natural concern is therefore that the upper bound in Lemma 12 is too loose. In the remainder of this section, we show that $|q_{\vec{x}}(\tilde{\mathcal{O}})|$, and so hard falsifiability, is a useful, non-vacuous upper bound.

Definition 8 (shattering, VC and Littlestone dimensions). We have the following analogous definitions:

(1) Statistical. Theory $\mathcal{O}$ shatters input sequence $\vec{x}$ of length $n$ if
\[ \forall \vec{\omega} \in \Omega^n\ \exists f \in \mathcal{O} \ \text{s.t.}\ f(x_t) = \frac{\omega_t + 1}{2} \quad \forall t \in \{1,\dots,n\}. \]
Alternatively, $\mathcal{O}$ shatters $\vec{x}$ if $\mathrm{Cover}^{SLT}(\mathcal{O}|\vec{x}) = 2^n$. The VC-dimension is
\[ \mathrm{vc}(\mathcal{O}) := \sup \big\{ n \,\big|\, \exists \text{ input sequence } \vec{x} \text{ of length } n \text{ s.t. } \mathcal{O} \text{ shatters } \vec{x} \big\}. \]

(2) Sequential. Theory $\mathcal{O}$ SEQ-shatters tree $\vec{x}$ of length $n$ if
\[ \forall \vec{\omega} \in \Omega^n\ \exists f \in \mathcal{O} \ \text{s.t.}\ f\big(\mathbf{x}_t(\omega_{1:t-1})\big) = \frac{\omega_t + 1}{2} \quad \forall t \in \{1,\dots,n\}. \]
The Littlestone dimension is
\[ \mathrm{ldim}(\mathcal{O}) = \sup_{\vec{x}} \big\{ n \,\big|\, \exists\, X\text{-valued tree } \vec{x} \text{ of length } n \text{ s.t. } \mathcal{O} \text{ SEQ-shatters } \vec{x} \big\}. \]
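To make Definition 8 concrete, the following is a minimal brute-force sketch for finite theories on a finite input set. It is an illustration under stated assumptions, not part of the paper: predictors are encoded as 0/1 tuples indexed by input points, the function names `vc_dimension` and `littlestone_dimension` are hypothetical, and the Littlestone dimension is computed with the standard recursive characterization (a root point splits the theory by label, and the depth is one plus the smaller of the two sub-depths) rather than by enumerating trees.

```python
from functools import lru_cache
from itertools import combinations


def vc_dimension(hyps, domain):
    """Largest k such that some k-point subset of `domain` is shattered.

    Assumption: each hypothesis is a tuple h with h[x] in {0, 1} for x in domain.
    """
    best = 0
    for k in range(1, len(domain) + 1):
        shattered = any(
            len({tuple(h[x] for x in subset) for h in hyps}) == 2 ** k
            for subset in combinations(domain, k)
        )
        if not shattered:
            break  # subsets of shattered sets are shattered, so we can stop
        best = k
    return best


def littlestone_dimension(hyps, domain):
    """Depth of the deepest complete tree that is SEQ-shattered by `hyps`."""

    @lru_cache(maxsize=None)
    def ldim(theory):
        best = 0
        for x in domain:
            zeros = frozenset(h for h in theory if h[x] == 0)
            ones = theory - zeros
            if zeros and ones:  # x can serve as the root of a shattered tree
                best = max(best, 1 + min(ldim(zeros), ldim(ones)))
        return best

    return ldim(frozenset(hyps))


# Hypothetical toy theories: point indicators and thresholds.
singletons = [tuple(int(x == a) for x in range(5)) for a in range(5)]
thresholds = [tuple(int(x >= a) for x in range(7)) for a in range(8)]

print(vc_dimension(singletons, range(5)), littlestone_dimension(singletons, range(5)))
print(vc_dimension(thresholds, range(7)), littlestone_dimension(thresholds, range(7)))
```

On the threshold theory the two quantities differ (VC-dimension 1, Littlestone dimension 3), which illustrates that shattering a tree is a weaker requirement than shattering a fixed sequence of the same length.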
Let $Y^{X^\bullet} := \{\sigma : X^\bullet \to Y\}$ denote the set of hypotheses on the set $X^\bullet$ of $X$-valued trees. Given theory $\mathcal{O} \subset Y^X$, define the new theory
\[ \tilde{\mathcal{O}} := \mathcal{O} \times \Omega^\bullet \subset Y^{X^\bullet} : (f, \vec{\omega})(\mathbf{x}_t) = f\big(\mathbf{x}_t(\omega_{1:t-1})\big). \]
The lifted theory $\tilde{\mathcal{O}}$ acts on trees, which from our point of view are just another set. The statistical covering number for $\tilde{\mathcal{O}}$ is given, following Definition 4, by the function
\[ q_{\vec{x}} : \tilde{\mathcal{O}} \to \mathbb{R}^n : (f, \vec{\omega}) \mapsto \big( (f, \vec{\omega})(\mathbf{x}_1), \dots, (f, \vec{\omega})(\mathbf{x}_n) \big) \]
with $\mathrm{Cover}^{SLT}(\tilde{\mathcal{O}}|\vec{x}) = |q_{\vec{x}}(\tilde{\mathcal{O}})|$. The VC-dimension of $\tilde{\mathcal{O}}$ is then computed straightforwardly.

Proposition 15 (VC-dimension lower bounds Littlestone dimension). The Littlestone dimension of $\mathcal{O}$ is lower-bounded by the VC-dimension of the lifted theory $\tilde{\mathcal{O}} = \mathcal{O} \times \Omega^\bullet$:
\[ \mathrm{vc}(\tilde{\mathcal{O}}) \le \mathrm{ldim}(\mathcal{O}). \]

The proposition shows that the Littlestone dimension can be recovered from hard falsifiability. Thus, hard falsifiability can play the same role as the sequential covering number in reducing learning problems to combinatorial problems.

Proof. Suppose there is a tree $\vec{x}$ of length $n$ shattered by $\tilde{\mathcal{O}}$. We construct a new tree $\vec{z}$ of length $n$ that is SEQ-shattered by $\mathcal{O}$. By assumption,
\[ \forall (\omega_1,\dots,\omega_n) \in \Omega^n,\ \exists (f, \vec{b}) \in \tilde{\mathcal{O}} \ \text{s.t.}\ f\big(\mathbf{x}_t(b_{1:t-1})\big) = \frac{\omega_t + 1}{2} \quad \forall t \in \{1,\dots,n\}. \tag{2} \]
Let $\alpha$ denote the function specified by $\alpha(\omega_{1:t-1}) = b_{1:t-1}$, as in (2). Construct the new tree $\vec{z}$ by $\vec{z} = \vec{x} \circ \alpha$. It follows, by the construction of $\alpha$ and by (2), that $\forall (\omega_1,\dots,\omega_n) \in \Omega^n$, $\exists f \in \mathcal{O}$ such that
\[ f\big(\mathbf{z}_t(\omega_{1:t-1})\big) = f\big(\mathbf{x}_t \circ \alpha(\omega_{1:t-1})\big) = f\big(\mathbf{x}_t(b_{1:t-1})\big) = \frac{\omega_t + 1}{2} \quad \forall t \in \{1,\dots,n\} \]
as required.

The following instructive example, taken from [Rakhlin and Sridharan 2014], was designed to exhibit the intricacy of the sequential covering number's definition. We conclude by computing the statistical covering number of $\tilde{\mathcal{O}}$ on the example, and showing that it yields the correct result.

Example 2. Consider the function class
\[ \mathcal{O} = \{ f_a \mid a \in I,\ f_a(x) = 0\ \forall x \ne a,\ f_a(a) = 1 \} \subset Y^I. \]
Assuming that the tree $\vec{x}$ takes on $2^n - 1$ distinct values (the "worst case"), then for any ordered pair $(f_a, \vec{\omega})$ we have that
\[ q_{\vec{x}}(f_a, \vec{\omega}) = \Big( f_a(\mathbf{x}_1), f_a\big(\mathbf{x}_2(\omega_1)\big), \dots, f_a\big(\mathbf{x}_n(\omega_{1:n-1})\big) \Big) \]
is either equal to all zeros, or all zeros with a single coordinate that equals one. The image of $q_{\vec{x}}$ therefore contains at most $n + 1$ points, and in fact $|q_{\vec{x}}(\tilde{\mathcal{O}})| = n + 1$.

4. UNIVERSAL INDUCTION

The third setting is universal induction, which is concerned with predicting computable sequences of binary observations. The setting differs significantly from statistical learning and sequential prediction. For example, universal induction cannot be modeled adversarially since both Nature and Forecaster have too many degrees of freedom.

There are at least two interpretations of universal induction:

U1. Universal. Forecaster has a single, universal theory.
U2. Adaptive. Forecaster constructs a series of theories in response to successive observations.

The first interpretation is standard. The second, which we advocate here, is new. Both are legitimate.

Under the first interpretation, it does not make sense to evaluate the falsifiability of theories, since there is only one theory and it is universal. The only choice that matters is Nature's choice of sequence $\vec{y}$. It then turns out that the number of hypotheses Nature falsifies (eliminates) whilst choosing $\vec{y}$ controls Forecaster's predictive risk; see Section 4.6.

Under the second interpretation, developed in detail below, Forecaster's predictive risk is controlled by the number of hypotheses that Forecaster falsifies whilst adapting its theories.

4.0. Setup

Let $\mathcal{X}$ denote the set of valid programs, where valid programs $\mathcal{X} \subset \bigcup_{t=1}^{\infty} \{0,1\}^t$ form a prefix-free set. A prefix-free universal Turing machine $T$ takes valid programs to outputs. Let $Y^\infty = \{0, 1, 00, 01, 10, 11, 000, \dots\}$ denote the set of all binary sequences, of finite or infinite length. A Turing machine is a function $T : \mathcal{X} \to Y^\infty$.
Let $\mathcal{Y} = T(\mathcal{X}) \subset Y^\infty$ denote the set of computable sequences.

Prefix-free strings formalize the notion of a computer program. For example, the set of valid C++ programs is a prefix-free set since C++'s syntax ensures that one program cannot be the prefix of another.

The set of valid programs has a complicated structure, since it includes strings of varying length. It is mathematically convenient to force programs to have a fixed length. First, let $\mathcal{X}_n = \{\vec{x} \in \mathcal{X} \mid \mathrm{len}(\vec{x}) = t \text{ for some } t \le n\}$. Second, pad out short programs: given a program $\vec{x}$ of length $t < n$, construct $2^{n-t}$ programs of length $n$ by adding arbitrary suffixes to $\vec{x}$. For example, if $\mathrm{len}(\vec{x}) = n - 2$, then the four padded programs are $\{\vec{x}00, \vec{x}01, \vec{x}10, \vec{x}11\}$.

The Turing machine ignores the padding. Concretely, a C++ compiler would also ignore the padding, so the padded-out programs are all functionally equivalent.

Let $\mathcal{H}_n$ denote the set of binary strings of length $\le n$ and let $\mathcal{O}_n \subset \mathcal{H}_n$ denote the set of valid, padded programs of length $n$. Denote the function that strips out the padding by
\[ S_n : \mathcal{H}_n \to \mathcal{X} \cup \{\emptyset\} : \vec{h} \mapsto \begin{cases} \vec{x} & \text{if } \mathcal{O}_n \ni \vec{h} = \vec{x}\vec{s} \text{ for } \vec{x} \text{ a valid program with padding } \vec{s} \\ \emptyset & \text{else.} \end{cases} \]
In other words, if the string contains a valid program as prefix, then $S_n$ strips out the padding. If the string does not contain a valid program, then $S_n$ outputs a null character.

The reason for introducing padded strings is that it allows the following simple description of the Solomonoff prior as a limit distribution, induced by the uniform distribution on padded strings:

Definition-Proposition 16 (Solomonoff prior). Equip $\mathcal{H}_n$ with the uniform distribution for all $n$. Let $P_n$ denote the $S_n$-induced distribution on $\mathcal{X} \cup \{\emptyset\}$. Then
\[ P_S(\vec{x}) := \lim_{n\to\infty} P_n(\vec{x}) = 2^{-\mathrm{len}(\vec{x})} \quad \text{for all } \vec{x} \in \mathcal{X}. \]
Let $Q_n$ denote the $(T \circ S_n)$-induced distribution on $Y^\infty$. The Solomonoff prior is
\[ Q_{SOL}(\vec{y}) := \lim_{n\to\infty} Q_n(\vec{y}) = \sum_{\{\vec{x} \,\mid\, T(\vec{x}) = \vec{y}\bullet\}} 2^{-\mathrm{len}(\vec{x})}. \]

Proof.
The standard definition of the Solomonoff prior, and a demonstration that our definition coincides with the standard one, are provided in Section 4.5.

Proposition 16 allows us to consider how Solomonoff induction acts on inputs to the Turing machine, instead of its outputs.

4.1. The risk (UNI)

For universal induction, the loss compares the sequences generated by Nature and Forecaster element-wise:
\[ \ell : Y \times Y \to \mathbb{R} : (y, y') \mapsto \mathbb{I}[y \ne y'], \]
where as above $Y = \{0, 1\}$.

Definition A (risk, UNI). The risk for universal induction is
\[ R^n : \mathcal{H}_n \times \mathcal{H}_n \to \mathbb{R}_{\ge 0} : (\vec{x}, \vec{f}) \mapsto \sum_{t=1}^{\infty} \ell\big( T(\vec{x})_t, T(\vec{f})_t \big). \]
The risk of theory $\mathcal{O}_n := \mathcal{X}_n$ is
\[ R^{UNI}_{\mathcal{O}_n} : \mathcal{H} \to \mathbb{R}_{\ge 0} : \vec{x} \mapsto \inf_{\vec{f}\in\mathcal{O}_n} \sum_{t=1}^{\infty} \ell\big( T(\vec{f})_t, T(\vec{x})_t \big). \]

As for statistical learning and sequential prediction, we reinterpret the risk as a function from hypotheses (that is, programs with length at most $n$) to nonnegative reals:
\[ R^n_{\vec{y}} : \mathcal{H}_n \to \mathbb{R}_{\ge 0} : \vec{x} \mapsto \sum_{t=1}^{\infty} \ell\big( T(\vec{x})_t, y_t \big). \tag{A} \]
In the limit we obtain $R^{UNI}_{\vec{y}} := \lim_{n\to\infty} R^n_{\vec{y}}$ as a function $R^{UNI}_{\vec{y}} : \mathcal{H} \to \mathbb{R}_{\ge 0}$.

4.2. Learnability (UNI)

Suppose that Nature chooses a sequence $\vec{y} \in \mathcal{Y}$ and reveals $\vec{y}_{1:t-1} = (y_1, \dots, y_{t-1})$ at time $t$. Let $\Psi_t = \{\psi_t : Y^{t-1} \to \Delta Y\}$ denote the set of strategies available to Forecaster in round $t$, and $\Psi = \prod_{t=1}^{\infty} \Psi_t$ the set of all strategies available to Forecaster. The risk of strategy $\psi$ is
\[ R^{UNI}_{\psi} : \mathcal{H} \to \mathbb{R}_{\ge 0} : \vec{x} \mapsto \sum_{t=1}^{\infty} \mathbb{E}\, \ell\Big( \psi_t\big( T(\vec{x})_{1:t-1} \big), T(\vec{x})_t \Big), \]
where the expectation is over the outputs of the (probabilistic) strategy.

A particularly important strategy is Solomonoff induction [Solomonoff 1964]:

Definition-Proposition 17 (Solomonoff induction). Let
\[ \mathcal{O}^n_t := \big(R^n_{y_{1:t-1}}\big)^{-1}(0) = \{ \text{hypotheses of length} \le n \text{ that explain } y_{1:t-1} \}. \]
Theory $\mathcal{O}^n_t$ is a finite set; equip it with the uniform distribution. Let $P_{n,t}(\vec{x})$ denote the $S_n$-induced distribution on $\mathcal{X}$ and $Q_{n,t}(\vec{y})$ denote the $(T \circ S_n)$-induced distribution on $Y^\infty$. Solomonoff induction is the strategy:
\[ (\psi_{SOL})_t : Y^{t-1} \to \Delta Y : y_{1:t-1} \mapsto \lim_{n\to\infty} Q_{n,t}(y_t) = Q_{SOL}(y_t | y_{1:t-1}). \]
Solomonoff induction depends on the choice of Turing machine, although this dependence is typically not explicit in our notation.

Proof. We show that $\lim_{n\to\infty} Q_{n,t}(y_t) = Q_{SOL}(y_t | y_{1:t-1})$ in Section 4.5.

Solomonoff induction can be interpreted as follows. Forecaster's theory at time step $t$ is $\mathcal{O}_t := \lim_{n\to\infty} \mathcal{O}^n_t$, a limit of finite sets. All hypotheses consistent with the previous observations $y_{1:t-1}$ are weighted equally (recalling that padding entails redundancies). Forecaster predicts the next observation by drawing from $\mathcal{O}_t$ uniformly at random. After observing $y_t$, and regardless of whether or not Forecaster's prediction at time $t$ was correct, Forecaster constructs a new theory $\mathcal{O}_{t+1}$ in the light of $y_t$.

In short, Solomonoff induction learns by constructing a nested set of progressively smaller theories and predicts by sampling from them uniformly at random.

Definition B (predictive risk, UNI). The predictive risk of strategy $\psi$ and theory $\mathcal{O}_n$ is
\[ V^{UNI}_{\psi}(\mathcal{O}_n | \vec{y}) := R^{UNI}_{\psi}(\vec{y}) - R^{UNI}_{\mathcal{O}_n}(\vec{y}). \]
The predictive risk of strategy $\psi$ is
\[ V^{UNI}(\psi | \vec{y}) := \lim_{n\to\infty} V^{UNI}_{\psi}(\mathcal{O}_n | \vec{y}). \tag{B} \]

4.3. Falsifiability (UNI)

This subsection and the next relate the error accumulated using Solomonoff induction to the falsifiability of the string chosen by Nature.

Definition C (falsifiability, UNI).
\[ G^{UNI}_T(\vec{y}) := \lim_{n\to\infty} \mathrm{Gain}(R^n_{\vec{y}}, 0). \tag{C-h} \]

Remark 4. The definition for universal induction differs from statistical learning and sequential prediction in that the coefficient $\frac{1}{n}$ is not present, and so $G^{UNI}$ does not necessarily take values in $[0, 1]$.

To interpret hard falsifiability, first fix an ambient hypothesis space $\mathcal{H}_n$, and consider the hypotheses falsified when observing the substring $y_{1:t}$:
\[ \begin{aligned} G^n_T(\vec{y}_{1:t}) &= \log 2^n - \log |\mathcal{O}^n_t| \\ &= \{ \text{log-\# strings of length } n \} - \{ \text{log-\# strings that output } y_{1:t} \} \\ &= \{ \text{log-\# strings of length } n \text{ falsified by } y_{1:t} \}. \end{aligned} \]
Second, consider the hypotheses eliminated when transitioning between theories:
\[ \begin{aligned} \log |\mathcal{O}^n_{t-1}| - \log |\mathcal{O}^n_t| &= \{ \text{log-\# strings outputting } y_{1:t-1} \} - \{ \text{log-\# strings outputting } y_{1:t} \} \\ &= \{ \text{log-\# strings falsified when modifying } \mathcal{O}_{t-1} \mapsto \mathcal{O}_t \}. \end{aligned} \]
Finally, combining the above yields
\[ \begin{aligned} G^{UNI}_T(\vec{y}) &= \sum_{t=1}^{\infty} \lim_{n\to\infty} \Big( G^n_T(\vec{y}_{1:t}) - G^n_T(\vec{y}_{1:t-1}) \Big) \quad \text{where } y_{1:0} := \emptyset \\ &= \sum_{t=1}^{\infty} \lim_{n\to\infty} \Big( \log |\mathcal{O}^n_{t-1}| - \log |\mathcal{O}^n_t| \Big) \\ &= \sum_{t=1}^{\infty} \{ \text{log-\# strings falsified when modifying } \mathcal{O}_{t-1} \mapsto \mathcal{O}_t \}. \end{aligned} \]
Thus, the hard falsifiability of $\vec{y}$ is the number of hypotheses Forecaster eliminates in the process of adapting its theory to the data. Note that theories are falsified prior to predicting: at time $t$, Forecaster first eliminates hypotheses based on $y_{1:t}$ and then uses the new theory $\mathcal{O}_{t+1}$ to predict $y_{t+1}$.

4.4. Falsifiable =⇒ Learnable (UNI)

The main theorem for universal induction differs from statistical learning and sequential prediction in that Forecaster's theory is not fixed. Falsifiability quantifies the hypotheses that Forecaster eliminates whilst adapting its theory. The more Forecaster is required to adapt, prior to predicting, the weaker the guarantee on its predictive performance.

Theorem E (main theorem, UNI). The predictive risk under Solomonoff induction (1) coincides with the expected error and (2) is bounded by the number of hypotheses Nature falsifies when choosing the string $\vec{y}$:
\[ V(\psi_{SOL} | \vec{y}) = R^{UNI}_{\psi_{SOL}}(\vec{y}) \le G^{UNI}_T(\vec{y}). \tag{E} \]

Proof. By Lemma 19, the predictive risk and risk coincide for universal induction:
\[ V^{UNI}(\psi | \vec{y}) = R^{UNI}_{\psi}(\vec{y}). \]
By Proposition 20, the hard falsifiability of $\vec{y}$ coincides with (the negative logarithm of) the Solomonoff prior:
\[ G^{UNI}_T(\vec{y}) = -\log Q_{SOL}(\vec{y}). \]
Finally, the result follows by Solomonoff's Theorem 21.

More generally, Theorem E suggests that Bayesian updating is a way of modifying theories, whose cost (measured in errors) can be bounded using falsifiability.

We conclude by relating falsifiability to Kolmogorov complexity.
Intuitively, a string is simple if it is the output of a short computer program. More formally,

Definition 9 (Kolmogorov complexity). The Kolmogorov complexity of a string, with respect to Turing machine $T$, is the length of the shortest program that outputs the string as a prefix [Kolmogorov 1965]:
\[ K_T(\vec{y}) := \min_{\vec{x}\in\mathcal{X}} \big\{ \mathrm{len}(\vec{x}) \,\big|\, T(\vec{x}) = \vec{y}\bullet \big\} \]

The Kolmogorov complexity $K_T$ depends on the choice of Turing machine up to an additive constant that does not depend on $\vec{y}$ [Li and Vitányi 2008].

Proposition 18 (relation between falsifiability and Kolmogorov complexity). Falsifiability lower bounds Kolmogorov complexity:
\[ G^{UNI}_T(\vec{y}) \le K_T(\vec{y}). \]
Further, $G^{UNI}_T(\vec{y}) = K_T(\vec{y})$ up to an additive constant that does not depend on $\vec{y}$.

Proof. The inequality follows from the definitions of the Solomonoff prior and Kolmogorov complexity. By Levin's coding theorem [Li and Vitányi 2008], the Kolmogorov complexity of a string coincides with the negative log probability of the string according to the Solomonoff prior up to an additive constant.

4.5. Proofs (UNI)

Equip $\mathcal{H}_n$ with the uniform distribution and let $P_{S_n}(\mathcal{X})$ denote the $S_n$-induced distribution on $\mathcal{X}$. Recall that we defined the Solomonoff prior as the limit of the $T \circ S_n$-induced distribution on $\mathcal{Y}$,
\[ Q_{SOL}(\vec{y}) := \lim_{n\to\infty} P_{T\circ S_n}(\vec{y}), \quad \text{where } T \circ S_n : \mathcal{H}_n \xrightarrow{\ S_n\ } \mathcal{X} \cup \{\emptyset\} \xrightarrow{\ T\ } \mathcal{Y} \cup \{\emptyset\}. \]

Definition-Proposition 16. The following hold:

(1) The limit $P_S(\mathcal{X}) := \lim_{n\to\infty} P_n(\mathcal{X})$ is well-defined with $P_S(\vec{x}) = 2^{-\mathrm{len}(\vec{x})}$.
(2) The limit $Q_{SOL}(\mathcal{Y}) := P_{T\circ S}(\mathcal{Y}) = \lim_{n\to\infty} P_{T\circ S_n}(\mathcal{Y})$ is well-defined and coincides with the Solomonoff prior. That is,
\[ Q_{SOL}(\vec{y}) = \sum_{\{\vec{x}\in\mathcal{X} \,\mid\, T(\vec{x}) = \vec{y}\bullet\}} 2^{-\mathrm{len}(\vec{x})}. \]

Proof. Claim 1. By Lemma 2, the induced probability of a valid program is
\[ P_{S_n}(\vec{x}) = \begin{cases} \sum_{\vec{x}\vec{s}} \frac{1}{2^n} = \frac{2^{n-\mathrm{len}(\vec{x})}}{2^n} = 2^{-\mathrm{len}(\vec{x})} & \text{if } \mathrm{len}(\vec{x}) \le n \\ 0 & \text{else.} \end{cases} \]
Thus, $\lim_{n\to\infty} P_{S_n}(\vec{x}) = 2^{-\mathrm{len}(\vec{x})}$ for all valid programs.

Claim 2. Also by Lemma 2.
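The padding argument in Claim 1 and the inequality in Proposition 18 can be checked by enumeration on a toy example. The sketch below is an illustration only, under loudly stated assumptions: `TOY_MACHINE` is a hypothetical three-program prefix-free "machine" (not a universal Turing machine), the ambient space is taken to be the binary strings of length exactly $n$ as in the counting argument of Claim 1, and all function names are invented for this example.

```python
import math
from fractions import Fraction
from itertools import product

# Hypothetical toy machine: prefix-free programs mapped to their outputs.
TOY_MACHINE = {"0": "0000", "10": "0101", "110": "0011"}


def strip_padding(h):
    """Toy S_n: return the valid program that is a prefix of h, else None."""
    for program in TOY_MACHINE:
        if h.startswith(program):
            return program
    return None


def induced_program_distribution(n):
    """Push the uniform distribution on {0,1}^n forward through strip_padding."""
    counts = {}
    for bits in product("01", repeat=n):
        program = strip_padding("".join(bits))
        counts[program] = counts.get(program, 0) + 1
    return {p: Fraction(c, 2 ** n) for p, c in counts.items()}


def prior(y, n):
    """Toy Q_n(y): total mass of programs whose output starts with y."""
    dist = induced_program_distribution(n)
    return sum(
        mass
        for program, mass in dist.items()
        if program is not None and TOY_MACHINE[program].startswith(y)
    )


def complexity(y):
    """Toy K_T(y): length of the shortest program whose output starts with y."""
    return min(len(p) for p, out in TOY_MACHINE.items() if out.startswith(y))


dist = induced_program_distribution(4)
for program in TOY_MACHINE:
    # Each program receives mass 2^(-len(program)), as in Claim 1.
    print(program, dist[program], Fraction(1, 2 ** len(program)))

falsifiability = -math.log2(prior("00", 4))  # toy analogue of -log Q_SOL
print(falsifiability, complexity("00"))
```

In this toy setting the string "00" is produced by two programs with total mass 5/8, so the analogue of $-\log Q_{SOL}$ is strictly smaller than the length of the shortest program, in line with the inequality of Proposition 18.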
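Recall that the standard definition of Solomonoff induction is as the strategy:
\[ (\psi_{SOL})_t : Y^{t-1} \to \Delta Y : y_{1:t-1} \mapsto Q_{SOL}(y_t | y_{1:t-1}) := \frac{Q_{SOL}(y_{1:t})}{Q_{SOL}(y_{1:t-1})}. \]

Definition-Proposition 17. The two definitions of Solomonoff induction coincide:
\[ \lim_{n\to\infty} Q_{n,t}(y_t) = \frac{Q_{SOL}(y_{1:t})}{Q_{SOL}(y_{1:t-1})}. \]

Proof. The theory $\mathcal{O}^n_t$ is the set of all strings of length $\le n$ consistent with the observation $y_{1:t-1}$. Pushing the uniform distribution on $\mathcal{H}_n$ forward onto $Y^\infty$ yields, asymptotically, the conditional Solomonoff distribution.

Lemma 19 (predictive risk reduces to risk). If $\vec{y}$ is computable then
\[ V^{UNI}(\psi | \vec{y}) := \lim_{n\to\infty} V^{UNI}_{\psi}(\mathcal{O}_n | \vec{y}) = R^{UNI}_{\psi}(\vec{y}). \]

Proof. As $n \to \infty$, the theory incorporates all valid programs, and so can match any computable sequence. Thus, $\lim_{n\to\infty} R^{UNI}_{\mathcal{O}_n}(\vec{y}) = 0$ and the result follows.

Proposition 20 (hard falsifiability and Solomonoff prior). The hard falsifiability of string $\vec{y}$ for Turing machine $T$ is
\[ G^{UNI}_T(\vec{y}) = -\log Q_{SOL}(\vec{y}). \]

Proof. Observe that the risk factorizes as
\[ R^{UNI}_{\vec{y}} : \mathcal{X} \xrightarrow{\ T\ } \mathcal{Y} \xrightarrow{\ \sum \ell\ } \mathbb{R}, \qquad \vec{x} \mapsto T(\vec{x}) \mapsto \sum_{t=1}^{\infty} \ell\big( T(\vec{x})_t, y_t \big). \]
The proposition follows from the following two claims.

Claim 1. $\mathrm{Gain}(T, \vec{y}) = -\log Q_{SOL}(\vec{y})$ for all $\vec{y} \in \mathcal{Y}$.

Consider the function $T : \mathcal{X} \to \mathcal{Y}$, where $\mathcal{X}$ is equipped with the distribution $P_S(\mathcal{X})$ from Proposition 16. Since Turing machines are deterministic, we have that $P_T(\vec{y}|\vec{x}) = 1$, and so
\[ P_T(\vec{x}|\vec{y}) = \frac{P_T(\vec{y}|\vec{x}) \cdot P_S(\vec{x})}{P_T(\vec{y})} = \frac{P_S(\vec{x})}{P_T(\vec{y})}. \]
It follows that
\[ \begin{aligned} \mathrm{Gain}(T, \vec{y}) &= D\big[ P_T(\mathcal{X}|\vec{y}) \,\big\|\, P_S(\mathcal{X}) \big] = \sum_{\vec{x}\in\mathcal{X}} P_T(\vec{x}|\vec{y}) \log \frac{P_T(\vec{x}|\vec{y})}{P_S(\vec{x})} \\ &= \sum_{\vec{x}\in\mathcal{X}} P_T(\vec{x}|\vec{y}) \log \frac{P_T(\vec{y}|\vec{x}) \cdot P_S(\vec{x})}{P_T(\vec{y}) \cdot P_S(\vec{x})} = \sum_{\vec{x}\in\mathcal{X}} P_T(\vec{x}|\vec{y}) \log \frac{1}{P_T(\vec{y})} \\ &= -\log P_T(\vec{y}) = -\log Q_{SOL}(\vec{y}), \end{aligned} \]
where the last equality follows from Proposition 16.

Claim 2. $G^{UNI}(\vec{y}) = \mathrm{Gain}(T, \vec{y})$.

This follows from $G^{UNI}(\vec{y}) = \mathrm{Gain}(R_{\vec{y}}, 0)$ and $R^{-1}_{\vec{y}}(0) = T^{-1}(\vec{y})$.

Concatenating the claims yields the desired result.

Theorem 21 (generalization bound for Solomonoff induction).
\[ \sum_{t=1}^{\infty} \mathbb{E}\, \ell\big( \psi_{SOL}(y_{1:t-1}), y_t \big) \le -\log Q_{SOL}(\vec{y}). \]

Proof.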
The following proof is taken from [Hutter 2011]:
\[ \sum_{t=1}^{\infty} \mathbb{E}\, \ell\big( \psi_{SOL}(y_{1:t-1}), y_t \big) = \sum_{t=1}^{\infty} \big| 1 - Q_{SOL}(y_t | y_{1:t-1}) \big| \le -\sum_{t=1}^{\infty} \log Q_{SOL}(y_t | y_{1:t-1}) = -\log Q_{SOL}(\vec{y}), \]
where the inequality holds because $1 - x \le -\log x$.

4.6. Interpreting Solomonoff induction as a universal theory

Under the standard interpretation, Forecaster's theory is $\mathcal{O}$ and $G^{UNI}_T(\vec{y})$ counts the hypotheses falsified by Nature whilst choosing $\vec{y}$:
\[ \begin{aligned} G^{UNI}_T(\vec{y}) &= \lim_{n\to\infty} \Big[ \log \{ \text{\# strings of length } n \} - \log \{ \text{\# that output } \vec{y} \} \Big] \\ &= \lim_{n\to\infty} \{ \text{log-\# strings of length } n \text{ that Nature falsifies} \}. \end{aligned} \]

5. DISCUSSION

[A] theory of induction is superfluous. It has no function in a logic of science. The best we can say of a hypothesis³ is that up to now it has been able to show its worth, and that it has been more successful than other hypotheses although, in principle, it can never be justified, verified, or even shown to be probable. This appraisal of the hypothesis relies solely upon deductive consequences (predictions) which may be drawn from the hypothesis: There is no need even to mention 'induction'. – from [Popper 1959].

³ This paper uses 'theory' in the sense that Popper uses 'hypothesis'.

We conclude by discussing the paper's implications for scientific inference, focusing on the ideas of Karl Popper.

According to Popper, inductive inference is meaningless. As an alternative, he advocated hypothetico-deductive inference, which proceeds as follows [Gelman and Shalizi 2013]. Forecaster makes observations, proposes a theory, and deduces consequences. A theory is scientific if it is falsifiable, that is, if it is possible to deduce empirically testable consequences. The scientific method, according to Popper, is: to propose falsifiable theories that are in line with past observations; to subject them to severe empirical tests; and to discard and replace them if and when they are falsified.

Popper's ideas are extremely influential in the scientific community.
Indeed, he is essentially the only philosopher that scientists draw on as a resource to evaluate and compare theories. Philosophers, however, consider Popper's approach to be fundamentally flawed [Godfrey-Smith 2011]. The three main problems that have been identified are:

P1. Infinite alternatives. The set of imaginable hypotheses is infinite, so that it is trivial to find a collection of specific hypotheses that a specific theory falsifies.
P2. Stochasticity. It is unclear how to apply Popper's ideas to stochastic theories, which cannot be definitely falsified.
P3. No confirmation. Popper rejected the notion that positive evidence should increase our confidence in a scientific theory. Rejecting confirmation eliminates any rationale, aside from habit, for using a well-tested theory over a brand new theory, assuming both are falsifiable.

Our formulation of falsifiability does not exactly line up with what Popper had in mind. We proceed regardless.

Problem P1 is solved by restricting attention to the finite set of effective hypotheses. Problem P2 is also solved as a corollary of our results. Soft and hard falsifiability are defined with respect to deterministic hypotheses, whereas the predictive risk allows probabilistic hypotheses.

Problem P3 is more interesting. If Nature is i.i.d. then Theorem D'' provides a guarantee on a predictor's future accuracy that depends on the theory's falsifiability and the predictor's past performance. Thus, with the addition of the i.i.d. assumption, there is quantifiable confirmation.

If no assumptions are made about Nature's behavior, then the setting is sequential prediction. The most that can be said is that, if a theory is falsifiable, then its predictive performance can be as good as its explanatory performance in hindsight. Nothing absolute can be said about predictive performance a priori.

Finally, Solomonoff induction is purported to be a (non-computable) theory that optimally explains and predicts every computable string.
However, observe that Theorem E says nothing about Solomonoff induction's predictive performance unless $G^{UNI}_T(\vec{y})$ or the Kolmogorov complexity $K_T(\vec{y})$ are known a priori, which is never the case. For example, suppose Nature picks a string that contains $10^9$ zeros followed by $10^9$ coin flips, followed by only zeros. Solomonoff induction's error rate on the first billion instances will not be indicative of its performance on the next billion. Assuming that Nature chooses strings with low Kolmogorov complexity is analogous to, albeit weaker than, assuming Nature is i.i.d.

The current state of the art in learning theory therefore supports Popper's intuitions about falsifiability, including his rejection of confirmation. In a more positive vein, learning theory suggests that inductive inference requires additional assumptions and provides tools for analyzing their implications.

Acknowledgments. I am grateful to Samory Kpotufe, Jacob Abernethy and Pedro Ortega for useful discussions.

REFERENCES

Jacob Abernethy, Alekh Agarwal, Peter L Bartlett, and Alexander Rakhlin. 2009. A stochastic view of optimal regret through minimax duality. In COLT.
David Balduzzi. 2011. Information, learning and falsification. In Philosophy and Machine Learning workshop, Neural Information Processing Systems (NIPS). arXiv (2011).
David Balduzzi. 2013. Falsification and Future Performance. In Algorithmic Probability and Friends: Bayesian Prediction and Artificial Intelligence, David Dowe (Ed.). LNAI, Vol. 7070. Springer, 65–78.
S Boucheron, G Lugosi, and P Massart. 2000. A Sharp Concentration Inequality with Applications. Random Structures and Algorithms 16, 3 (2000), 277–292.
Olivier Bousquet, Stéphane Boucheron, and Gábor Lugosi. 2004. Introduction to Statistical Learning Theory. In Advanced Lectures on Machine Learning, O Bousquet, U von Luxburg, and G Rätsch (Eds.). Springer, 169–207.
Nicolo Cesa-Bianchi and Gabor Lugosi. 2006. Prediction, Learning and Games. Cambridge University Press.
David Corfield, Bernhard Schölkopf, and V Vapnik. 2009. Falsification and Statistical Learning Theory: Comparing the Popper and Vapnik-Chervonenkis Dimensions. Journal for General Philosophy of Science 40, 1 (2009), 51–58.
Andrew Gelman and Cosma Shalizi. 2013. Philosophy and the practice of Bayesian statistics. Brit. J. Math. Statist. Psych. 66 (2013), 8–38.
Peter Godfrey-Smith. 2011. Popper's Philosophy of Science: Looking Ahead. In The Cambridge Companion to Popper, J Shearmur and G Stokes (Eds.). Cambridge University Press.
Gilbert Harman and Sanjeev Kulkarni. 2007. Reliable Reasoning: Induction and Learning Theory. MIT Press.
Marcus Hutter. 2011. Universal Learning Theory. In Encyclopedia of Machine Learning, Claude Sammut and Geoffrey I Webb (Eds.). Springer.
A N Kolmogorov. 1965. Three approaches to the quantitative definition of information. Problems Inform. Transmission 1, 1 (1965), 1–7.
V Koltchinskii. 2001. Rademacher penalties and structural risk minimization. IEEE Trans. Inf. Theory 47 (2001), 1902–1914.
M Li and P Vitányi. 2008. An Introduction to Kolmogorov Complexity and Its Applications. Springer.
Karl Popper. 1959. The Logic of Scientific Discovery. Hutchinson.
Alexander Rakhlin and Karthik Sridharan. 2014. STAT928: Statistical Learning Theory and Sequential Prediction. Lecture Notes.
Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. 2014. Online Learning via Sequential Complexities. In JMLR.
R J Solomonoff. 1964. A formal theory of inductive inference I, II. Inform. Control 7, 1–22, 224–254 (1964).
V Vapnik. 1995. The Nature of Statistical Learning Theory. Springer.