The Archimedean trap: Why traditional reinforcement learning will probably not yield AGI

Samuel A. Alexander∗†
The U.S. Securities and Exchange Commission
2020

∗ Email: samuelallenalexander@gmail.com
† 2010 Mathematics Subject Classification: 97R40 (Primary), 26E35 (Secondary)

Abstract

After generalizing the Archimedean property of real numbers in such a way as to make it adaptable to non-numeric structures, we demonstrate that the real numbers cannot be used to accurately measure non-Archimedean structures. We argue that, since an agent with Artificial General Intelligence (AGI) should have no problem engaging in tasks that inherently involve non-Archimedean rewards, and since traditional reinforcement learning rewards are real numbers, traditional reinforcement learning cannot lead to AGI. We indicate two possible ways traditional reinforcement learning could be altered to remove this roadblock.

1 Introduction

Whenever we measure anything using a particular number system, the corresponding measurements will be constrained by the structure of that number system. If the number system has a different structure than the things we are measuring with it, then our measurements will suffer accordingly, just as if we were trying to force square pegs into round holes.

For example, the natural numbers make lousy candidates for measuring lengths in a physics laboratory. Lengths in the lab have the property that, for any two distinct lengths, there is an intermediate length strictly between them. The natural numbers lack this property. Imagine the poor physicist, brought up in a world of only natural numbers, scratching his or her head upon encountering a rod with length strictly between two rods of length 1 and 2.

It is tempting to think of the real numbers $\mathbb{R}$ (i.e., the unique complete ordered field) as a generic number system with whatever structure suits our needs. But the real numbers do have their own specific structure.
That structure is flexible enough to accommodate many needs, but we shouldn't just take that for granted. One particular property constraining the real numbers is the following.

Lemma 1. (The Archimedean Property¹) Let $r > 0$ be any positive real number. For every real number $y$, there is some natural number $n$ such that $nr > y$.

Rather than directly prove Lemma 1, we will prove a generalized result which, we will argue, is more adaptable to other structures.

Lemma 2. (The Generalized Archimedean Property) Let $r > 0$ be any positive real number. For any $x, y \in \mathbb{R}$, say that $x$ is significantly less than $y$ if $x \le y - r$. If $x_0, x_1, x_2, \ldots$ is any infinite sequence of real numbers, where each $x_i$ is significantly less than $x_{i+1}$, then for every real number $y$, there exists some $i$ such that $y$ is significantly less than $x_i$.

Proof. If not, then there is some $y$ such that $y + r > x_i$ for all $i$. Thus, $X = \{x_0, x_1, x_2, \ldots\}$ has an upper bound. By the completeness of $\mathbb{R}$, $X$ must have a least upper bound $z \in \mathbb{R}$. Since $z$ is the least upper bound for $X$, $z - r$ is not an upper bound for $X$, so there is some $i$ such that $x_i > z - r$. By assumption, $x_i \le x_{i+1} - r$, so $x_{i+1} \ge x_i + r > z$, contradicting the choice of $z$.

Lemma 1 follows from Lemma 2 by letting $x_i = ir$.

The above property is automatically inherited, in obvious ways, by subsystems of the reals, such as the rational numbers $\mathbb{Q}$, the natural numbers $\mathbb{N}$, the integers $\mathbb{Z}$, or the algebraic numbers.

Lemma 2 allows us to adapt the notion of Archimedeanness to things other than real numbers, even to things for which there is no notion of arithmetic at all (Lemma 1 would not adapt to such things). All we need is a notion of "significantly greater than". For any set of things, some of which are "significantly greater than" others, we can ask whether or not the property in Lemma 2 holds. We will make this formal in Section 2.

¹ The Archimedean property is named after Archimedes of Syracuse. A similar property appears as the fifth axiom in his On the Sphere and Cylinder [5]: "Further, of unequal lines, unequal surfaces, and unequal solids, the greater exceeds the less by such a magnitude as, when added to itself, can be made to exceed any assigned magnitude among those which are comparable with [it and with] one another." Note that Archimedes specifically restricts his statement to lengths, surface areas and volumes, in fact going out of his way to limit the magnitudes which said length/area/volume can be made to exceed (he could have saved some words by stopping his sentence at "...can be made to exceed any assigned magnitude", if that were his intention). The Archimedean property is also closely related to Definition 4 of Book V of Euclid's Elements [12]: "(Those) magnitudes are said to have a ratio with respect to one another which, being multiplied, are capable of exceeding one another." Proposition 1 of Book X is also relevant. Many math historians speak of the modern-day Archimedean property, Archimedes' 5th axiom, and Euclid's properties as being identical, but in fact they are all subtly different from one another; see [6].

Example 3. (Fuzzy widgets) Suppose we have some fuzzy widgets, and we observe that certain widgets are fuzzier than others. Naturally, we are inclined to quantify the fuzziness of the widgets, assigning them numerical fuzziness measures from some number system. Nine times out of ten, we choose to use the real numbers, or a subsystem thereof, often without a second thought. But suppose among these widgets, there happen to be widgets $w_1, w_2, \ldots$ such that each $w_i$ is significantly less fuzzy than $w_{i+1}$, and another widget $w_\infty$ such that all the $w_i$'s are significantly less fuzzy than $w_\infty$. Suddenly, our decision to use real numbers puts us in a bind.
It is impossible to assign real number fuzziness measures to our widgets in such a way that significantly less fuzzy widgets get significantly smaller real number measures. That would contradict Lemma 2.

Note that the above example does not require us to have any notion of multiplying fuzziness by a natural number $n$ (as we would need to have if we wanted to adapt Lemma 1). This illustrates the enhanced adaptability of Lemma 2.

The structure of this paper is as follows.
• In Section 2 we formally adapt Lemma 2 to obtain a notion of Archimedeanness for non-numerical structures, and demonstrate that non-Archimedean such structures cannot accurately be measured using the real numbers.
• In Section 3 we argue that traditional reinforcement learning will not lead to AGI because its rewards are overly constrained.
• In Section 4 we discuss non-traditional variations of reinforcement learning that avoid the problem of overly constrained rewards.
• In Section 5 we summarize and make concluding remarks.

2 Generalized Archimedean Structures

The real numbers possess the Archimedean property, but other structures may or may not. To make this more precise, we introduce the following formalism, adapting from Lemma 2.

Definition 1. A significantly-ordered structure is a collection $X$ with an ordering $\ll$. For $x_1, x_2 \in X$, we say $x_1$ is significantly less than $x_2$ if $x_1 \ll x_2$. A significantly-ordered structure is Archimedean if it has the following property: for every $X$-sequence $x_0 \ll x_1 \ll x_2 \ll \cdots$, for every $x_\infty \in X$, there is some $i$ such that $x_\infty \ll x_i$.

For any real number $r > 0$, a prototypical example of an Archimedean significantly-ordered structure is the real numbers with $\ll$ defined such that $x \ll y$ if and only if $x \le y - r$.

Definition 2. Suppose $(X, \ll)$ is a significantly-ordered structure. A function $f : X \to \mathbb{R}$ is said to accurately measure $(X, \ll)$ if there is some real $r > 0$ such that the following requirement holds:
• For all $x, y \in X$, $x \ll y$ if and only if $f(x) \le f(y) - r$.
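To make Definition 2 concrete, the following Python sketch checks its requirement by brute force on a finite sample drawn from the prototypical Archimedean structure (the reals with $x \ll y$ if and only if $x \le y - 0.5$). The function names and the sample values are illustrative assumptions, not part of the formal development, and a finite check of this kind can only refute accurate measurement; it cannot establish it for an infinite $X$.

```python
from itertools import product


def accurately_measures(elements, sig_less, f, r):
    """Brute-force check of Definition 2 on a finite sample.

    Returns True when, for every pair (x, y) in the sample,
    x << y holds exactly when f(x) <= f(y) - r.
    """
    return all(
        sig_less(x, y) == (f(x) <= f(y) - r)
        for x, y in product(elements, repeat=2)
    )


# Hypothetical sample; all values are exact in binary floating point.
SAMPLE = [0.0, 0.25, 0.5, 1.0, 2.0]


def sig_less(x, y):
    """The prototypical relation: x << y if and only if x <= y - 0.5."""
    return x <= y - 0.5


def identity(x):
    """Candidate measure f(x) = x."""
    return x


def halved(x):
    """Candidate measure f(x) = x / 2 (a rescaled measurement)."""
    return x / 2


if __name__ == "__main__":
    print(accurately_measures(SAMPLE, sig_less, identity, 0.5))
    print(accurately_measures(SAMPLE, sig_less, halved, 0.5))
    print(accurately_measures(SAMPLE, sig_less, halved, 0.25))
```

On this sample the rescaled measure fails the requirement for $r = 0.5$ but satisfies it for $r = 0.25$, which is why Definition 2 only asks that some $r > 0$ exist rather than fixing $r$ in advance.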
The following proposition formalizes the dilemma we illustrated in Example 3 (think of $X$ as a set of things we want to measure).

Proposition 4. (Inadequacy of the reals for non-Archimedean structures) Suppose $(X, \ll)$ is a significantly-ordered structure. If $X$ is non-Archimedean, then no function $f : X \to \mathbb{R}$ accurately measures $(X, \ll)$.

Proof. Assume, for sake of a contradiction, that some $f : X \to \mathbb{R}$ exists which accurately measures $(X, \ll)$. Thus there is some real $r > 0$ such that for all $x, y \in X$, $x \ll y$ if and only if $f(x) \le f(y) - r$. Since $X$ is non-Archimedean, there is some $X$-sequence $x_0 \ll x_1 \ll x_2 \ll \cdots$ and some $x_\infty \in X$ such that there is no $i$ such that $x_\infty \ll x_i$. By choice of $r$, each $f(x_i) \le f(x_{i+1}) - r$ and there is no $i$ such that $f(x_\infty) \le f(x_i) - r$. This contradicts Lemma 2, applied to the sequence $f(x_0), f(x_1), f(x_2), \ldots$ and the real number $y = f(x_\infty)$.

Proposition 4 tells us that we cannot accurately measure non-Archimedean structures using real numbers². Any attempt to do so will necessarily be misleading, because ordering relationships among the non-Archimedean structures will fail to be reflected by the real-number measurements given to them. We will inevitably end up like the puzzled physicist brought up in a world of only natural numbers, confronted by a rod of length 1.5.

Remark 5. (Spearman's Law of Diminishing Returns) Suppose $(X, \ll)$ is a non-Archimedean significantly-ordered structure with elements $x_0, x_1, \ldots$ and $x_\infty$ such that $x_0 \ll x_1 \ll \cdots$ and each $x_i \ll x_\infty$. Suppose $f : X \to \mathbb{R}$ has the property that $f(x_0) < f(x_1) < \cdots$ and each $f(x_i) < f(x_\infty)$. Then the monotone convergence theorem implies that the sequence $f(x_i)$ converges as $i \to \infty$, which in turn implies that $\lim_{i \to \infty} (f(x_{i+1}) - f(x_i)) = 0$.
This suggests a general law of diminishing returns: any time a non-Archimedean significantly-ordered structure $(X, \ll)$ is measured using real numbers, if the measurement does not blatantly violate $\ll$ (in other words, if there are no $x \ll y$ such that $x$ is given a larger real-number measurement than $y$), then there will inevitably be elements $x_0 \ll x_1 \ll \cdots$ exhibiting diminishing returns, in the sense that the measurements of $x_i$ and $x_j$ are approximately equal for large $i, j$. If human intelligence is non-Archimedean, this could potentially shed light on a psychometrical phenomenon called Spearman's Law of Diminishing Returns [29] [8] [13], the empirical tendency of cognitive ability tests to be less correlated in high-intelligence populations. Even tiny measurement errors would eventually dominate the test result differences as the true results plateau.

² There is an area of research known as measurement theory, which, traditionally, "takes the real numbers as a pre-given numerical domain" [20]. Some work has been done to generalize measurement theory away from this assumption [19] [28] [25]. We would submit this paper as further motivation in that direction.

Example 6. (Examples of non-Archimedean structures)
• (Sets) Say that set $U$ is significantly smaller than set $V$ if there is an injective function from $U$ into $V$ but there is no bijective function from $U$ onto $V$. It is easy to show there are sets $U_0, U_1, \ldots$, with each $U_i$ significantly smaller than $U_{i+1}$, and each $U_i$ significantly smaller than $U_\infty = \bigcup_{i=0}^{\infty} U_i$. Thus, sets are non-Archimedean. In the field of set theory, mathematicians measure the size of sets using Georg Cantor's famous non-Archimedean number system, the cardinal numbers.
• (Logical theories) It is not difficult to come up with (for example) true theories $T_0, T_1, \ldots$ (in the language of arithmetic) such that each $T_{i+1}$ proves the consistency of $T_i$, and an additional true theory $T_\infty$ (in the language of arithmetic) which proves the consistency of $\bigcup_{i=0}^{\infty} T_i$.
In a sense, then, each $T_i$ is significantly weaker than $T_{i+1}$ (see Gödel's incompleteness theorems), and each $T_i$ is significantly weaker than $T_\infty$. In this sense, logical theories are non-Archimedean. In the field of proof theory [22] [23], logicians measure the logical strength of theories using computable ordinal numbers, another non-Archimedean number system.
• (Asymptotic runtime complexities) Suppose $A_0, A_1, \ldots$ are algorithms such that each $A_i$ has runtime complexity $\Theta(n^i)$, and suppose $A_\infty$ is an algorithm with runtime complexity $\Theta(2^n)$. Then in a certain sense, each $A_i$ has significantly lower asymptotic runtime complexity than $A_{i+1}$, and each $A_i$ has significantly lower asymptotic runtime complexity than $A_\infty$. In this sense, asymptotic runtime complexity is non-Archimedean. In computer science, these runtime complexities are usually measured using big-O, big-Θ, or similar notation systems.

Example 7. (Speculative examples of potentially non-Archimedean structures) Certain structures might plausibly be non-Archimedean, but it is a difficult question to say whether they truly are or not. The reader could come up with such examples in great abundance.
• (Musical beauty) Assuming there is such a thing as objective musical beauty (not contingent on features of the human condition, etc.), then it is plausible that music might be non-Archimedean, in the following sense: there might be songs $S_0, S_1, \ldots$ such that each $S_i$ is significantly less beautiful than $S_{i+1}$, and another song $S_\infty$ such that each $S_i$ is significantly less beautiful than $S_\infty$.
• (Ethical utility) Early utilitarian Jeremy Bentham suggested a hedonistic calculus in which pleasure measurements would be assigned to actions, to help adjudicate ethical dilemmas. His successor, John Stuart Mill, objected that some actions are incomparably better than others: "If one of [two pleasures] is, by those who are competently acquainted with both, placed so far above the other that they prefer it ...
and would not resign it for any quantity of the other pleasure which their nature is capable of, we are justified in ascribing to the preferred enjoyment a superiority in quality, so far outweighing quantity as to render it, in comparison, of small account" [18]. This suggests that Bentham's pleasures are non-Archimedean.
• (AGI) It is plausible that there are³ AGIs $A_0, A_1, \ldots$ such that each $A_i$ is significantly less intelligent than $A_{i+1}$, and another AGI $A_\infty$ such that each $A_i$ is significantly less intelligent than $A_\infty$. We first pointed this out in [3], where we propose measuring the intelligence of mechanical knowing agents using computable ordinals, the same non-Archimedean number system which proof theorists use to measure logical strength of mathematical theories. Incidentally, if AGI intelligence is non-Archimedean, then Proposition 4 shows it is impossible to measure machine intelligence using real numbers without some of those measurements being misleading⁴.
• (Nonstandard cosmologies) Some authors [1] [4] [24] [27] [9] have even speculated about the nature of non-Archimedean space and/or time.

3 Reinforcement learning

In reinforcement learning (RL), an agent interacts with an environment, taking actions from a fixed set of possible actions. With every action the agent takes, the environment responds with a new observation and with a reward. In traditional RL, these rewards are real numbers (many authors further constrain them to be rational numbers).

By restricting rewards to be real (or rational) numbers, we unconsciously constrain RL to be applicable only to tasks of an inherently Archimedean nature. For example, Wirth et al. point out [33] that in tasks related to cancer treatment [34], "the death of a patient should be avoided at any cost. However, an infinitely negative reward breaks classic reinforcement learning algorithms and arbitrary, finite values have to be selected."
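One way to see the kind of difficulty the quoted passage describes is the following Python sketch. It assumes a standard incremental value update, $Q \leftarrow Q + \alpha (r - Q)$, together with IEEE floating-point infinities. Both the update rule and the function names are illustrative assumptions, since the quoted passage does not say which algorithms break or how.

```python
import math


def incremental_update(q, reward, alpha=0.1):
    """One step of the running-average update Q <- Q + alpha * (reward - Q)."""
    return q + alpha * (reward - q)


def run(rewards, q=0.0):
    """Apply the update to each reward in turn and return the final estimate."""
    for reward in rewards:
        q = incremental_update(q, reward)
    return q


# Hypothetical reward streams: a fatal outcome followed by an ordinary reward.
# With a literal infinity, the second step computes -inf + inf.
with_infinity = run([float("-inf"), 1.0])
# With an arbitrary finite stand-in, the estimate stays an ordinary number.
with_stand_in = run([-1_000_000.0, 1.0])

if __name__ == "__main__":
    print(math.isnan(with_infinity), math.isfinite(with_stand_in))
```

In this sketch the literal infinity leaves the estimate undefined (not-a-number) after one further update, whereas the finite stand-in keeps the arithmetic well-defined at the cost of an arbitrary choice of magnitude.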
This problem could be avoided if instead of real numbers, rewards were drawn from a suitable non-Archimedean number system containing negative infinities. Doing so would be a departure from traditional RL.

³ As hinted by Protagoras, assuming Protagoras's own intelligence stays constant and remains higher than the intelligence of his student, and that they live forever, and that "better" means "significantly better": "The very day you start, you will go home a better man, and the same thing will happen the day after. Every day, day after day, you will get better and better." [21]

⁴ This would solve an open problem implicitly stated by Legg and Hutter [16] when they said of their real-number universal intelligence measure: "...none of these people have been able to communicate why the work [on measuring universal intelligence using real numbers] is so obviously flawed in any concrete way ... If anyone would like to properly explain their position to us in the future, we promise not to chase you down the street!"

Example 8. To give an intuitive example, assume that musical beauty is non-Archimedean, as in Example 7. We can imagine environments where the RL agent is tasked with composing songs. For example, the possible actions the agent is allowed to take might include one action for each piano key, plus an additional "stand and bow" action to signal that a song is finished. Whenever the agent stands and bows, the agent is rewarded with applause based on the beauty of the song the agent composed⁵. Assuming musical beauty is non-Archimedean, such an environment falls outside the possibility of traditional RL. By Proposition 4, there is no way to assign real number rewards to songs without misleading the agent. If $S_0, S_1, S_2, \ldots$
are songs where each $S_i$ is significantly less beautiful than $S_{i+1}$, and all the $S_i$ are significantly less beautiful than another song $S_\infty$, then there is no way to assign real-valued rewards to these songs such that each $S_i$ gets significantly less reward than $S_{i+1}$ and significantly less reward than $S_\infty$. Or, to re-use the cancer example, assume there are certain bad procedures the robotic surgeon could take, each one significantly worse than the previous, but all still significantly better than killing the patient. There is no way to assign real-valued rewards to these actions, and to killing the patient, in such a way that each bad action is punished significantly more harshly than the previous one, yet still significantly less harshly than killing the patient.

The reader might object by challenging the non-Archimedeanness of music and of medical procedures. But we only used those to make the examples more intuitive. If the reader insists, we can resort to mathematical tasks.

Example 9. Imagine that the agent is tasked with typing up mathematical theories, and when the agent stands and bows, the agent is rewarded with applause based on the proof-theoretical strength of the theory (or hit with tomatoes if the theory is inconsistent). In Example 6 we noted that proof-theoretical strength of theories is non-Archimedean. There exist theories $T_0, T_1, \ldots$, each significantly proof-theoretically weaker than the next, and another theory $T_\infty$, significantly proof-theoretically stronger than all the $T_i$'s. We cannot possibly assign real-valued rewards to these theories without misleading the agent.

The reader might object to Example 9 on the grounds that judging the proof-theoretical strength of a theory is inherently non-computable anyway. The example could be modified so that instead of typing up mathematical theories, the agent has to type up mathematical subtheories in (say) the language of Peano arithmetic, accompanied by consistency proofs in (say) ZFC.
It can be shown that the proof-theoretical strength of mathematical theories is still non-Archimedean, even when restricted to subtheories of arithmetic whose consistency can be proven in ZFC⁶.

⁵ To quote Wang and Hammer: "Decision makings often do not happen at the level of basic operations, but at the level of composed actions, where there are usually infinite possibilities." [32]

⁶ For example, let $T_0$ be the theory of Peano arithmetic, and for each $i$, let $T_{i+1}$ be $T_i$ together with $\mathrm{CON}(T_i)$, a canonical axiom encoding the consistency of $T_i$. Let $T_\infty$ be the theory of Peano arithmetic along with $\mathrm{CON}(\bigcup_i T_i)$. ZFC is certainly adequate to prove the consistency of each of these theories. In the sense of Example 6, each $T_i$ is significantly weaker than $T_{i+1}$ and significantly weaker than $T_\infty$.

The reader might object that the above theories-with-proofs example is contrived. But an AGI with human or better intelligence should have no problem at least comprehending and attempting such a task (regardless of whether or not the AGI is able to perform well at it). When we prove that the Halting Problem is unsolvable, we do so by considering contrived programs that we could write if the Halting Problem were solvable. The contrivedness of those programs does not invalidate the proof of the unsolvability of the Halting Problem. Again, when we prove that C++ templates are Turing complete [31], we do so by considering extremely bizarre C++ templates that would never arise naturally in a software studio. This does not invalidate the proof that C++ templates are Turing complete.

Finally, the reader might object that approximating infinite rewards with arbitrarily large finite rewards is good enough. Who cares (the argument might go) whether pushing a button gives the agent infinite pleasure or only a million units of pleasure? Either way (the argument goes) the agent is going to learn to prefer that button over a button that gives only 0.1 units of pleasure.
The following example shows that this logic breaks down in non-Markov environments.

Example 10. (Delayed gratification) Consider an environment with a red button and a blue button. Pushing the red button always grants +1 reward. As for the blue button, suppose the agent presses the blue button for the $i$th time. If $i = 2^j$ for some integer $j$, then the agent shall receive a reward of $\omega$ (the smallest infinite ordinal), but otherwise, the agent shall receive 0 reward. If we approximate $\omega$ with a real value of, say, 1,000,000, then after a long enough time spent in the environment, an AGI will be misled into thinking that it isn't worth the longer and longer wait-times between blue-button rewards: eventually, it will take more than 1,000,000 blue-button presses to get rewarded, and the AGI will consider it more worthwhile to get the guaranteed +1 reward from the red button.

Our critic could respond to Example 10 by making the approximation dynamic, say, making the $2^j$th press of the blue button grant $1000000 \cdot 2^j$ reward, but at this point, the critic is clearly just hard-coding the correct actions into the reward function, something which is only possible in Example 10 because the environment is simple enough that we can completely understand it ourselves. For the kinds of non-trivial environments where AGI would actually be useful, such carefully engineered reward approximations would quickly become intractable.

Reinforcement learning is useful for many practical tasks, but at least in its traditional flavor, it is too constrained (by its arbitrary choice of number system for its rewards) to apply to certain non-Archimedean tasks⁷, which, however contrived they are, could certainly be attempted by an AGI. Traditional reinforcement learning will not lead to AGI.

⁷ Perhaps explaining why "despite almost two decades of RL research, there has been little solid evidence of RL systems that may one day lead to [AGI]" [17].
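The arithmetic behind Example 10 can be sketched in a few lines of Python. The sketch compares, for the stretch of presses between the $2^j$th and the $2^{j+1}$th blue-button press, the payoff of staying on the blue button (one approximated $\omega$ reward) against spending those same $2^j$ presses on the red button. The function names and the stretch-by-stretch comparison are assumptions made for illustration; they are one simple way to model how an agent might weigh the two buttons, not a claim about any particular RL algorithm.

```python
def stretch_payoffs(j, omega_approx):
    """Payoffs over the presses between the 2^j-th and 2^(j+1)-th blue press.

    That stretch is 2^j presses long. Spending it on the blue button earns
    one (approximated) omega reward, granted on the final press; spending it
    on the red button earns +1 per press.
    """
    presses = 2 ** j
    blue = omega_approx
    red = presses
    return blue, red


def first_misleading_stretch(omega_approx):
    """Smallest j at which the red button looks strictly better than blue."""
    j = 0
    while True:
        blue, red = stretch_payoffs(j, omega_approx)
        if red > blue:
            return j
        j += 1


if __name__ == "__main__":
    print(first_misleading_stretch(1_000_000))
    print(first_misleading_stretch(10 ** 9))
```

With the stand-in value 1,000,000 the comparison first flips at $j = 20$, since $2^{20} = 1{,}048{,}576$. Raising the stand-in to $10^9$ only postpones the flip to $j = 30$, so no fixed finite approximation of $\omega$ avoids it.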
4 Non-traditional reinforcement learning

We have argued that traditional RL cannot lead to AGI, because an AGI is capable of attempting non-Archimedean tasks whose rewards are too rich to express using real numbers. There are at least two potential ways to change RL so as to make it applicable to such tasks and, thus, at least potentially capable of leading to AGI. Of course, there is no guarantee that removing the roadblock in this paper will cause RL to lead to AGI. There might be other roadblocks besides the inadequate reward number system⁸.

4.1 Preference-based reinforcement learning

A lot of exciting research has been done on non-traditional variations of RL where, instead of giving the agent numerical rewards for taking actions, one instead informs the agent about the relative preference of various actions or action-sequences. See [33] for a survey. This very nicely side-steps the problems from this paper.

4.2 Reinforcement learning with other number systems

The most obvious way to modify RL to avoid the problems presented in this paper is to change which number system is used⁹. As far as this author is aware, the choice to use real (or rational) numbers for rewards was not made based on any fundamental criteria¹⁰. The real (or rational) numbers are currently a useful pragmatic choice because they are easy to compute with using 21st-century software and 21st-century school curricula, but that's hardly relevant in the field of genuine AGI. One might say the real numbers were a good choice because they are familiar, but even that is arguable: in general, students are usually not taught what the real numbers actually are, unless they major in pure mathematics at the university level. Anyway, the familiarity argument is totally irrelevant in the field of AGI.

Various non-Archimedean number systems exist. Number systems can be discrete or continuous; the nature of reinforcement learning clearly suggests a continuous number system.
We will consider three possible number systems: formal Laurent series; hyperreal numbers; and surreal numbers.

⁸ For example, many RL authors consider non-deterministic environments where rewards and observations include an element of randomness. The probabilities involved are, traditionally, assumed to be real numbers. Perhaps some recent work [7] on non-Archimedean probability could be relevant against that roadblock.

⁹ Anticipated by [25].

¹⁰ Niederée points out [20] that there are no deeper reasons to assume that the number system should necessarily even have the same cardinality as $\mathbb{R}$. And Rizza says [25]: "No particular feature of the space of informational states suggests that such a codomain [as $\mathbb{R}$] should be selected".

4.2.1 Formal Laurent series

David Tall described [30] the following real-number-extending number system (which he called the "superreals", but that vocabulary does not seem to have caught on).

Definition 3. A formal Laurent series is a formal expression of the following form (where $m$ can be any integer and $a_{-m}, a_{-(m-1)}, \ldots, a_0, a_1, a_2, \ldots$ are real numbers, $a_{-m} \neq 0$):

$$\sum_{i=-m}^{\infty} a_i \varepsilon^i.$$

Suppose $A = \sum_{i=-m}^{\infty} a_i \varepsilon^i$ and $B = \sum_{i=-n}^{\infty} b_i \varepsilon^i$ are two distinct formal Laurent series. We declare $A < B$ if and only if $a_i < b_i$ where $i$ is the smallest index such that $a_i \neq b_i$ (where we consider $a_i$ to be 0 for all $i < -m$ and we consider $b_i$ to be 0 for all $i < -n$).

We can consider the real numbers $\mathbb{R}$ to be embedded in the formal Laurent series by way of the embedding $r \mapsto r\varepsilon^0$. Having done so, the intuition is that, for example, $1\varepsilon^1$ is what we might call a "first-order infinitesimal number", smaller than every positive real; $1\varepsilon^2$ is what we might call a "second-order infinitesimal number", smaller than every positive first-order infinitesimal number; and so on.
Likewise, $1\varepsilon^{-1}$ is what we might call a "first-order infinite number", bigger than every real; $1\varepsilon^{-2}$ is what we might call a "second-order infinite number", bigger than every first-order infinite number; and so on. Thus, the formal Laurent series are quite adequate to address the specific problem described by Wirth et al. [33] in which an infinite negative reward is required when the RL agent kills the cancer patient.

There are natural ways to define arithmetic on formal Laurent series, but we will avoid those details here. The advantage of the formal Laurent series number system is that it is relatively concrete, compared to the more abstract hyperreal or surreal numbers discussed below.

Example 11. (Examples of formal Laurent series comparisons)
1. Consider $A = 5\varepsilon^{-1} - 2\varepsilon^0 + 3\varepsilon^1 + 4\varepsilon^2$ and $B = 5\varepsilon^{-1} - 2\varepsilon^0 + 1\varepsilon^1 + 4\varepsilon^2 + 5\varepsilon^6$. The $\varepsilon^{-1}$- and $\varepsilon^0$-coefficients of $A$ and $B$ are equal, so we compare their $\varepsilon^1$-coefficients. $A$ has an $\varepsilon^1$-coefficient of 3 and $B$ has an $\varepsilon^1$-coefficient of 1, and $3 > 1$, so $A > B$.
2. Consider $A = 999999\varepsilon^5$ and $B = 0.00001\varepsilon^4$. We consider $A$ to have an $\varepsilon^4$-coefficient of 0, which is smaller than $B$'s $\varepsilon^4$-coefficient (0.00001), so $A < B$.

There is a natural way to consider formal Laurent series as a significantly-ordered structure, generalizing the notion of "significantly greater than" from Lemma 2.

Definition 4.
1. For every Laurent series $A = \sum_{i=-m}^{\infty} a_i \varepsilon^i$, let $o(A) = -m$ (call this the order of $A$), and let $\mathrm{LC}(A)$ be the $\varepsilon^{o(A)}$-coefficient of $A$ (call this the leading coefficient of $A$).
2. Let $r > 0$ be any positive real number. For any formal Laurent series $A = \sum_{i=-m}^{\infty} a_i \varepsilon^i$ and $B = \sum_{i=-n}^{\infty} b_i \varepsilon^i$, we say $A \ll_r B$ if one of the following conditions holds:
• $o(A) > o(B)$ and $\mathrm{LC}(B) > 0$; or
• $o(A) < o(B)$ and $\mathrm{LC}(A) < 0$; or
• $o(A) = o(B)$ and $\mathrm{LC}(A) \le \mathrm{LC}(B) - r$.

Lemma 12. For any real $r > 0$, the formal Laurent series, considered as a significantly-ordered structure according to $\ll_r$, are non-Archimedean.

Proof. Let $x_0 = 0\varepsilon^1$, $x_1 = r\varepsilon^1$, $x_2 = 2r\varepsilon^1$, and in general let $x_i = ir\varepsilon^1$.
Let $x_\infty = 1\varepsilon^0$. Then each $x_i \ll_r x_{i+1}$, and yet each $x_i \ll_r x_\infty$.

Unfortunately, although the formal Laurent series contain infinities and infinitesimals, they still do not contain "enough" infinities and infinitesimals to accommodate genuine AGI. To make this formal, we introduce a weaker notion of Archimedeanness.

Definition 5. Suppose $(X, \ll)$ is a significantly-ordered structure. We define a new order $\ll'$ on $X$ as follows. For any $x, y \in X$, we declare $x \ll' y$ if and only if there is a sequence $x_0, x_1, \ldots$ such that the following conditions hold:
1. $x_0 = x$.
2. Each $x_i \ll x_{i+1}$.
3. Each $x_i \ll y$.
We say $(X, \ll)$ is 2-Archimedean if $(X, \ll')$ is Archimedean.

Proposition 13. For any real $r > 0$, the formal Laurent series are 2-Archimedean when considered as a significantly-ordered structure as in Definition 4.

We omit the proof of Proposition 13 because it is somewhat tedious and we do not need it anywhere later in the paper. It merely serves to explain why formal Laurent series are still not good enough to avoid the AGI roadblock from this paper. Just as we argued that a genuine AGI should be capable of engaging in tasks that involve inherently non-Archimedean rewards, by similar reasoning, an AGI should be capable of engaging in tasks that involve inherently non-2-Archimedean rewards. It is not hard to show that all the structures in Example 6 are non-2-Archimedean. Thus, replacing real number rewards by formal Laurent series rewards is not enough to remove the roadblock, but it would at least expand the types of rewards which are possible.

4.2.2 Hyperreal numbers

The field of mathematics where the calculus is formalized with infinite and infinitesimal quantities is called nonstandard analysis [26]. The numbers most commonly associated with this field are the so-called hyperreal numbers.
The hyperreal numbers can be introduced axiomatically or by means of a semi-constructive method which depends on usage of a certain black box, a device known as a free ultrafilter. Logicians have proven that free ultrafilters exist but that, unfortunately, it is impossible to concretely exhibit one. This severely limits (if not completely ruins) the practical usefulness of reinforcement learning with hyperreal rewards. Nevertheless, the hyperreals might be useful for proving abstract structural properties about AGI¹¹.

It can be shown that the hyperreals are not 2-Archimedean, and indeed, not $\alpha$-Archimedean for any ordinal $\alpha$, where "$\alpha$-Archimedean" refers to a certain natural weakening of 2-Archimedeanness. Thus, for the purpose of proving abstract theorems about RL agents with generalized rewards, the hyperreals would be more appropriate than the formal Laurent series.

4.2.3 Surreal numbers

All of the well-known non-Archimedean extensions of $\mathbb{R}$ (including formal Laurent series and hyperreals) are subsystems of the so-called surreal numbers [10] [15] [11]. The surreal numbers were initially discovered during John Conway's attempts to study two-player combinatorial games like Go and Chess, so it would not be surprising if they turn out to be important in the eventual development of AGI.

Unlike the hyperreals, the construction of the surreal numbers does not depend on any non-constructive black boxes such as free ultrafilters. They are constructed as the union of a hierarchy $S_\alpha$ of subsystems where $\alpha$ ranges over the ordinal numbers. Assuming that agents with AGI are implemented using computers with no additional power beyond the Church-Turing Thesis, then for the purposes of AGI, it would be appropriate to restrict our attention to some computable subset of the surreal numbers, which would presumably be the union of some hierarchy $C_\alpha$ where $\alpha$ ranges over the computable ordinal numbers.
For any particular level C_α in this hierarchy, we could consider the sub-universe E_α of surreal-reward RL environments with rewards restricted to C_α. Assuming AGI agents are Turing computable, no individual AGI can possibly comprehend codes for all computable ordinals, because the set of codes of computable ordinals is badly non-computably-enumerable. This is profound, because it seems to suggest that any particular AGI can only comprehend RL environments in E_α if that AGI can comprehend α. In other words, for any particular RL environment e with computable surreal number rewards, there must be some minimal computable ordinal α such that e belongs to E_α, that is, such that all of e's rewards lie in C_α; if an AGI is not intelligent enough to comprehend α, then it seems like there should be no way for the AGI to comprehend e either¹². We would submit this state of affairs as strong evidence in favor of our thesis [3] that a machine's intelligence ought to be measured in terms of the computable ordinals which the machine comprehends.

¹¹ Similar to the way we use free ultrafilters in [2] to obtain comparators of the utility-maximizing ability of traditional deterministic RL agents, and prove structural properties about said comparators. In fact, in that paper, we essentially independently re-invented the free ultrafilter construction of the hyperreals, without realizing it at the time!

The above paragraph points at a possible joint path toward AGI incorporating both machine learning and symbolic logic, perhaps a much-needed reconciliation of these two approaches.

4.3 Alternate number systems: tentative verdict

For many simple environments not too far outside of traditional RL, formal Laurent series could probably serve as a fairly practical number system. But formal Laurent series have limitations which suggest that RL with formal Laurent series rewards will not be enough to reach AGI, for the exact same reason that RL with real number rewards will not be enough.
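As a rough indication of why formal Laurent series might be practical, the following Python sketch represents finite-support Laurent series in an infinitesimal ε with exact rational coefficients and compares them by the sign of the lowest-exponent coefficient. Everything named here is an assumption made for illustration: the class `Laurent`, the helpers `real` and `significantly_less`, and in particular the relation "x ≤ y − r", which is borrowed from Lemma 2 and may differ from the relation used in Definition 4. The infinite element ε^(-1) used below is likewise our own choice of witness.

```python
from fractions import Fraction


class Laurent:
    """Finite-support formal Laurent series: the sum of coeffs[k] * eps**k.

    eps is treated as a positive infinitesimal, so eps**-1 is infinite.
    """

    def __init__(self, coeffs=None):
        # Drop zero coefficients so the leading term is well defined.
        self.coeffs = {k: Fraction(c)
                       for k, c in (coeffs or {}).items() if c != 0}

    def __sub__(self, other):
        out = dict(self.coeffs)
        for k, c in other.coeffs.items():
            out[k] = out.get(k, Fraction(0)) - c
        return Laurent(out)

    def sign(self):
        """Sign of the series: the sign of its lowest-exponent coefficient."""
        if not self.coeffs:
            return 0
        lead = self.coeffs[min(self.coeffs)]
        return 1 if lead > 0 else -1

    def __le__(self, other):
        return (other - self).sign() >= 0


def real(c):
    """Embed an ordinary rational number as a constant series."""
    return Laurent({0: c})


def significantly_less(x, y, r=1):
    """Assumed relation, modeled on Lemma 2: x <= y - r."""
    return x <= y - real(r)


# An infinite reward: eps**-1 exceeds every ordinary real reward.
x_inf = Laurent({-1: 1})

# The sequence 0, 1, 2, ... climbs in significant steps yet never
# gets within r of x_inf, which is the non-Archimedean behavior.
climbing = all(significantly_less(real(i), real(i + 1)) for i in range(1000))
bounded = all(significantly_less(real(i), x_inf) for i in range(1000))
print(climbing, bounded)
```

Comparison reduces to inspecting a single leading coefficient, which is why arithmetic of this kind stays cheap compared with the symbolic machinery that richer number systems require.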
Because of their dependence on free ultrafilters, the hyperreal numbers will probably never be of practical use as RL rewards, but it could conceivably be possible to use them to prove abstract structural results about AGI from a bird's-eye view.

The surreal numbers (or a computable subset thereof) seem like the most promising candidate for RL rewards that could plausibly lead to AGI. We would certainly hesitate to call them "practical", though. To work with any but the most trivial of surreal numbers, one would need to implement sophisticated symbolic-logical machinery, and that's just to get one's foot in the door. This does, however, offer a ray of hope, in the sense that applying deep learning techniques to surreal numbers would be a way to combine both symbolic logic and statistical methods into a joint approach.

5 Conclusion

In traditional reinforcement learning, utility-maximizing agents interact with environments, receiving real (or rational) number rewards in response to actions, and using those rewards to update their behavior. We have argued that the decision to limit rewards to real numbers is inappropriate in the context of AGI, because the real numbers have the Archimedean property, which makes it impossible to use them to accurately portray the value of actions when a task involves inherently non-Archimedean rewards. Thus, we argue, traditional RL cannot possibly lead to AGI, because a genuine AGI should have no trouble comprehending and at least attempting tasks that inherently involve non-Archimedean rewards. We suggested two possible ways to modify traditional reinforcement learning to fix this bug: switch to preference-based reinforcement learning, or else generalize reinforcement learning to allow rewards from a non-Archimedean number system.

¹² This situation is highly reminiscent of [14].
Acknowledgments

We gratefully acknowledge Bryan Dawson, Mikhail Katz, Brendon Miller-Boldt, Stewart Shapiro, and the SEC's Quantitative Analytics Unit's machine learning seminar for comments and feedback.

References

[1] Haidar Al-Dhalimy and Charles J Geyer. Surreal time and ultratasks. The Review of Symbolic Logic, 9(4):836–847, 2016.
[2] Samuel Allen Alexander. Intelligence via ultrafilters: structural properties of some intelligence comparators of deterministic Legg-Hutter agents. Journal of Artificial General Intelligence, 10:24–45, 2019.
[3] Samuel Allen Alexander. Measuring the intelligence of an idealized mechanical knowing agent. In Cognition, Interdisciplinary Foundations, Models, and Applications (CIFMA). Springer, 2019.
[4] Hajnal Andréka, Judit X Madarász, István Németi, and Gergely Székely. A logic road from special relativity to general relativity. Synthese, 186(3):633–649, 2012.
[5] Archimedes. On the sphere and cylinder. In Thomas Heath, editor, The works of Archimedes. Cambridge University Press, 1897.
[6] Jacques Bair, Piotr Błaszczyk, Robert Ely, Valérie Henry, Vladimir Kanovei, Karin U Katz, Mikhail G Katz, Semen S Kutateladze, Thomas McGaffey, David M Schaps, David Sherry, and Steven Shnider. Is mathematical history written by the victors? Notices of the American Mathematical Society, 60(7):886–904, 2013.
[7] Vieri Benci, Leon Horsten, and Sylvia Wenmackers. Non-Archimedean probability. Milan Journal of Mathematics, 81(1):121–151, 2013.
[8] Diego Blum and Heinz Holling. Spearman's law of diminishing returns. A meta-analysis. Intelligence, 65:60–66, 2017.
[9] Lu Chen. Infinitesimal gunk. Journal of Philosophical Logic, forthcoming.
[10] John H Conway. On Numbers and Games. CRC Press, 2nd edition, 2000.
[11] Philip Ehrlich. The absolute arithmetic continuum and the unification of all numbers great and small. Bulletin of Symbolic Logic, 18:1–45, 2012.
[12] Euclid. Book V: Theory of proportion. In John Casey, editor, First Six Books of the Elements of Euclid. Project Gutenberg, 2007.
[13] José Hernández-Orallo. AI generality and Spearman's law of diminishing returns. Journal of Artificial Intelligence Research, 64:529–562, 2019.
[14] Bill Hibbard. Measuring agent intelligence via hierarchies of environments. In International Conference on Artificial General Intelligence, pages 303–308. Springer, 2011.
[15] Donald Ervin Knuth. Surreal numbers: a mathematical novelette. Addison-Wesley, 1974.
[16] Shane Legg and Marcus Hutter. Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4):391–444, 2007.
[17] Scott Livingston, Jamie Garvey, and Itamar Elhanany. On the broad implications of reinforcement learning based AGI. In International Conference on Artificial General Intelligence, pages 478–482, 2008.
[18] John Stuart Mill. Utilitarianism. In Seven masterpieces of philosophy, pages 337–383. Routledge, 2016.
[19] Louis Narens. Measurement without Archimedean axioms. Philosophy of Science, 41(4):374–393, 1974.
[20] Reinhard Niederée. What do numbers measure?: A new approach to fundamental measurement. Mathematical Social Sciences, 24(2-3):237–276, 1992.
[21] Plato. Protagoras. In John M Cooper, Douglas S Hutchinson, et al., editors, Plato: complete works. Hackett Publishing, 1997.
[22] Wolfram Pohlers. Proof theory: The first step into impredicativity. Springer, 2008.
[23] Michael Rathjen. The art of ordinal analysis. In Proceedings of the International Congress of Mathematicians, volume 2, pages 45–69, 2006.
[24] Patrick F Reeder. Infinitesimals for Metaphysics: Consequences for the Ontologies of Space and Time. PhD thesis, The Ohio State University, 2012.
[25] Davide Rizza. Divergent mathematical treatments in utility theory. Erkenntnis, 81(6):1287–1303, 2016.
[26] Abraham Robinson. Non-standard analysis. Princeton University Press, 1974.
[27] Elemer E Rosinger. Cosmic contact: To be, or not to be Archimedean? arXiv preprint physics/0702206, 2007.
[28] Heinz J Skala. Non-Archimedean utility theory. D. Reidel Publishing, 1975.
[29] Charles Spearman. The abilities of man. Macmillan, 1927.
[30] David Tall. Looking at graphs through infinitesimal microscopes, windows and telescopes. The Mathematical Gazette, 64:22–49, 1980.
[31] Todd L. Veldhuizen. C++ templates are Turing complete. Technical report, Indiana University, 2003.
[32] Pei Wang and Patrick Hammer. Assumptions of decision-making models in AGI. In International Conference on Artificial General Intelligence, pages 197–207. Springer, 2015.
[33] Christian Wirth, Riad Akrour, Gerhard Neumann, and Johannes Fürnkranz. A survey of preference-based reinforcement learning methods. The Journal of Machine Learning Research, 18(1):4945–4990, 2017.
[34] Yufan Zhao, Michael R Kosorok, and Donglin Zeng. Reinforcement learning design for cancer clinical trials. Statistics in Medicine, 28(26):3294–3315, 2009.