Automated Theorem Proving and Its Prospects Review of Automated Development of Fundamental Mathematical Theories by Art Quaife Desmond Fearnley-Sander Department of Mathematics University of Tasmania HOBART TAS 7001 AUSTRALIA dfs@hilbert.maths.utas.edu.au Copyright (c) Desmond Fearnley-Sander 1996 Received: December 28, 1995; Accepted: March 17, 1996; Published: June 19, 1996 PSYCHE, 2(28), June 1996 http://psyche.cs.monash.edu.au/v2/psyche-2-28-fearnley_sande.html KEYWORDS: automated theorem proving, OTTER, Quaife, resolution, understanding. REVIEW OF: Automated Development of Fundamental Mathematical Theories by Art Quaife (1992). Kluwer Academic Publishers. 271pp. $US123 hbk. ISBN: 0-7923-2021-2. 1. Introduction In his recent books one of the main pre-occupations of Roger Penrose has been to show (to prove!) that computers are intrinsically limited, compared to humans, when it comes to the doing of mathematics. Even those who think that such things can be proved may be interested in the empirical question: what mathematics can computers do? Art Quaife's book addresses the aspect of doing mathematics that is reported in mathematical journals: proof of theorems. It shows us the extent to which this high exercise of rationality can currently be automated, and the extent to which it cannot. To me this book was fascinating. After reading it, I found myself turning over in my mind the question of the role played in theorem proving by understanding, a quality that Penrose regards as a prime desideratum of consciousness. I'll come back to that at the end of this review. Using the theorem prover OTTER Art Quaife has proved four hundred theorems of von Neumann-Bernays-Gödel set theory; twelve hundred theorems and definitions of elementary number theory; dozens of Euclidean geometry theorems; and Gödel's incompleteness theorems. It is an impressive achievement. To gauge its significance and to see what prospects it offers one must look closely at the book and the proofs it presents. But first, some basics. 2. Resolution theorem proving Mainstream automated theorem proving, to which Quaife's work belongs, is based on the method of resolution. To get an idea of it, three fundamental notions are needed: sentence, theorem and proof. Classical logic provides us with a notion of sentence that is completely formal. A person or computer can mechanically check whether or not a given string of words is a (grammatically correct) sentence. A typical sentence like: for_all x, for_all y, x*y=y*x may admit interpretation in various mathematical domains; this particular one happens to be true for multiplication of real numbers but not for multiplication of 2-by-2 matrices. The sentence for_all x, there_exists y, x*y=1 is true if * is interpreted as addition of real numbers (and 1 as the number we call 'one', or even as the real number we call 'zero'), but it is not true if * is interpreted as multiplication of real numbers and (and 1 as the number we call 'one'). A theorem has a bunch of sentences called its hypotheses, and a single sentence called its conclusion; what makes it a theorem is the existence of a proof that the conclusion follows from the hypotheses. A proof is a guarantee that under any interpretation in which the hypotheses are all true the conclusion must also true. While there are many methods of proof, resolution has the advantage of being more susceptible than most to machine implementation. Resolution is a method of proof by contradiction: to show the conclusion C follows from the hypotheses H1, H2, ... , Hn we say "suppose on the contrary that (in some interpretation) the hypotheses are true but the conclusion is not" and proceed to show, if we can, that the set of sentences {H1, H2, ... , Hn, not C} is contradictory. Our proof will succeed if we can find a sentence S such that both S and not S follow from the given sentences. What is required, then, to prove a theorem is to show that an associated set of sentences is unsatisfiable. In fact one may always construct the associated set so that its sentences have a particularly simple form: each sentence is a disjunction of literals -a literal being an atomic sentence or the negation of an atomic sentence -that is universally quantified over all its variables. Sentences of this simple form are called clauses; for brevity, it is normal in writing clauses, to omit the universal quantifier and to replace each or (connecting the literals) by a comma. For example the clause {not is_a_man(x), is_mortal(x)} is a formal abbreviation for the sentence "for every x, either x is not a man or x is mortal"; which is more familiar, perhaps, in the logically equivalent form "all men are mortal." Let us prove the famous theorem that has this and the clause {is_a_man(Socrates)} as its hypotheses and {is_mortal(Socrates)} as its conclusion. The method of resolution requires that a contradiction be derived from the set of three clauses: {{not is_a_man(x), is_mortal(x)}, {is_a_man(Socrates)}, {not is_mortal(Socrates)}} Using the first and the second of the clauses we may derive the new clause {is_mortal(Socrates)}. This is a very simple instance of what is called a resolution step; we will not go into detail here, but all that is involved is careful, elaborate, systematic matching and generation of patterns of symbols. Adding the new clause to our set, we now have: {{not is_a_man(x), is_mortal(x)}, {is_a_man(Socrates)}, {not is_mortal(Socrates)}, {is_mortal(Socrates)}} and there, staring us in the face, is the contradiction we need. A resolution prover would perform a second resolution step, by convention deriving the null clause {} from the final two clauses above, and ending with {{not is_a_man(x), is_mortal(x)}, {is_a_man(Socrates)}, {}} It is the presence of the null clause that signals success -a contradiction has been derived. This may seem a mindless way to prove the quintessential example of obviousness, but it's the way that the method of resolution works. One keeps adding new clauses to those already generated, sometimes (as in the second step of our example) deleting old ones. Resolution is the main way of generating new clauses but another kind of step, called factoring is also needed. Any other sound way of deriving new clauses may be used. If we succeed in deriving the empty clause we are done: the theorem is proved. What makes this worth trying is the fact that for any unsatisfiable set of clauses there exists a finite sequence of resolution and factoring steps that leads to a contradiction. We only need to find it. Quaife's book actually uses a more perspicuous Prolog-like notation in which A, B -> C, D stands for the clause {not A, not B, C, D}, and displays resolution proofs in familiar linear fashion, presenting each new clause that is generated with a brief explanation of its origin. For example, the argument above would appear in the form: 1 is_a_man(x) -> is_mortal(x) 2 -> is_a_man(Socrates) 3 is_mortal(Socrates) -> 4 -> is_mortal(Socrates) [resolution, 1, 2] 5 -> [resolution, 3, 4] 3. OTTER OTTER was developed at Argonne National Laboratory by William McCune (1990). It is based on work of Larry Wos, Ross Overbeek, Ewing Lusk and others (Lusk and Overbeek 1982, Wos et al. 1965, Wos et al. 1967, Boyer et al. 1986) that goes back over more than two decades and stems ultimately from the resolution algorithm of Alan Robinson (1963). OTTER is a state-of-the-art theorem prover incorporating many refinements of the basic resolution method (though at the time of publication it lacked associative-commutative unification). Theorem-proving, from this point of view, is like searching an immense, a mindbogglingly immense, maze. If the statement we are trying to prove is indeed a theorem -and of course we are not supposed to know that it is then among the many paths that lead from the set of clauses at hand to sets of derived clauses there will be some that lead to an exit -a set that includes a null clause. If the statement happens not to be a theorem, then the maze has no exit. We accumulate clauses in the hope that they may play a role in filling in the path we seek. In the vast amount of information that is generated, most is dross. The problem is to control the derivation process in such a way that the null clause comes along before the information at hand becomes too large for any computer to handle. And this is to be done without semantic input, without understanding. The author quotes Henri Poincaré, with apparent approval: Thus be it understood, to demonstrate a theorem, it is neither necessary nor even advantageous to know what it means. It is an uncharacteristic sentence. When one tracks it down -a non-trivial task, since like all of Quaife's many quotations it is unreferenced -it can be seen that Poincaré was only paraphrasing sentiments of Hilbert with which he disagreed. The relevant passage is in Chapter 3 of Poincaré's Science and Method (1958, p. 147). In Quaife's work binary resolution is abandoned and sophisticated variants, hyperresolution and UR-resolution are used instead. As he puts it (p. 14) these "make larger inference steps in one fell swoop, without saving intermediate results to further clog up the clause space. They are effective steps in fighting the combinatorial explosion." In addition, reasoning about equality is done using paramodulation and demodulation. The latter brings in the powerful algebraic device of rewriting to normal form. A most important role is also played by two common-sense strategies: set-ofsupport, which ensures that each conclusion drawn is relevant to the theorem at hand, and subsumption, which discards a derived clause if it is less general than some other derived clause. OTTER offers the user a substantial degree of control over the process of generating new clauses. In particular, each generated clause is assigned a weight that may be used to discard the clause (if its weight is large) or to give it preference in the drawing of further inferences (if it is small). More importantly, the program, like human mathematicians, can invoke previously proved results, however the choice of which theorems to prove first is up to the user of the program. The question must be asked: can proofs obtained with so much input from the human user be said to be automatic? The author calls them "semi-automatic" and, revealingly, he consistently refers to them in the possessive as "my proofs." Referring to the range of possibilities between a prover that merely verifies each minute step of an argument entirely presented to it by the user and one that presents a complete proof given only the statement of the theorem (including the axioms of the relevant theory), he makes a case that his development is a positive step in the direction of automation. He argues that whereas earlier resolution provers were successful only when relevant lemmas were supplied by the user, in his presentation all previously proved lemmas are freely available. There is therefore less human intervention. Art Quaife is optimistic that human intervention can be eliminated. Are there grounds for this optimism? In the case, for example, of number theory the order of proof of theorems and introduction of concepts is grounded in hundreds of years of human experience. Deep semantic insights of generations of creative mathematicians have been called upon. If there are ways of doing away with that, none are offered in this book. To quote Poincaré again (1958, p. 193): Logic remains barren unless it is fertilized by intuition. 4. Set theory The first of the theories considered is von Neumann-Bernays-Gödel set theory. The author offers his own modifications of a published clausal version of Gödel's axioms and says (p. 26): Considerations of machine efficiency will be important in my conversion, and thus the clauses I supply do not result from a direct clausification of Gödel's axioms. Rather, I believe that with proper definitions they are provably equivalent to his. As elsewhere his honesty is to be commended, but in the context -where accuracy is essential and subtle errors are easily made -such ad hoc moves are worrying. Perhaps the main value of the book, to a non-expert reader, is the inside view of the actuality of resolution-based automated theorem proving that is offered by passages like the following, made in connection with the set theory material (p. 52): In my first attempt at a proof, I turn on UR-resolution, paramodulation into and from, and back demodulation. If these settings fail to obtain a proof within a reasonable length of time, I try turning on hyperresolution. I then also assign low weights to other functors that I expect must appear in the proof. This step is frequently necessary. The discipline of adhering to general inference rules must usually be departed from by bringing in theorem-specific weightings. A more radical departure lies in the choice of the many lemmas that proof of a major theorem usually requires. For example, in one of the culminating set-theory results presented in this section -Cantor's theorem that there is no function mapping a set onto its power set -the essence of the diagonal argument (the definition of the Cantor class and its membership conditions) must be provided by the user and proved as lemmas prior to the theorem itself. 5. Number theory The book includes a very large number of theorems of elementary number theory that have been proved semi-automatically on the basis of Peano's axioms. It must be realized that the list includes such items as the definition (pp. 192-3) -> ((x+(y-x)) = max(x,y)) (where y-x is defined to be 0 if y x) and the theorems -> (max(0,y) = y) and -> (max(x,x) = x). Fully-formalized arithmetic is broad as well as deep! The culminating results are two famous theorems of Euclid. The first is the irrationality of the square root of 2 (or indeed any prime number). The negation of the theorem is expressed in clausal form as (p. 81): PRIME(p). -> ((p.(a.a)) = (b.b)). (gcd(a,b) = 0) -> . What is actually proved is that a prime cannot be a ratio of squares of natural numbers; the reason, the author says, is that "I haven't yet developed the theory of fractions, let alone square roots." The proof takes only 25 lines. But it must be realized that the thirteen lemmas that were needed to produce it include such minor items as (p. 81) ((u.(x.x)) = (y.y)) -> ((u.((x/gcd(x,y)).(x/gcd(x,y)))) = ((y/gcd(x,y)).(y/gcd(x,y)))). and such major items as (gcd(x,y) = 1), DIV(x,(y.z)) -> DIV(x,z). While it is true that these lemmas are selected by the prover from among the four hundred-odd previously proved theorems, their particularity strongly suggests how vast a gulf must be bridged to fully automate the proof. The author speculates (p. 91) "that if a human has not proved the Goldbach conjecture within about forty years, the odds will shift in favor of a machine first finding the proof." Forty years is a good deal longer than the usual span for AI predictions, and if one will accept a proof with hundreds of lemmas, carefully chosen by the human user, following the research of human mathematicians, the prediction is not outlandish. A fully automatic proof would be a completely different matter. So far as we know the background needed for proving deep results in number theory goes far beyond what might be expected from the simplicity of the axioms and the theorem statements. There is no indication in this work that a resolution prover might in time automatically develop the theory of elliptic functions in response to a request for a proof of Fermat's last theorem. 6. Euclidean geometry Euclidean geometry appears to be an ideal prospective domain for automatic proof. The pioneering work of Gelerntner (1963) (see also Gilmore 1970) served mainly to show that the appearance is deceptive. Axiomatic presentations of Euclidean geometry have not proved to be very amenable to automation. The development presented by Art Quaife confirms that observation. He uses a complete axiomatization, due to Alfred Tarski (1951), based on equidistance and betweenness as primitive relations on points. The most difficult of the theorems proved is that the diagonals of a rectangle bisect each other. The proof invokes 23 previously proved results and takes 555 seconds on a VAX 8800 running ULTRIX 2.0. It seems like overkill for a theorem that holds for parallellograms in general, not just rectangles, and that basically comes down to the trivial vector implication B-A = C-D => 1/2(A+C)= 1/2(B+D). The idea that an algebraic approach might pay off is not mistaken. Quaife briefly mentions the work done by Shang-Ching Chou (1988) at the University of Texas during the nineteen-eighties. Chou implemented an algorithm of the eminent Chinese mathematician Wu Wen-Tsun (1986) and was able to efficiently generate fully automatic proofs of hundreds of theorems of Euclidean geometry, including theorems that humans find difficult to prove such as Simson's theorem. Lemmas are not required. The significance of this tour de force has not adequately been recognized. If runs on the board are what counts, it is the major achievement to date in automated theorem proving. It suggests the possibility that an entirely different approach to automatic reasoning -in which deep, domain-specific, computational knowledge reduces the need to search, or does away with it entirely -may pay off in areas other than geometry. 7. Logic It is not surprising that the main successes of Quaife's prover and others like it have been in proving theorems of logic. Here if anywhere the method is suited to the matter. In the final chapter, semiautomatic proofs are offered of Löb's theorem and of Gödel's first and second incompleteness theorems. This is achieved within a formalization of the metatheory of the modal logic K4 -it is not theorems in logic that are proved, but theorems about logical theorems -with heavy use of demodulation. As with the other chapters there is a nice brief presentation of the relevant background. The final hundred pages or so of the book are devoted to an edited list of the theorems proved in NBG set theory and Peano arithmetic, material that should be useful to other workers in the area. For them the book may well be as valuable as its extraordinary price suggests. 8. Conclusion Art Quaife writes clearly, honestly and with infectious enthusiasm. His interesting book is a compilation of research papers which retains the excitement and focus on detail that his creative use of OTTER engendered. I can't help liking a scholarly book that carries the ancient Greek or southern Californian dedication: To the pursuit and realization of unlimited pleasure. Perhaps the book is mis-named. It contains no automated development of theories. On the contrary the development of the theories that are presented is carried out entirely by the author, with a combination of skill, experience, trial and error and, above all, knowledge. Little explanation is offered of empirical facts that might play a role in the further automation of the proof process, such as that hyper-resolution is preferred to URresolution (with the latter apparently mainly switched off) for the Tarski geometry whereas the reverse is the case for Peano arithmetic. Before sceptics will grant that what is going on here is truly automatic theorem proving it will be necessary to automate the control that currently must be exercised by the user of a resolution-based theorem prover. One may doubt whether that is feasible. Application of the cut rule, the rule of classical logic that allows the use of lemmas, is the main problem; it is here that knowledge is brought to bear which the program has no access to. To circumvent the problem, a different line of attack might pay off. The hard bits are easy and the easy bits are hard. That is a view that some AI people have. It means something like this: it's easy to prove Gödel's theorems but hard to recognize your mother. For a machine, that is. A look at mainstream automated theoremproving shows that matters are not so simple. The hard bits are hard too. The real work in proving a deep theorem lies in the development of the theory that it belongs to and its relationships to other theories, the design of definitions and axioms, the selection of good inference rules, and the recognition and proof of more basic theorems. Currently, no resolution-based program, when faced with the stark problem of proving a hard theorem, can do all this. That is not surprising. No person can either. Remarks about standing on the shoulders of giants are not just false modesty. Great theorems require great theories and theories do not, it seems, emerge from thin air. Their creation requires sweat, knowledge, imagination, genius, collaboration and time. As yet there is not much serious collaboration of machines with one another, and we are only just beginning to see real symbiosis between people and machines in the exercise of rationality. 9. Afterword: the role of understanding Quaife's experience seems to confirm Penrose's beliefs concerning understanding. What the program OTTER lacks is understanding. It has no conception of which results are important, let alone which are interesting. It lacks the ability, surely a component of human approaches to proof, to reject purported theorems that are plainly false on semantic grounds (such as the use of symmetry or of diagrams in the case of geometry). Art Quaife seems not to feel a need for such ingredients, but others in automated theorem proving do. They face very difficult problems of implementation, especially if they abhor ad hoc methods. But that is not the end of the story, or even of the story so far. Think about Chou's program, nicknamed the China prover. It will certainly beat any mathematician who's not seen it before to finding a proof of a difficult geometry theorem such as Morley's theorem (which is the assertion that the trisectors of the angles of any triangle meet at the vertices of an equilateral triangle). Does the China Prover understand? Many would say no. A less glib answer might be: it understands enough to prove the theorems. The point is that a certain kind of understanding, admittedly quite different to the kind that you and I have, is embedded in the algorithm that the prover uses. Unlike OTTER's method, the algorithm has deep mathematical content. Humans, of course, are responsible for the algorithm and the associated theory and proofs of correctness. Most prominent among them is David Hilbert. That's a nice touch on the part of whoever is writing the script for all this. It would be easy, but a mistake on several grounds, to dismiss the success of the China Prover because it is in the quaint field of Euclidean geometry. For many centuries Euclidean geometry was regarded (rightly) as one of the peaks of human intellectual achievement, and it remains the geometry of the world in which robots and most people operate. Moreover, essential features of the geometry theorem proving algorithm do carry over to other domains. On the question of understanding there is one further point that is quite telling, I think. It was observed by Chou as he ran his prover that many long-accepted results of Euclidean geometry were in fact subtly incorrect. Humans tend not to notice tricky but common special cases in which geometry theorems break down. It is the mistake we observe in undergraduates who forget that x must not be zero when they divide by it. The China prover does not need to be given such degeneracy conditions: it actually generates them. Who understands better, then, man or machine? Notes A very brief version of this article appeared in the Australian Computer Journal, February 1995. References Boyer, R., Lusk, E., McCune, W., Overbeek, R., Stickel, M. and Wos, L. (1986) Set theory in first-order logic: Clauses for Gödel's axioms. Journal of Automated Reasoning 2, 287-327. Chou, S. (1988) Mechanical geometry theorem proving. Reidel. Gelerntner, H. (1963) Realization of a geometry theorem-proving machine. In Feigenbaum and Feldman (eds.) Computers and thought, 134-152. McGraw-Hill. Gilmore, P. C. (1970) An examination of the geometry theorem-proving machine. Artificial Intelligence 1, 171-187. Lusk, E. and Overbeek, R. (1982) Experiments with resolution-based theorem-proving algorithms. Computers and Mathematics with Applications 8, 141-152 (1982). McCune, W. (1990) Otter 2.0 user's guide. ANL-90/9, Argonne National Laboratory. Poincaré, H. (1958) Science and method. Dover. Robinson, J. A. (1963) A machine oriented logic based on the resolution principle. Journal of the Association for Computing Machinery 10, 163-174. Tarski, A. (1951) A decision method for elementary algebra and elementary geometry. University of California Press. Wen-Tsun, Wu. (1986) Basic principles of mechanical theorem proving in geometries. Journal of the Automated Reasoning 2, 221-252. Wos, L., Robinson, G. and Carson, D. (1965) Efficiency and completeness of the set of support strategy in theorem proving. Journal of the Association for Computing Machinery 12, 536-541. Wos, L., Robinson, G., Carson, D. and Shalla, L. (1967) The concept of demodulation in theorem proving. Journal of the Association for Computing Machinery 14, 698-709.