Center for the Study of Language and Information
Report No. CSLI-88-126

Determination, Uniformity, and Relevance: Normative Criteria for Generalization and Reasoning by Analogy

Todd R. Davies
Artificial Intelligence Center, SRI International
and Department of Psychology, Stanford University

CSLI was founded early in 1983 by researchers from Stanford University, SRI International, and Xerox PARC to further research and development of integrated theories of language, information, and computation. CSLI headquarters and the publication offices are located at the Stanford site.

CSLI/SRI International, 333 Ravenswood Avenue, Menlo Park, CA 94025
CSLI/Stanford, Ventura Hall, Stanford, CA 94305
CSLI/Xerox PARC, 3333 Coyote Hill Road, Palo Alto, CA 94304

May 1988

The preparation and publication of this report have been made possible in part through a gift from the System Development Foundation.

This paper will also appear in D. H. Helman, ed., Analogical Reasoning: Perspectives of Artificial Intelligence, Cognitive Science, and Philosophy. Dordrecht: D. Reidel.

Copyright © 1988 Todd Davies

May 4, 1988

This paper will appear in Helman, D. H. (ed.), Analogical Reasoning: Perspectives of Artificial Intelligence, Cognitive Science, and Philosophy. Dordrecht: D. Reidel Publishing Company (Synthese Library Series), in press. The research reported here was made possible in part by a grant from the System Development Foundation to the Center for the Study of Language and Information, and in part by the Office of Naval Research under Contract Nos. N00014-85-C-OOll and N00014-85-C-0251.
The views and conclusions contained in this document are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the Office of Naval Research or the United States Government.

Introduction: The Importance of Prior Knowledge in Reasoning and Learning from Instances

If an agent is to apply knowledge from its past experience to a present episode, it must know what properties of the past situation can justifiably be projected onto the present on the basis of the known similarity between the situations. The problem of specifying when to generalize or reason by analogy, and when not to, therefore looms large for the designer of a learning system. One would like to be able to program into the system a set of criteria for rule formation from which the system can correctly generalize from data as they are received. Otherwise, all of the necessary rules the agent or system uses must be programmed in ahead of time, so that they are either explicitly represented in the knowledge base or derivable from it.

Much of the research in machine learning, from the early days when the robot Shakey was learning macro-operators for action [29] to more recent work on chunking [32] and explanation-based generalization [27], has involved getting systems to learn and represent explicitly rules and relations between concepts that could have been derived from the start. In Shakey's case, for example, the planning algorithm and knowledge about operators in STRIPS were jointly sufficient for deriving a plan to achieve a given goal. To say that Shakey "learned" a specific sequence of actions for achieving the goal means only that the plan was not derived until the goal first arose.
Likewise, in explanation-based generalization (EBG), explaining why the training example is an instance of a concept requires knowing beforehand that the instance embodies a set of conditions sufficient for the concept to apply, and chunking, despite its power to simplify knowledge at the appropriate level, does not in the logician's terms add knowledge to the system.

The desire to automate the acquisition of rules, without programming them into the system either implicitly or explicitly, has led to a good deal of the rest of the work in symbolic learning. Without attempting a real summary of this work, it can be said that much of it has involved defining heuristics for inferring general rules and for drawing conclusions by analogy. For example, Patrick Winston's program for learning and reasoning by analogy [43] attempted to measure how similar a source and target case were by counting equivalent corresponding attributes in a frame, and then projected an attribute from the source to the target if the count was large enough. In a similar vein, a popular criterion for enumerative induction of a general rule from instances is the number of times the rule has been observed to hold. Both types of inference, although they are undoubtedly part of the story for how people reason inductively and are good heuristic methods for a naive system,¹ are nonetheless fraught with logical (and practical) peril. In reasoning by analogy, for example, a large number of similarities between two children does not justify the conclusion that one child is named "Skippy" just because the other one is. First names are not properties that can be projected with any plausibility based on the similarity in the children's appearance, although shirt size, if the right similarities are involved, can be. In enumerative induction, likewise, the formation of a general rule from a
number of instances of co-occurrence may or may not be justified, as Nelson Goodman's well-known unprojectible predicate "grue" makes very clear [15]. So in generalizing and reasoning by analogy we must bring a good deal of prior knowledge to the situation to tell us whether the conclusions we might draw are justified. Tom Mitchell has called the effects of this prior knowledge in guiding inference the inductive "bias" [26].

¹ See the essay by Stuart Russell elsewhere in this volume.

A Logical Formulation of the Problem of Analogy

Reasoning by analogy may be defined as the process of inferring that a conclusion property Q holds of a particular situation or object T (the target) from the fact that T shares a property or set of properties P with another situation/object S (the source) which has property Q. The set of common properties P is the similarity between S and T, and the conclusion property Q is projected from S onto T. The process may be summarized schematically as follows:

\[
\begin{array}{l}
P(S) \wedge Q(S) \\
P(T) \\
\hline
Q(T)
\end{array}
\]

The form of argument defined above is nondeductive, in that its conclusion does not follow syntactically just from its premises. Instances of this argument form vary greatly in cogency. As an example, Bob's car and Sue's car share the property of being 1982 Mustang GLX V6 hatchbacks, but we could not infer that Bob's car is painted red just because Sue's car is painted red. The fact that Sue's car is worth about $3500 is, however, a good indication that Bob's car is worth about $3500. In the former example, the inference is not compelling; in the latter it is very probable, but the premises are true in both examples. Clearly the plausibility of the conclusion depends on information that is not provided in the premises. So the justification aspect of the logical problem of analogy, which has been much studied in the field of philosophy (see, e.g., [6,19,23,42]), may be defined as follows.
THE JUSTIFICATION PROBLEM: Find a criterion which, if satisfied by any particular analogical inference, sufficiently establishes the truth of the projected conclusion for the target case.

Specifically, this may be taken to be the task of specifying background knowledge that, when added to the premises of the analogy, makes the conclusion follow soundly.

It might be noticed that the analogy process defined above can be broken down into a two-step argument as follows: (1) from the first premise \( P(S) \wedge Q(S) \), conclude the generalization \( \forall x\, P(x) \Rightarrow Q(x) \), and (2) instantiate the generalization to T and apply modus ponens to get the conclusion \( Q(T) \). In this process, only the first step is nondeductive, so it looks as if the problem of justifying the analogy has been reduced to the problem of justifying a single-instance inductive generalization. This will in fact be the assumption henceforth: that the criteria for reasoning by analogy can be identified with those for the induction of a rule from one example. This amounts to the assumption that a set of similarities judged sufficient for projecting conclusions from the source to the target would remain sufficient for such a projection to any target case with the same set of similarities to the source.

There are clearly differences in plausibility among different single-instance generalizations that should be revealed by correct criteria. For example, if inspection of a red robin reveals that its legs are longer than its beak, a projection of this conclusion onto unseen red robins is plausible, but projecting that the scratch on the first bird's beak will be observed on a second red robin is implausible. However, the criteria that allow us to distinguish between good and bad generalizations from one instance cannot do so on the basis of many of the considerations one would use for enumerative induction, when the number of cases is greater than one.
The criteria for enumerative induction include (1) whether or not the conclusion property taken as a predicate is "entrenched" (unlike "grue," for instance) [15], (2) how many instances have confirmed the generalization, (3) whether or not there are any known counterexamples to the rule that is to be inferred, and (4) how much variety there is in the confirming instances on dimensions other than those represented in the rule's antecedent [37]. When we have information about only a single instance of a property pertinent to its association with another, then none of the above criteria will provide us with a way to tell whether the generalization is a good one. Criteria for generalizing from a single instance, or for reasoning by analogy, must therefore be simpler than those required for general enumerative induction. Identifying these more specialized criteria thus seems like a good place to start in elucidating precise rules for induction.

One approach to the analogy problem has been to regard the conclusion as plausible in proportion to the amount of similarity that exists between the target and the source (see [25]). Heuristic variants of this have been popular in research on analogy in artificial intelligence (AI) (see, e.g., [4,43]). Insofar as these "similarity-based" methods and theories of analogy rely upon a measure over the two cases that is independent of the conclusion to be projected, it is easy to see that they fail to account for the differences in plausibility among many analogical arguments. For example, in the problem of inferring properties of an unseen red robin from those of one already studied, the amount of similarity is fixed, namely that both things are red robins, but we are much happier to infer that the bodily proportions will be the same in both cases than to infer that the unseen robin will also have a scratched beak. It is worth emphasizing that this is true no matter how well constructed the similarity metric is.
Partly in response to this problem, researchers studying analogy have recently adverted to relevance as an important condition on the relation between the similarity and the conclusion [22,35]. However, to be a useful criterion, the condition of the similarity P being relevant to the conclusion Q needs to be weaker than the inheritance rule \( \forall x\, P(x) \Rightarrow Q(x) \), for then the conclusion in plausible analogies would always follow just by application of the rule to the target. Inspection of the source would then be redundant. So a solution to the logical problem of analogy must, in addition to providing a justification for the conclusion, also ensure that the information provided by the source instance is used in the inference. We therefore have the following.

THE NONREDUNDANCY PROBLEM: The background knowledge that justifies an analogy or single-instance generalization should be insufficient to imply the conclusion given information only about the target. The source instance should provide new information about the conclusion.

This condition rules out trivial solutions to the justification problem. In particular, although the additional premise \( \forall x\, P(x) \Rightarrow Q(x) \) is sufficient for the validity of the inference, it does not solve the nonredundancy problem and is therefore inadequate as a general solution to the logical problem of analogy. To return to the example of Bob's and Sue's cars, the nonredundancy requirement stipulates that it should not be possible, merely from knowing that Bob's car is a 1982 Mustang GLX V6 hatchback, and having some rules for calculating current value, to conclude that the value of Bob's car is about $3500, for then it would be unnecessary to invoke the information that Sue's car is worth that amount. The role of the source analogue (or instance) would in that case be just to point to a conclusion which could then be verified independently by applying general knowledge directly to Bob's car.
The nonredundancy requirement assumes, by contrast, that the information provided by the source instance is not implicit in other knowledge. This requirement is important if reasoning from instances is to provide us with any conclusions that could not be inferred otherwise.

As was noted above, the rules formed in EBG-like systems are justified, but the instance information is redundant, whereas in systems that use heuristics based on similarity to reason analogically, the conclusion is not inferrable from prior knowledge but is also not justified after an examination of the source. There has been a good deal of fruitful work on different methods for learning by analogy (e.g., [3,4,5,16,22,43]) in which the logical problem is of secondary importance to the empirical usefulness of the methods for particular domains. Similarity measures, for instance, can prove to be a successful guide to analogizing when precise relevance information is unavailable, and the value of learning by chunking, EBG, and related methods should not be underestimated either. The wealth of engineering problems to which these methods and theories have been applied, as well as the psychological data they appear to explain, all attest to their importance for AI.

In part, the current project can be seen as an attempt to fill the gap between similarity-based and explanation-based learning, by providing a way to infer conclusions whose justifications go beyond mere similarity but do not rely on the generalization being implicit in prior knowledge. In that respect, there will be suggestions of methods for doing analogical reasoning. The other, perhaps more important, goal of this research has been to provide an underlying normative justification for the plausibility of analogy from a logical and probabilistic perspective, and in so doing to provide a general form for the background knowledge that is sufficient for drawing reliable, nonredundant analogical inferences, regardless of the method used.
The approach is intended to complement, rather than to compete with, other approaches. In particular, it is not intended to provide a descriptive account of how people reason by analogy or generalize from cases, in contrast to much of the work in cognitive psychology to date (e.g., [11,13]). Descriptive theories may also involve techniques that are not logically or statistically sound. The hope is that, by elucidating what conclusions are justified, it will become easier to analyze descriptive and heuristic techniques to see why they work and when they fail.

Determination Rules for Generalization and Analogical Inference

Intuitively, it seems that it should be possible to give a criterion that simultaneously solves both the justification problem and the nonredundancy problem. As an example, consider again the two car owners, Bob and Sue, who both own 1982 Mustang GLX V6 hatchbacks in good condition. Bob talks to Sue and finds out that Sue has been offered $3500 on a trade-in for her car. Bob therefore reasons that he too could get about $3500 if he were to trade in his car. Now if we think about Bob's state of knowledge before he talked to Sue, we can imagine that Bob did not know and could not calculate how much his car was worth. So Sue's information was not redundant to Bob. At the same time, there seemed to be a prior expectation on Bob's part that, since Sue's car was also a 1982 Mustang GLX V6 hatchback in good condition, he could be relatively sure that whatever Sue had been offered would be about the value of his (Bob's) car as well, and indeed of any 1982 Mustang GLX V6 hatchback in good condition. What Bob knew prior to examining the instance (Sue's car) was some very general but powerful knowledge in the form of a determination relation, which turns out to be a solution to the justification and nonredundancy problems in reasoning by analogy.
Specifically, Bob knew that the make, model, design, engine type, condition, and year of a car determine its trade-in value. With knowledge of a single determination rule such as this one, Bob does not have to memorize (or even consult) the Blue Book, or learn a complicated set of rules for calculating car values. A single example will tell him the value for all cars of a particular make, model, design, engine, condition, and year.

In the above example, Bob's knowledge that the make, model, design, engine, condition, and year determine the value of a car expresses a determination relation between functions, and is therefore equivalent to what would be called a "functional dependency" in database theory [39]. The logical definition for function G being functionally dependent on another function F is the following [40]:

\[
(*) \qquad \forall x, y \;\; F(x) = F(y) \Rightarrow G(x) = G(y).
\]

In this case, we say that a function (or set of functions) F functionally determines the value of function(s) G because the value assignment for F is associated with a unique value assignment for G. We may know this to be true without knowing exactly which value for G goes with a particular value for F. If the example of Bob's and Sue's cars (\( \mathit{Car}_B \) and \( \mathit{Car}_S \) respectively) from above is written in functional terms, as follows:

\[
\begin{array}{ll}
\mathit{Make}(\mathit{Car}_S) = \mathrm{Ford} & \mathit{Make}(\mathit{Car}_B) = \mathrm{Ford} \\
\mathit{Model}(\mathit{Car}_S) = \mathrm{Mustang} & \mathit{Model}(\mathit{Car}_B) = \mathrm{Mustang} \\
\mathit{Design}(\mathit{Car}_S) = \mathrm{GLX} & \mathit{Design}(\mathit{Car}_B) = \mathrm{GLX} \\
\mathit{Engine}(\mathit{Car}_S) = \mathrm{V6} & \mathit{Engine}(\mathit{Car}_B) = \mathrm{V6} \\
\mathit{Condition}(\mathit{Car}_S) = \mathrm{Good} & \mathit{Condition}(\mathit{Car}_B) = \mathrm{Good} \\
\mathit{Year}(\mathit{Car}_S) = 1982 & \mathit{Year}(\mathit{Car}_B) = 1982 \\
\mathit{Value}(\mathit{Car}_S) = \$3500 & \\
\hline
 & \mathit{Value}(\mathit{Car}_B) = \$3500
\end{array}
\]

then knowing that the make, model, design, engine, condition, and year determine value makes the conclusion valid.

Another form of determination rule expresses the relation of one predicate deciding the truth value of another, which can be written as:

\[
(**) \qquad (\forall x\, P(x) \Rightarrow Q(x)) \vee (\forall x\, P(x) \Rightarrow \neg Q(x)).
\]

This says that either all P's are Q's, or none of them are.
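The functional form (*) and the car example can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that cases are attribute-value dictionaries; the names `determinant_key`, `satisfies_dependency`, and `project` are hypothetical and do not come from the paper.

```python
# Illustrative sketch only: cases are modeled as attribute-value dictionaries,
# an assumption made here for brevity rather than the paper's representation.

def determinant_key(case, determinant):
    """Return the tuple of the case's values on the determinant attributes."""
    return tuple(case[attr] for attr in determinant)


def satisfies_dependency(cases, determinant, resultant):
    """Check (*) over known cases: equal determinant values imply equal resultant values."""
    seen = {}
    for case in cases:
        key = determinant_key(case, determinant)
        value = case[resultant]
        # setdefault stores the first value seen for this key and returns it afterwards.
        if seen.setdefault(key, value) != value:
            return False
    return True


def project(source, target, determinant, resultant):
    """Project the source's resultant value onto the target if they agree on the determinant.

    Returns None when the cases differ on any determinant attribute, in which
    case the determination rule licenses no conclusion.
    """
    if determinant_key(source, determinant) != determinant_key(target, determinant):
        return None
    return source[resultant]


DETERMINANT = ("make", "model", "design", "engine", "condition", "year")

car_s = {"make": "Ford", "model": "Mustang", "design": "GLX", "engine": "V6",
         "condition": "Good", "year": 1982, "value": 3500}
# Bob's car: same determinant values, value not yet known.
car_b = {attr: car_s[attr] for attr in DETERMINANT}

projected_value = project(car_s, car_b, DETERMINANT, "value")
print(projected_value)
```

Note that the rule alone says nothing about the value of Bob's car; the number comes only from the source case, which is the nonredundancy property discussed above.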
Having this assumption in a background theory is sufficient to guarantee the truth of the conclusion \( Q(T) \) from \( P(S) \wedge P(T) \wedge Q(S) \), while at the same time requiring an inspection of the source case S to rule out one of the disjuncts. It is therefore a solution to both the justification problem and the nonredundancy problem. We often have knowledge of the form "P decides whether Q applies." Such rules express our belief in the rule-like relation between two properties, prior to knowledge of the direction of the relation. For example, we might assume that either all of the cars leaving San Francisco on the Golden Gate Bridge have to pay a toll, or none of them do.

Other, more complicated formulas expressing determination relations can be represented. It is interesting to note that determination cannot be formulated as a connective, i.e., a relation between propositions or closed formulas. Instead it should be thought of as a relation between predicate schemata, or open formulas. In the semantics of determination presented in the next section, even the truth value of a predicate or schema is allowed to be a variable. Determination is then defined as a relation between a determinant schema and its resultant schema, and the free variables that occur only in the determinant are viewed as the predictors of the free variables that occur only in the resultant (the response variables). It is worth noting that there may be more than one determinant for any given resultant. For example, one's zip code and capital city are each individually sufficient to determine one's state. In our generalized logical definition of determination (see the section on "Representation and Semantics"), the forms (*) and (**) are subsumed as special cases of a single relation "P determines Q," written as \( P \succ Q \).

Assertions of the form "P determines Q" are actually quite common in ordinary language. When we say "The IRS decides whether you get a
tax refund," or "What school you attend determines what courses are available," we are expressing an invariant relation that reflects a causal theory. At the same time, we are expressing weaker information than is contained in the statement that P formally implies² Q. If P implies Q then P determines Q, but the reverse is not true, so the inheritance relation falls out as a special case of determination.

That knowledge of a determination rule or of "relevance" underlies preferred analogical inferences seems transparent when one has considered the shortcomings of alternative criteria, like how similar the two cases are, or whether the similarity together with our background knowledge logically implies the conclusion. It is therefore surprising that even among very astute philosophers working on the logical justifications of analogy and induction, so much emphasis has until recently been placed on probabilistic analyses based on numbers of properties [6], or on accounts that conclude that the analogue is redundant in any sound analogical argument (e.g., [7]). Paul Thagard and Richard Nisbett [37] speculate that the difficulty in specifying the principles that describe and justify inductive practice has resulted from an expectation on the part of philosophers that inductive principles would be like deductive ones in being capable of being formulated in terms of the syntactic structure of the premises and conclusions of inductive inferences. When, in 1953-54, Nelson Goodman [15] made his forceful argument for the importance of background knowledge in generalization, the Carnapian program of inductive logic began to look less attractive. Goodman was perhaps the first to take seriously the role and form of semantically grounded background criteria (called by him "overhypotheses") for inductive inferences.
The possibility of valid analogical reasoning was recognized by Julian Weitzenfeld [41], and Thagard and Nisbett [37] made the strong case for semantic (as opposed to syntactic, similarity-, or numerically-based) criteria for generalization. In the process both they and Weitzenfeld anticipated the argument made herein concerning determination rules.

² The term "formal implication" is due to Bertrand Russell and refers to the relation between predicates P and Q in the inheritance rule \( \forall x\, P(x) \Rightarrow Q(x) \).

The history of AI approaches to analogy and induction has largely recapitulated the stages that were exhibited in philosophy. But the precision required for making computational use of determination, and for applying related statistical ideas, gives rise to questions about the scope and meaning of the concepts that seem to demand a slightly more formal analysis than has appeared in the philosophical literature. In the next section, a general form is given for representing determination rules in first order logic. The probabilistic analogue of determination, herein called "uniformity," is then defined in the following section, and finally the two notions, logical and statistical, are used in providing definitions of the relation of "relevance" for both the logical and the probabilistic cases.

The Representation and Semantics of Determination

To define the general logical form for determination in predicate logic, we need a representation that covers (1) determination of the truth value or polarity of an expression, as in example cases of the form "P(x) decides whether or not Q(x)" (formula (**) from the previous section), (2) functional determination rules like (*) above, and (3) other cases in which one expression in first order logic determines another. Rules of the first form require us to extend the notion of a first order predicate schema in the following way.
Because the truth value of a first order formula cannot be a defined function within the language, let us introduce the concept of a polar variable, which can be placed at the beginning of an expression to denote that its truth value is not being specified by the expression. For example, the notation "\( i\,P(x) \)" can be read "whether or not P(x)," and it can appear on either side of the determination relation sign "\( \succ \)" in a determination rule, as in

\[
P_1(x) \wedge i_1 P_2(x) \succ i_2 Q(x).
\]

This would be read, "\( P_1(x) \) and whether or not \( P_2(x) \) jointly determine whether or not \( Q(x) \)," where \( i_1 \) and \( i_2 \) are polar variables.

As was mentioned above, the determination relation cannot be formulated as a connective, i.e., a relation between propositions or closed formulas. Instead, it should be thought of as a relation between predicate schemata, or open formulas with polar variables. For a first order language L, the set of predicate schemata for the language may be characterized as follows. If S is a sentence (closed formula or wff) of L, then the following operations may be applied, in order, to S to generate a predicate schema:

1. Polar variables may be placed in front of any wffs that are contained as strings in S,

2. Any object variables in S may be unbound (made free) by removing quantification for any part of S, and

3. Any object constants in S may be replaced by object variables.

All of and only the expressions generated by these rules are schemata of L.

To motivate the definition of determination, let us turn to some example pairs of schemata for which the determination relation holds. As an example of the use of polar variables, consider the rule that, for a student athlete, one's school, year, sport, and whether one is female determine who one's coach is and whether or not one has to do sit-ups. This can be represented as follows:

EXAMPLE 1:

\[
(\mathit{Athlete}(x) \wedge \mathit{Student}(x) \wedge \mathit{School}(x) = s \wedge \mathit{Year}(x) = y \wedge \mathit{Sport}(x) = z \wedge i_1 \mathit{Female}(x)) \succ (\mathit{Coach}(x) = c \wedge i_2\, \mathit{Sit\text{-}ups}(x)).
\]
As a second example, to illustrate that the component schemata may contain quantified variables, consider the rule that not having any deductions, having all of one's income from a corporate employer, and one's income determine one's tax rate:

EXAMPLE 2:

\[
(\mathit{Taxpayer}(x) \wedge \mathit{Citizen}(x, \mathit{US}) \wedge (\neg \exists d\, \mathit{Deductions}(x, d)) \wedge (\forall i\, \mathit{Income}(i, x) \Rightarrow \mathit{Corporate}(i)) \wedge \mathit{PersonalIncome}(x) = p) \succ (\mathit{TaxRate}(x) = r).
\]

In each of the above examples, the free variables in the component schemata may be divided, relative to the determination rule, into a case set \( \underline{x} \) of those that appear free in both the determinant (left-hand side) and the resultant (right-hand side), a predictor set \( \underline{y} \) of those that appear only in the determinant schema, and a response set \( \underline{z} \) of those that appear only in the resultant. These sets are uniquely defined for each determination rule. In particular, for example 1 they are \( \underline{x} = \{x\} \), \( \underline{y} = \{s, y, z, i_1\} \), and \( \underline{z} = \{c, i_2\} \); and for example 2 they are \( \underline{x} = \{x\} \), \( \underline{y} = \{p\} \), and \( \underline{z} = \{r\} \). In general, for a predicate schema \( \Sigma \) with free variables \( \underline{x} \) and \( \underline{y} \) and a predicate schema \( \mathrm{X} \) with free variables \( \underline{x} \) (shared with \( \Sigma \)) and \( \underline{z} \) (unshared), whether the determination relation holds is defined as follows:

\[
\Sigma[\underline{x}, \underline{y}] \succ \mathrm{X}[\underline{x}, \underline{z}]
\quad \text{iff} \quad
\forall \underline{y}, \underline{z}\; (\exists \underline{x}\; \Sigma[\underline{x}, \underline{y}] \wedge \mathrm{X}[\underline{x}, \underline{z}]) \Rightarrow (\forall \underline{x}\; \Sigma[\underline{x}, \underline{y}] \Rightarrow \mathrm{X}[\underline{x}, \underline{z}]).
\]

For interpreting the right-hand side of this formula, quantified polar variables range over the unary Boolean operators (negation and affirmation) as their domain of constants, and the standard Tarskian semantics is applied in evaluating truth in the usual way (see [10]). This definition covers the full range of determination rules expressible in first order logic, and is therefore more expressive than the set of rules restricted to dependencies between frame slots, given a fixed vocabulary of constants. Nonetheless, one way to view a predicate schema is as a frame, with slots corresponding to the free variables.
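To make the definition concrete, the following Python sketch evaluates it by brute force over small finite domains. This is only an illustration under stated assumptions: the schemata are supplied as ordinary Boolean functions `determinant(x, y)` and `resultant(x, z)`, the domains are explicit finite collections, and a polar variable is modeled as a Boolean (True for affirmation, False for negation). The function name `determines` and the animal data are hypothetical.

```python
from itertools import product


def determines(determinant, resultant, xs, ys, zs):
    """Brute-force check of the determination definition on finite domains.

    For every predictor binding y and response binding z: if some case x
    satisfies both determinant(x, y) and resultant(x, z), then every case x
    satisfying determinant(x, y) must also satisfy resultant(x, z).
    """
    for y, z in product(ys, zs):
        witnessed = any(determinant(x, y) and resultant(x, z) for x in xs)
        if witnessed and not all((not determinant(x, y)) or resultant(x, z) for x in xs):
            return False
    return True


# Hypothetical data for the rule "one's species determines whether or not one can fly".
species = {"tweety": "hummingbird", "zippy": "hummingbird", "eeyore": "donkey"}
can_fly = {"tweety": True, "zippy": True, "eeyore": False}


def has_species(x, y):
    """Determinant schema Species(x) = y."""
    return species[x] == y


def flies_with_polarity(x, z):
    """Resultant schema i Flies(x): the polar variable z is True or False."""
    return can_fly[x] == z


holds = determines(has_species, flies_with_polarity,
                   xs=list(species), ys={"hummingbird", "donkey"}, zs=(True, False))

# Adding a hummingbird that cannot fly violates the rule on this data.
species["limpy"] = "hummingbird"
can_fly["limpy"] = False
holds_with_exception = determines(has_species, flies_with_polarity,
                                  xs=list(species), ys={"hummingbird", "donkey"}, zs=(True, False))
print(holds, holds_with_exception)
```

The check only confirms that the recorded cases are consistent with the rule; it does not establish the rule for unseen cases.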
Using Determination Rules in Deductive Systems

Determination rules can provide the knowledge necessary for an agent or system to reason by analogy from case to case. This is desirable when the system builds up a memory of specific cases over time. If the case descriptions are thought of as conjunctions of well-formed formulas in predicate logic, for instance, then questions about the target case in such a system can be answered as follows:

1. Identify a resultant schema corresponding to the question being asked. The free variables in the schema are the ones to be bound (the response variables \( \underline{z} \)).

2. Find a determination rule for the resultant schema, such that the determinant schema is instantiated in the target case.

3. Find a source case, in which the bindings for the predictor variables \( \underline{y} \) in the determinant schema are identical to the bindings in the target case for the same variables.

4. If the resultant schema is instantiated in the source case, then bind the shared free variables \( \underline{x} \) of the resultant schema to their values in the target case's instantiation of the determinant schema, and bind the response variables to their values in the source case's instantiation of the resultant schema. The well-formed formula thus produced is a sound conclusion for the target case.

Such a system might start out with a knowledge base consisting only of determination rules that tell it what information it needs to know in order to project conclusions by analogy, and as it acquires a larger and larger database of cases, the system can draw more and more conclusions based on its previous experience. The determination rule also provides a matching constraint in searching for a source case.
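The four steps above can be sketched in Python under a simplifying assumption: instead of conjunctions of first order formulas, each case is an attribute-value dictionary, and each determination rule is a pair of determinant attribute names and a single resultant attribute. The function `answer_by_analogy` and the sample athlete data are hypothetical illustrations, not part of the original system description.

```python
def answer_by_analogy(question, target, rules, memory):
    """Answer a question about the target by analogy to a stored source case.

    question: name of the resultant attribute being asked about (step 1).
    rules:    list of (determinant_attributes, resultant_attribute) pairs.
    memory:   list of previously stored cases (attribute-value dictionaries).
    Returns the projected value, or None if no rule and source case apply.
    """
    for determinant, resultant in rules:
        # Step 2: the rule must concern the question and its determinant
        # must be fully instantiated in the target case.
        if resultant != question or not all(a in target for a in determinant):
            continue
        target_bindings = tuple(target[a] for a in determinant)
        for source in memory:
            # Step 3: the source must match the target on every predictor binding.
            if not all(a in source for a in determinant):
                continue
            if tuple(source[a] for a in determinant) != target_bindings:
                continue
            # Step 4: if the resultant is instantiated in the source, project it.
            if resultant in source:
                return source[resultant]
    return None


# Hypothetical rule: school, year, sport, and sex determine one's coach.
rules = [(("school", "year", "sport", "female"), "coach")]
memory = [
    {"school": "Stanford", "year": 2, "sport": "rowing", "female": True, "coach": "Lee"},
    {"school": "Stanford", "year": 2, "sport": "tennis", "female": True, "coach": "Kim"},
]
target = {"school": "Stanford", "year": 2, "sport": "tennis", "female": True}

coach = answer_by_analogy("coach", target, rules, memory)
print(coach)
```

Matching is exact on the determinant attributes only; any other similarities or differences between the source and the target play no role in the sketch.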
Rather than seeking to maximize the similarity between the source and the target, a system using determination rules looks for a case that matches the target on predictor bindings for a determinant schema, which may or may not involve a long list of features that the two cases must have in common.

A second use of determination rules is in the learning of generalizations. A single such rule, for example that one's species determines whether one can fly or not, can generate a potentially infinite number of more specific rules about which species can fly and which cannot, just from collecting case data on individual organisms that includes in each description the species and whether that individual can fly. So the suggestion for machine learning systems that grows out of this work is that systems be programmed with knowledge about determination rules, from which they can form more specific rules of the form \( \forall x\, P(x, Y) \Rightarrow Q(x, Z) \). Determination rules are a very common form of knowledge, perhaps even more so than knowledge about strict implication relationships. We know that whether you can carry a thing is determined by its size and weight, and that a student athlete's coach is determined by his or her school, year, sport, and sex. In short, for many, possibly most, outcomes about which we are in doubt, we can name a set of functions or variables that jointly determine it, even though we often cannot predict the outcome from just these values.

Some recent AI systems can be seen to embody the use of knowledge about determination relationships (e.g., see [1,5,31]). For example, Edwina Rissland and Kevin Ashley's program for reasoning from hypothetical cases in law represents cases along dimensions which are, in a loose sense, determinants of the verdicts. Likewise, research in the psychology and theory of induction and analogy (see, e.g., [30]) has postulated the existence of knowledge about the "homogeneity" of populations along different dimensions.
In all of this work, the reality that full, indefeasible determination rules cannot be specified for complicated outcomes, and that many of the determination rules we can think of have exceptions to them, has prompted a view toward weaker relations of a partial or statistical nature [33], and to determination rules that have the character of defaults [34]. The extension of the determination relation to the statistical case is discussed in the next section on uniformity.

A third use of determination rules is the representation of knowledge in a more compact and general form than is possible with inheritance rules. A single determination rule of the form $P(x, y) \succ Q(x, z)$ can replace any number of rules of the form $\forall x\, P(x, Y) \Rightarrow Q(x, Z)$ with different constants $Y$ and $Z$. Instead of saying, for instance, "Donkeys can't fly," "Hummingbirds can fly," "Giraffes can't fly," and so forth, we can say "One's species determines whether or not one can fly," and allow cases to build up over time to construct the more specific rules. This should ease the knowledge acquisition task by making it more hierarchical.

Uniformity: The Statistical Analogue of Determination

The problem of finding a determining set of variables for predicting the value of another variable is similar to the problem faced by the applied statistician in search of a predictive model. Multiple regression, analysis of variance, and analysis of covariance techniques all involve the attempt to fit an equational model for the effects of a given set of independent (predictor) variables on a dependent (response) variable or vector (see [21,28]). In each case some statistic can be defined which summarizes that proportion of the variance in the response that is explained by the model (e.g., multiple $R^2$, $\omega^2$).
In regression, this statistic is the square of the correlation between the observed and model-predicted values of the response variables, and is, in fact, often referred to as the "coefficient of determination" [21]. When the value of such a statistic is 1, the predictor variables clearly amount to a determinant for the response variable. They are, in such cases, exhaustively relevant to determining its value in the same sense in which a particular schema determines a resultant in the logical case. But when the proportion of the variance explained by the model is less than 1, it is often difficult to say whether the imperfection of the model is that there are more variables that need to be added to determine the response, or that the equational form chosen (linear, logistic, etc.) is simply the wrong one. In low dimensions (one or two predictors), a residual plot may reveal structure not captured in the model, but at higher dimensions this is not really possible, and the appearance of randomness in the residual plot is no guarantee in any case. So, importantly, the coefficient of determination and its analogues measure not the predictiveness of the independent variables for the dependents, but rather the predictiveness of the model. This seems to be an inherent problem with quantitative variables.

If one considers only categorical data, then it is possible to assess the predictiveness of one set of variables for determining another. However, there are multiple possibilities for such a so-called "association measure." In the statistics literature one finds three types of proposals for such a measure, that is, a measure of the dependence between variables in a k-way contingency table of count data. Firstly, there are what have been termed "symmetric measures" (see [17,18]) that quantify the degree of dependence between two variables, such as Pearson's index of mean square contingency [18].
Secondly, there are "predictiveness" measures, such as Goodman and Kruskal's $\lambda$ [14], which quantify the proportional reduction in the probability of error, in estimating the value of one variable (or function) of an individual, that is afforded by knowing the value of another. And thirdly, there are information theoretic measures (e.g., [38]) that quantify the average reduction in uncertainty in one variable given another, and can be interpreted similarly to the predictive measures [18].

In searching for a statistic that will play the role in probabilistic inference that is played by determination in logic, none of these three types of association measure appears to be what we are looking for. The symmetric measures can be ruled out immediately, since determination is not a symmetric relation. The predictive and information theoretic measures quantify how determined a variable is by another relative to prior knowledge about the value of the dependent variable. While this is a useful thing to know, it corresponds more closely to what in this paper is termed "relevance" (see next section), or the value of the information provided by a variable relative to what we already know. Logical determination has the property that a schema can contain some superfluous information and still be a determinant for a given outcome; that is, information added to our knowledge when something is determined does not change the fact that it is determined, and this seems to be a useful property for the statistical analogue of determination to have.

So a review of existing statistical measures apparently reveals no suitable candidates for what will hereinafter be called the uniformity of one variable or function given the value of another, or the statistical version of the determination relation.
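To make concrete the sense in which a predictiveness measure is relative to prior knowledge, the following sketch computes Goodman and Kruskal's $\lambda$ from a list of (F value, G value) observations, using the standard modal-class formulation. The function name and the toy data are assumptions introduced here for illustration only.

```python
from collections import Counter, defaultdict

def goodman_kruskal_lambda(observations):
    """Proportional reduction in error when guessing G with, versus without, F.

    `observations` is a list of (f_value, g_value) pairs. Returns None when G
    is already certain, since the measure is then undefined (0/0).
    """
    n = len(observations)
    overall = Counter(g for _, g in observations)
    by_f = defaultdict(Counter)
    for f, g in observations:
        by_f[f][g] += 1
    # Errors made by always guessing the overall modal value of G.
    errors_without_f = n - max(overall.values())
    # Errors made by guessing the modal value of G within each value of F.
    errors_with_f = n - sum(max(counts.values()) for counts in by_f.values())
    if errors_without_f == 0:
        return None
    return (errors_without_f - errors_with_f) / errors_without_f

flight = [("donkey", "no")] * 3 + [("hummingbird", "yes")] * 3
print(goodman_kruskal_lambda(flight))      # prints 1.0

flightless = [("donkey", "no")] * 3 + [("giraffe", "no")] * 3
print(goodman_kruskal_lambda(flightless))  # prints None
```

In the second data set species still determines the outcome, yet the measure has nothing to report because no error remains to be reduced. This is the dependence on prior knowledge noted above.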
Initially we might be led simply to identify the uniformity of a function G given another function F with the conditional probability

$\Pr\{G(x) = G(y) \mid F(x) = F(y)\}$

for randomly selected pairs x and y in our population. Similarly, the uniformity of G given a particular value (property or category) P might be defined as

$\Pr\{G(x) = G(y) \mid P(x) \wedge P(y)\}$,

and permutations of values and variables in the arguments to the uniformity function could be defined along similar lines. This possibility is adverted to by Thagard and Nisbett [37], though they are not concerned with exploring the possibility seriously.

If the uniformity statistic is to underlie our confidence in a particular value of G being shared by additional instances that share a particular value of F, where this latter value is newly observed in our experience, then it seems that we will be better off, in calculating the uniformity of G given F, if we conditionalize on randomly chosen values of F, and then measure the probability of a match in values for G, rather than asking what is the probability of a match on G given a match on F for a randomly chosen pair of elements in our past experience, or in a population. An example should illustrate this distinction and its importance. If we are on a desert island and run across a bird of a species unfamiliar to us (say, "shreebles," to use Thagard and Nisbett's term) and we further observe that this bird is green, we want the uniformity statistic to tell us, based on our past experience or knowledge of birds, how likely it is that the next shreeble we see will also be green. Let us say, for illustration, that we have experience with ten other species of birds, and that among these species nine of them are highly uniform with respect to color, but the other is highly varying.
Moreover, let us assume that we have had far greater numerical exposure to this tenth, highly variable species, than to the others, or that this species (call them "variabirds") is a lot more numerous generally. Then if we were to define uniformity as was first suggested, sampling at random from our population of birds, we would attain a much lower value for uniformity than if we average over species instead, for in the latter case we would have high uniformities for all but one of our known species and therefore the high relative population of variabirds would not skew our estimate. Intuitively the latter measure, based on averaging over species rather than individuals in the conditional, provides a better estimate for the probability that the next shreeble we see will be green. The important point to realize is that there are multiple possibilities for such a statistic, and we should choose the one that is most appropriate for what we want to know. For instance, if the problem is to find the probability of a match on color given a match on species for randomly selected pairs of birds, then the former measure would clearly be better.

Another factor that plays in the calculation when we average over species is the relative confidence we have in the quality of each sample, i.e., the sample size for each value of F. We would want to weight more heavily (by some procedure that is still to be specified) those values for which we have a good sample. Thus the uniformity statistic for estimating the probability of a match given a new value of F would be the weighted average,

$U(G \mid F) = \frac{1}{p} \sum_{i=1}^{p} w_i \Pr\{G(x) = G(y) \mid F(x) = F(y) = P_i\}$,

where $p$ is the number of values $P_i$ of F for which we have observed instances and also know their values for G. In the absence of information about the relative quality of the samples for different values of F, all of the weights $w_i$ would equal 1.
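The difference between the two candidate statistics can be checked numerically. The sketch below is an illustration under assumptions not made in the text: the match probabilities are estimated from pairs of distinct individuals in a sample, all weights default to 1, and the bird counts are invented (nine species seen twice each and always green, plus twenty variabirds in four colors).

```python
from collections import Counter, defaultdict

def match_probability(values):
    """Chance that two distinct, randomly chosen items share a value."""
    n = len(values)
    same = sum(c * (c - 1) for c in Counter(values).values())
    return same / (n * (n - 1))

def group_by_f(observations):
    groups = defaultdict(list)
    for f, g in observations:
        groups[f].append(g)
    return groups

def uniformity(observations, weights=None):
    """U(G | F): average over observed values P_i of F of the within-value
    match probability on G, with weights w_i (all 1 if not supplied)."""
    groups = {f: gs for f, gs in group_by_f(observations).items() if len(gs) >= 2}
    total = 0.0
    for f, gs in groups.items():
        w = 1.0 if weights is None else weights[f]
        total += w * match_probability(gs)
    return total / len(groups)  # division by p, as in the formula above

def pooled_match_probability(observations):
    """Pr{G(x) = G(y) | F(x) = F(y)} for pairs drawn from the whole sample."""
    groups = group_by_f(observations)
    same_f = sum(len(gs) * (len(gs) - 1) for gs in groups.values())
    same_f_and_g = sum(c * (c - 1) for gs in groups.values()
                       for c in Counter(gs).values())
    return same_f_and_g / same_f

# Invented sample: nine uniform species and one numerous, variable species.
birds = [(f"species{i}", "green") for i in range(9) for _ in range(2)]
birds += [("variabird", color)
          for color in ("red", "blue", "green", "yellow") for _ in range(5)]

print(round(uniformity(birds), 3))                # prints 0.921
print(round(pooled_match_probability(birds), 3))  # prints 0.246
```

Averaging over species keeps the numerous variabirds from dominating the estimate, which is the behavior wanted when predicting the color of the next shreeble.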
How might we make use of such a statistic in learning and reasoning? Its value is that, under the assumption that the uniformity of one function given another can be inferred by sampling, we can examine a relatively small sample of a population, tabulate data on the subsets of values appearing in the sample for the functions in question, and compute an estimate of the extent to which the value of one function is determined by the other. This will in turn tell us what confidence we can have in a generalization or inference by analogy based on a value for a predictor function (variable) co-occurring with a value for a response function, when either or both have not been observed before. The experience of most people in meeting speakers of foreign languages provides a good example. In the beginning, we might think, based on our early data, that one's nationality determines one's native language. But then we come across exceptions: Switzerland, India, Canada. We still think that native language is highly uniform given nationality, however, because its conditional uniformity is high. So in coming across someone from a country with which we are not familiar, we can assume that, with reasonably high probability, whatever language he or she speaks is the language that a randomly selected other person from that country speaks.^3

^3 I am indebted to Stuart Russell for this example, and for the suggestion of the term "uniformity". The definition of "partial determination" given by Russell in [33] corresponds to the special case of uniformity in which the weights are each 1.

Relevance: Logical and Statistical Definitions for the Value of Information

The concepts of determination and uniformity defined above can be used to help answer another common question in learning and problem solving. Specifically, the question is, how should an agent decide whether to pay attention to a given variable? A first answer might be that one ought to attend to variables that determine or suggest high uniformity for a given outcome of interest. The problem is that both determination and uniformity fail to tell us whether a given variable is necessary for determining the outcome. For instance, the color of Smirdley's shirt determines how many steps the Statue of Liberty has, as determination has been defined, because the number of steps presumably does not change over time. As another example, one's zip code and how nice one's neighbors are determine what state one lives in, because zip code determines state. This property for determination and uniformity is useful because it ensures that superfluous facts will not get in the way of a sound inference. But when one's concern is what information needs to be sought or taken into account in determining an outcome, the limits of resource and time dictate that one should pay attention only to those variables that are relevant to determining it.

The logical relation of relevance between two functions F and G may be loosely defined as follows: F is relevant to determining G if and only if F is a necessary part of some determinant of G. In particular, let us say that F is relevant to determining G iff there is some set of functions D such that (1) $F \in D$, (2) $D \succ G$, and (3) $D - \{F\}$ does not determine G.^4 We can now ask, for a given determinant of a function, which part of it is truly relevant to the determination, and which part gives us no additional information. Whether or not a given function has value^5 to us in a given situation can thus be answered from information about whether it is relevant to a particular goal. Relevance as here defined is a special case of the more general notion because we have used only functional determination in defining it. Nonetheless, this restricted version captures the important properties of relevance.

^4 This definition can easily be augmented to cover the relevance of sets of functions, and values, to others.

^5 "Value" as used here refers only to usefulness for purposes of inference.

Devika Subramanian and Michael Genesereth [36] have recently done work demonstrating that knowledge about the irrelevance of, in their examples, a particular proposition, to the solution of a logical problem, is useful in reformulating the problem to a more workable version in which only the aspects of the problem description that are necessary to solve it are represented. In a similar vein, Michael Georgeff has shown that knowledge about independence among subprocesses can eliminate the frame problem in modeling an unfolding process for planning [12]. Irrelevance and determination are dual concepts, and it is interesting that knowledge in both forms is important in reasoning.

Irrelevance in the statistical case can, on reflection, be seen to be related to the concept of probabilistic independence. In probability theory, an event A is said to be independent of an event B iff the conditional probability of A given B is the same as the marginal probability of A. The relation is symmetric. The statistical concept of irrelevance is a symmetric relation as defined in this paper. The definition is the following: F is (statistically) irrelevant to determining G iff

$U\{G(x) = G(y) \mid F(x) = F(y)\} = \Pr\{G(x) = G(y)\}$.

That is, F is irrelevant to G if it provides no information about the value of G. For cases when irrelevance does not hold, one way to define the relevance of F to G is as follows:

$R(F, G) = \left|\, U\{G(x) = G(y) \mid F(x) = F(y)\} - \Pr\{G(x) = G(y)\} \,\right|$.

That is, relevance is the absolute value of the change in one's information about the value of G afforded by specifying the value of F. Clearly, if the value of G is known with probability 1 prior to inspection of F then F cannot provide any information and is irrelevant. If the prior is between 0 and 1, however, the value of F may be highly relevant to determining the value of G.
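As an illustration of the statistical definition, the sketch below estimates R(F, G) from a sample of (F value, G value) observations. It assumes, beyond what the text specifies, that all weights in the uniformity statistic are 1 and that probabilities are estimated from pairs of distinct individuals; the function names and the nationality data are invented for the example.

```python
from collections import Counter, defaultdict

def match_probability(values):
    """Chance that two distinct, randomly chosen items share a value."""
    n = len(values)
    same = sum(c * (c - 1) for c in Counter(values).values())
    return same / (n * (n - 1))

def relevance(observations):
    """R(F, G) = |U(G | F) - Pr{G(x) = G(y)}| with unit weights."""
    groups = defaultdict(list)
    for f, g in observations:
        groups[f].append(g)
    # Uniformity: average the within-value match probability over values of F
    # that have at least two observations (an assumption of this sketch).
    per_value = [match_probability(gs) for gs in groups.values() if len(gs) >= 2]
    uniformity = sum(per_value) / len(per_value)
    # Prior: match probability on G when nothing is known about F.
    prior = match_probability([g for _, g in observations])
    return abs(uniformity - prior)

# Nationality (F) against native language (G): F is highly relevant.
speakers = [("France", "French")] * 3 + [("Japan", "Japanese")] * 3
print(round(relevance(speakers), 3))  # prints 0.6

# G already known with probability 1: F cannot add information.
monolingual = [("France", "French")] * 3 + [("Monaco", "French")] * 3
print(relevance(monolingual))         # prints 0.0
```

The second case reproduces the observation above that a function is irrelevant whenever the value of G is certain beforehand, even though it trivially determines G.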
It should be noted that relevance has been defined in terms of uniformity in the statistical case, just as it was defined in terms of determination in the logical case. The statistic of relevance is more similar to the predictive association measures mentioned in the last section for categorical data than is the uniformity statistic. As such it may be taken as another proposal for such a measure. Relevance in the statistical case gives us a continuous measure of the value of knowing a particular function, or set of functions, or of knowing that a property holds of an individual, for purposes of determining another variable of interest. Knowledge about the relevance of variables can be highly useful in reasoning. In particular, coming up with a set of relevant functions, variables, or values for determining an outcome with high conditional uniformity should be the goal of an agent when the value of the outcome must be assessed indirectly.

Conclusion

The theory presented here is intended to provide normative justifications for conclusions projected by analogy from one case to another, and for generalization from a case to a rule. The lesson is not that techniques for reasoning by analogy must involve sentential representations of these criteria in order to draw reasonable conclusions. Rather it is that the soundness of such conclusions, in either a logical or a probabilistic sense, can be identified with the extent to which the corresponding criteria (determination and uniformity) actually hold for the features being related. As such it attempts to answer what has to be true of the world in order for generalizations and analogical projections to be reliable, irrespective of the techniques used for deriving them. That the use of determination rules without substantial heuristic control knowledge may be intractable for systems with large case libraries does not therefore mean that determination or uniformity criteria are of no use in designing such systems.
Rather, these criteria provide a standard against which practical techniques can be judged on normative grounds. At the same time, knowledge about what information is relevant for drawing a conclusion, either by satisfying the logical relation of relevance or by being significantly relevant in the probabilistic sense, can be used to prune the factors that are examined in attempting to generalize or reason by analogy.

As was mentioned earlier, logic does not prescribe what techniques will be most useful for building systems that reason by analogy and generalize successfully from instances, but it does tell us what problem such techniques should solve in a tractable way. As such, it gives us what David Marr [24] called a "computational theory" of case-based reasoning, that can be applied irrespective of whether the (in Marr's terms) "algorithmic" or "implementational" theory involves theorem proving over sentences [9] or not. A full understanding of how analogical inference and generalization can be performed by computers as well as it is performed by human beings will surely require further investigations into how we measure similarity, how situations and rules are encoded and retrieved, and what heuristics can be used in projecting conclusions when a valid argument cannot be made. But it seems that logic can tell us quite a lot about analogy, by giving us a standard for evaluating the truth of its conclusions, a general form for its justification, and a language for distinguishing it from other forms of inference. Moreover, analysis of the logical problem makes clear that an agent can bring background knowledge to bear on the episodes of its existence, and soundly infer from them regularities that could not have been inferred before.

Acknowledgments

Much of this paper is based on my senior thesis, submitted to Stanford University in 1985 and issued as [8].
I owe a great deal to my advisor for the project, John Perry, whose work with John Barwise on a theory of situations provided exactly the right framework for analysis of these issues [2]. In addition, I have profited greatly from discussions with Stuart Russell, Amos Tversky, Devika Subramanian, Benjamin Grosof, David Helman, Leslie Kaelbling, Kurt Konolige, Doug Edwards, Jerry Hobbs, Russ Greiner, David Israel, Michael Georgeff, Stan Rosenschein, Paul Rosenbloom, Anne Gardner, Evan Heit, Yvan Leclerc, Aaron Bobick, and J. O. Urmson. [Added for the CSLI Reports edition: Thanks also to Valerie Maslak of SRI for editing, to Kluwer Academic Publishers for approving the release of this report in advance of publication of the book Analogical Reasoning, and to Dikran Karagueuzian of CSLI for patiently waiting for, and producing, this report.]

References

[1] Baker, M. & Burstein, M. H. Implementing a Model of Human Plausible Reasoning. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI-87). Los Altos, CA: Morgan Kaufmann, 1987, pp. 185-188.
[2] Barwise, J. & Perry, J. Situations and Attitudes. Cambridge, MA: MIT Press, 1983.
[3] Burstein, M. H. A Model of Incremental Analogical Reasoning and Debugging. In Proceedings of the National Conference on Artificial Intelligence (AAAI-83). Los Altos, CA: Morgan Kaufmann, 1983, pp. 45-48.
[4] Carbonell, J. G. Derivational Analogy and Its Role in Problem Solving. In Proceedings of the National Conference on Artificial Intelligence (AAAI-83). Los Altos, CA: Morgan Kaufmann, 1983, pp. 64-69.
[5] Carbonell, J. G. Derivational Analogy: A Theory of Reconstructive Problem Solving and Expertise Acquisition. In Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (eds.), Machine Learning: An Artificial Intelligence Approach, Volume II. Los Altos, CA: Morgan Kaufmann, 1986, pp. 371-392.
[6] Carnap, R. Logical Foundations of Probability. Chicago: University of Chicago Press, 1963.
[7] Copi, I. M. Introduction to Logic. New York: The Macmillan Company, 1972.
[8] Davies, T. Analogy. Informal Note No. IN-CSLI-85-4, Center for the Study of Language and Information, Stanford, CA, 1985.
[9] Davies, T. R. & Russell, S. J. A Logical Approach to Reasoning by Analogy. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI-87). Los Altos, CA: Morgan Kaufmann, 1987, pp. 264-270. Also issued as Technical Note 385, Artificial Intelligence Center, SRI International, Menlo Park, CA, July 1987.
[10] Genesereth, M. R. & Nilsson, N. J. Logical Foundations of Artificial Intelligence. Los Altos, CA: Morgan Kaufmann, 1987.
[11] Gentner, D. Structure Mapping: A Theoretical Framework for Analogy. Cognitive Science, 7, 1983, pp. 155-170.
[12] Georgeff, M. P. Many Agents Are Better Than One. Technical Note 417, Artificial Intelligence Center, SRI International, Menlo Park, CA, March 1987.
[13] Gick, M. L. & Holyoak, K. J. Schema Induction and Analogical Transfer. Cognitive Psychology, 15, 1983, pp. 1-38.
[14] Goodman, L. A. & Kruskal, W. H. Measures of Association for Cross Classifications. New York: Springer-Verlag, 1979.
[15] Goodman, N. Fact, Fiction, and Forecast. Cambridge, MA: Harvard University Press, 1983.
[16] Greiner, R. Learning by Understanding Analogies. Technical Report STAN-CS-85-1071, Stanford University, Stanford, CA, December 1985.
[17] Haberman, S. J. Association, Measures of. In Kotz, S. & Johnson, N. L. (eds.), Encyclopedia of Statistical Science, Volume 1. New York: John Wiley and Sons, 1982, pp. 130-131.
[18] Hays, W. L. & Winkler, R. L. Statistics, Volume II: Probability, Inference, and Decision. San Francisco: Holt, Rinehart and Winston, 1970.
[19] Hesse, M. B. Models and Analogies in Science. Notre Dame, IN: University of Notre Dame Press, 1966.
[20] Holland, J., Holyoak, K., Nisbett, R., & Thagard, P. Induction: Processes of Inference, Learning, and Discovery. Cambridge, MA: MIT Press, 1986.
[21] Johnson, R. A. & Wichern, D. A. Applied Multivariate Statistical Analysis. Englewood Cliffs, NJ: Prentice-Hall, 1982.
[22] Kedar-Cabelli, S. Purpose-directed Analogy. In The Seventh Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates, 1985, pp. 150-159.
[23] Leblanc, H. A Rationale for Analogical Inference. Philosophical Studies, 20, 1969, pp. 29-31.
[24] Marr, D. Vision. New York: W. H. Freeman and Company, 1982.
[25] Mill, J. S. A System of Logic. New York: Harper & Brothers Publishers, 1900.
[26] Mitchell, T. M. The Need for Biases in Learning Generalizations. Technical Report CBM-TR-117, Rutgers University, New Brunswick, NJ, May 1980.
[27] Mitchell, T. M., Keller, R. M., & Kedar-Cabelli, S. T. Explanation-based Generalization: A Unifying View. Machine Learning, 1, 1986, pp. 47-80.
[28] Montgomery, D. C. & Peck, E. A. Introduction to Linear Regression Analysis. New York: John Wiley & Sons, 1982.
[29] Nilsson, N. Shakey the Robot. Technical Note 323, Artificial Intelligence Center, SRI International, Menlo Park, CA, April 1984.
[30] Nisbett, R. E., Krantz, D. H., Jepson, D., & Kunda, Z. The Use of Statistical Heuristics in Everyday Inductive Reasoning. Psychological Review, 90, 1983, pp. 339-363.
[31] Rissland, E. L. & Ashley, K. D. Hypotheticals as Heuristic Device. In Proceedings of the National Conference on Artificial Intelligence (AAAI-86). Los Altos, CA: Morgan Kaufmann, 1986, pp. 289-297.
[32] Rosenbloom, P. S. & Newell, A. The Chunking of Goal Hierarchies: A Generalized Model of Practice. In Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (eds.), Machine Learning: An Artificial Intelligence Approach, Volume II. Los Altos, CA: Morgan Kaufmann, 1986, pp. 247-288.
[33] Russell, S. J. Analogical and Inductive Inference. Ph.D. Thesis, Stanford University, Stanford, CA, December 1986.
[34] Russell, S. J. & Grosof, B. N. A Declarative Approach to Bias in Inductive Concept Learning. In Proceedings of the National Conference on Artificial Intelligence (AAAI-87). Los Altos, CA: Morgan Kaufmann, 1987, pp. 505-510.
[35] Shaw, W. H. & Ashley, L. R. Analogy and Inference. Dialogue: Canadian Journal of Philosophy, 22, 1983, pp. 415-432.
[36] Subramanian, D. & Genesereth, M. R. The Relevance of Irrelevance. In Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI-87). Los Altos, CA: Morgan Kaufmann, 1987, pp. 416-422.
[37] Thagard, P. & Nisbett, R. E. Variability and Confirmation. Philosophical Studies, 42, 1982, pp. 379-394.
[38] Theil, H. On the Estimation of Relationships Involving Qualitative Variables. American Journal of Sociology, 76, 1970, pp. 103-154.
[39] Ullman, J. D. Principles of Database Systems. Rockville, MD: Computer Science Press, 1983.
[40] Vardi, M. Y. The Implication and Finite Implication Problems for Typed Template Dependencies. Technical Report STAN-CS-82-912, Stanford University, Stanford, CA, 1982.
[41] Weitzenfeld, J. S. Valid Reasoning by Analogy. Philosophy of Science, 51, 1984, pp. 137-149.
[42] Wilson, P. R. On the Argument by Analogy. Philosophy of Science, 31, 1964, pp. 34-39.
[43] Winston, P. H. Learning and Reasoning by Analogy. Communications of the Association for Computing Machinery, 23, 1980, pp. 689-703.