Semantic Information Measure with Two Types of Probability for Falsification and Confirmation [1]

Chenguang Lu
Survival99(a)gmail.com
Home Page: http://survivor99.com/lcg/english

Abstract: Logical Probability (LP) is strictly distinguished from Statistical Probability (SP). To measure semantic information or confirm hypotheses, we need to use a sampling distribution (a conditional SP function) to test or confirm a fuzzy truth function (a conditional LP function). The proposed Semantic Information Measure (SIM) is compatible with Shannon's information theory and Fisher's likelihood method. It ensures that the smaller the LP of a predicate and the larger the true value of the proposition, the more information there is. The SIM can therefore be used as Popper's information criterion for falsification or testing. The SIM also allows us to optimize the true value of counterexamples, or the degree of disbelief in a hypothesis, to obtain the optimized degree of belief, i.e., the Degree of Confirmation (DOC). To explain confirmation, this paper 1) provides a method for calculating the DOC of universal hypotheses; 2) discusses how to resolve the Raven Paradox with the new DOC and its increment; 3) derives the DOC of rapid HIV tests: DOC of "+" = 1-(1-specificity)/sensitivity, which is similar to the Likelihood Ratio (= sensitivity/(1-specificity)) but has the upper limit 1; 4) discusses negative DOC for excessive affirmations, wrong hypotheses, or lies; and 5) discusses the DOC of general hypotheses with GPS as an example.

1. Introduction

Popper's method of falsification (1935/1959; 1963/2005) uses semantic information as the criterion for testing and evaluating hypotheses. It therefore needs a proper Semantic Information Measure (SIM). The modern inductive method (Hempel, 1945; Carnap, 1952; Eells, 2000; Hawthorne, 2004/2012), on the other hand, uses samples to confirm hypotheses. As a result, it needs a proper confirmation measure, i.e., a Degree of Confirmation (DOC).
There have been many SIMs (Bar-Hillel and Carnap, 1952; Klir, 2005; Floridi, 2004, 2005/2015; Adriaans, 2010; D'Alfonso, 2011) and DOCs (Carnap, 1952; Popper, 1963/2005, 388; Earman, 1992; Milne, 1996; Joyce, 1999; Christensen, 1999; Fitelson, 1999; Tentori et al., 2007). However, this paper tries to provide a more reasonable SIM and DOC, so that the two methods (Falsification and Confirmation) are mutually compatible and, moreover, compatible with Shannon's Information Theory (1948) and Fisher's Likelihood Method (LM) (Aldrich, 1997). After Akaike (1974) revealed the relationship between Fisher's LM and the Kullback-Leibler divergence (1951), an information measure, some researchers, including the author, realized that we should use a number of samples instead of one to construct both the SIM and the likelihood for falsification and confirmation (Hawthorne, 2004/2014).

[1] Revised on 2016-09-22. This paper is a summary of the author's studies on the philosophy of science. It provides a new frame for semantic information and inductive logic (see Figure 1), and perhaps involves too many sensitive questions to be accepted by most journals. A reader without the patience to read the full paper may look only at some figures, tables, and equations to understand the author's thought roughly. The author intends to select some contents from it to write two or three shorter papers focusing on specific topics for submission elsewhere. Criticism and cooperation are welcome.

The author had proposed a SIM with a sampling distribution (conditional SP) and a Fuzzy Truth Function (FTF, conditional LP) (Lu, 1991, 1993, 1999). This measure is compatible with Shannon's Information Theory and Popper's Falsification Theory or Hypothesis-testing Theory. Recently, the author found that this measure was also compatible with Fisher's LM and could be used for the DOC. In research on birds' sexual selection, the author proposed the hypothesis that the colorful plumage of many birds reflects their demand for food.
While measuring the semantic information of the hypothesis "Birds with yellow feathers like eating nectar or pollen", the author found that he could modify the true value of counterexamples, or the Degree of Disbelief (DOD) in a hypothesis, to reduce the information loss from the counterexamples and increase the average semantic information. The Degree of Belief (DOB) is related to the DOD by the equation DOD=1-|DOB|. The DOC is defined as the optimized DOB with the SIM as criterion. Hence the author concludes that for any universal hypothesis, given a sampling distribution, the optimized DOB, i.e., the DOC, exists and allows the average semantic information to reach its upper limit: the Kullback-Leibler information, a special case of Shannon's mutual information.

The methods introduced in this paper differ considerably from the popular methods. First, this paper strictly distinguishes Statistical Probability (SP, denoted by P) from Logical Probability (LP, denoted by T), but uses the two types of probability together to test and confirm hypotheses. P is the probability with which an event occurs, and T is the probability with which a hypothesis is judged true by different people or in different cases. The SP is divided into objective probability (with which real events occur) and subjective probability or likelihood (predicted by hypotheses or their truth functions). Second, this paper strictly distinguishes the LP of a predicate (the smaller, the better) from the true value of the corresponding proposition (the larger, the better), and avoids speaking of "the logical probability of a proposition".

In this paper, small letters e1, e2, ..., em denote different individuals or pieces of evidence in set A; capital E denotes a variable taking a value from A, that is, E∈A={e1, e2, ..., em}. E=ei means that ei occurs. Similarly, H denotes one of the hypotheses or predicates h1, h2, ..., hn in set B, and H∈B={h1, h2, ..., hn}. H=hj means that hj is selected. After hj is selected, if E=ei, then there is the proposition hj(ei).
Aj denotes a fuzzy subset of A (Zadeh, 1965) such that the membership grade (∈[0,1]) of ei in Aj is the fuzzy true value of hj(ei), denoted by T(hj|ei) or T(Aj|ei). Hence the truth function of the predicate hj(E) is T(hj|E) or T(Aj|E). To resolve the Bar-Hillel-Carnap Paradox (BCP), Floridi (2004) emphasizes using truthlikeness to measure semantic information. The fuzzy true value T(Aj|ei) is the truthlikeness used by the author.

Figure 1 Two types of probability for Information and Confirmation: a new frame for hypothesis testing and induction [2]

Figure 1 shows how the two types of probability are put together to form the semantic information measure for tests, confirmations, and optimizations of hypotheses. The solid arrows mean that we may get an item from other items. For example, we may get the LP T(Aj) from P(E) and T(Aj|E). Yet P(hj|E) is an exception. In some cases, P(hj|E) already exists but is unknown a priori and can only be obtained by experiments, or derived from P(E), P(E|hj), and P(hj) by Bayes' formula. The dotted arrows mean the test, confirmation, or optimization of a hypothesis by objective conditional probability functions.

This paper first introduces the content of Figure 1, beginning with the Fuzzy Truth Function (FTF), and then discusses the calculation of the DOC in various cases, which is the main task of this paper.

[2] A list of symbols is appended at the end of this paper.

2. Fuzzy Truth Function, Logical Probability, Statistical Probability, and Likelihood

In daily language, the truth or falsity of a statement is generally fuzzy. For example, the statement "The thief is about 20 years old" is fuzzy. Its true value should be between 0 and 1. If the thief is actually 20 years old, the true value is 1. If there is a deviation, the true value will be smaller. For example, if his age is 25 years, the true value is about 0.8. If his age is 50 years, the true value is close to 0. So, the range of true values is [0, 1] instead of {0, 1}.
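The thief example can be sketched numerically. The bell-shaped curve and the width `d = 7.5` years below are assumptions, chosen only so that a real age of 25 gives a true value near 0.8 and a real age of 50 gives a value near 0; `truth_about` is a hypothetical helper name, not something defined in this paper.

```python
import math


def truth_about(age, target=20.0, d=7.5):
    """Fuzzy true value of "the age is about `target`" for a real age.

    The Gaussian shape and the width d are illustrative assumptions.
    """
    return math.exp(-((age - target) ** 2) / (2.0 * d ** 2))


# True values lie in [0, 1]: 1 at the stated age, smaller as the deviation grows.
for age in (20, 25, 50):
    print(age, round(truth_about(age), 4))
```

A larger assumed `d` would make the statement fuzzier, so that larger deviations still receive a noticeable true value.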
In the following sections, "truth function" means the FTF in most cases.

A typical example of semantic communication is the weather forecast. Let E denote rainfall and H denote a rainfall forecast. A possible forecast is hj="There will be moderate or heavy rain tomorrow". Another typical example of semantic communication is numerical prediction or estimation: "E is about ej", which may be written as hj(E)="E≈ej". In mathematics, "E≈ej" is often denoted by êj. Estimations include not only those made in natural language and mathematics, but also the arrow of the Global Positioning System (GPS), the hands of a watch, the indicators of various meters, and even a color sensation, as shown in Table 1.

Zadeh (1965) uses the membership function m_Aj(E) to define a fuzzy set Aj. The membership function is also the truth function of the predicate hj(E)="E∈Aj". This paper follows Zadeh in defining the truth function T(hj|E). To emphasize the semantic meaning hj(E)="E∈Aj", we also write T(hj|E) as T(Aj|E) in some cases. That is, we define

$$T(h_j|E) = T(A_j|E) = m_{A_j}(E) \tag{1}$$

This function is displayed as a curve, as shown in Figure 2. When E=ei, T(Aj|E) becomes the true value T(Aj|ei) of a proposition.

Table 1 Estimations (hj="E≈ej") and their true values

| Examples | Estimation hj="E≈ej" | Evidence E, a variable | Evidence ei, a constant | True value $T(A_j \mid e_i)$ of $h_j(e_i)$ |
|---|---|---|---|---|
| Daily language | "The thief is about 20 years old" | Real age | 18 years old | 0.9 |
| Economic prediction | "The stock index will go up about 20% this year" | Real rising percentage | 5% | 0.3 |
| Balance | Reading of a balance, such as "1 kg" | Real weight | 0.9 kg | 0.2 |
| GPS | Arrow ↖ on a map | Real position | Right 10 meters away | 0.8 |
| Color vision | Color sensation, such as a yellow sensation | Real color with some dominant wavelength | Color with dominant wavelength 570 nm (typical yellow) | 1 |

We could also treat ej, which makes T(Aj|ej)=1, as the Idea (proposed by the Ancient Greek philosopher Plato) of Aj; the membership grade m_Aj(ei) is then the degree of similarity, or confusion probability, of ei with the Idea ej. So, we use not only the probability of occurrence of objective messages, as in classical information theory, but also the subjective confusion probability.

Where do truth functions come from? The truth functions of natural language come from usage. Later, the author will prove that T(Aj|E) comes from the selecting rule function P(hj|E). Without knowing the past P(hj|E), we could still get T(Aj|E) from the statistics of a random set (Wang and Sanchez, 1982). Actually, a fuzzy hypothesis (or its truth function) is similar to a predictive model (or its parameter set) in statistics, such as in Maximum Likelihood Estimation (MLE) (Aldrich, 1997). If hj is an unbiased estimation, its truth function may be approximately written as

$$T(A_j|E) = \exp\left[-\frac{(E-e_j)^2}{2d^2}\right] \tag{2}$$

where d is the standard deviation. The larger d is, the fuzzier the estimation is. Note that the maximum of the truth function T(Aj|E) is 1.

Unlike other popular methods for measuring semantic information, this paper strictly distinguishes LP from SP, and uses both to measure semantic information.

First, E or ei is a fact or evidence. It has only an SP P(E), without an LP T(E). In practice, if one suspects that a piece of evidence is false, one may replace it with a hypothesis whose true value is between 0 and 1.

Second, a hypothesis hj=hj(E) has both an SP (the selected probability P(hj)), with which hj is selected, and an LP (denoted by T(hj)=T(Aj)), with which hj(E) is judged true. They are generally different. Consider the hypotheses h1="There will be small rain", h2="There will be moderate rain", and h3="There will be small to moderate rain". According to their semantic meanings, T(h3)≈T(h1)+T(h2); yet, there may be P(h3)<P(h1). The LP of a tautology is 1; yet its selected probability is close to 0.
Third, SP is normalized (an exception will be discussed later); for example, P(e1)+P(e2)+...+P(em)=1, P(h1)+P(h2)+...+P(hn)=1, and P(e1|hj)+P(e2|hj)+...+P(em|hj)=1. Yet LP is not normalized and has the maximum 1. Generally, T(Aj|e1)+T(Aj|e2)+...+T(Aj|em)>1 and T(A1)+T(A2)+...+T(An)>1, because A1, A2, ..., An are not disjoint. Only when they are disjoint and hypotheses h1, h2, ..., hn are always correctly selected do we have T(Aj)=P(hj), j=1, 2, ..., n.

Averaging the truth function, we get the LP of the predicate hj(E):

$$T(A_j) = \sum_i P(e_i)\, T(A_j|e_i) \tag{3}$$

This is just the probability of a fuzzy set defined by Zadeh (1986). Note that the LP of a predicate in this paper is different from the true value of a predicate in mathematical logic, which is equal to T(hj|e1)∧T(hj|e2)∧...∧T(hj|em), because 1) the true value in mathematical logic can only be 0 or 1, yet the LP can be any value between 0 and 1; 2) the true value is independent of P(E), yet the LP depends on P(E); 3) the true value is posterior, yet the LP is prior.

The LP defined by Eq. (3) is also different from the LP defined by Bar-Hillel and Carnap (1952) and others (Floridi, 2004), which is also independent of P(E). For example, according to their definition, three predicates divide the logical space into 2^3=8 lattices, and hence the minimum of the LP is 1/8. However, according to Eq. (3), the LP T(Aj) depends not only on the coverage of Aj, but also on the probability distribution of E over Aj. For example, although "That man is over 100 years old" has a larger coverage than "That man is about 60 years old", its LP is much smaller because P(men's age>100) is very small. Therefore, a smaller LP of a hypothesis means that the event described is more specific and more occasional. It is specificity and occasionality that Popper uses to explain the severity of tests and his information criterion.

Strictly speaking, the term "logical probability of a proposition" is improper, because a proposition has only a true value rather than an LP.
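As a rough illustration of Eq. (3), the sketch below averages two truth functions over a source P(E). The age grid, the probabilities in `p_e`, and the width `d = 10` are invented for illustration (assumptions, not data), and the helper names are hypothetical.

```python
import math

# Hypothetical source P(E) over a coarse grid of ages (assumed values, sum to 1).
ages = [20, 40, 60, 80, 100, 105]
p_e = [0.30, 0.30, 0.25, 0.13, 0.015, 0.005]


def truth_about_60(age, d=10.0):
    """Fuzzy truth function of "That man is about 60 years old" (Gaussian shape assumed)."""
    return math.exp(-((age - 60) ** 2) / (2 * d ** 2))


def truth_over_100(age):
    """Non-fuzzy truth function of "That man is over 100 years old"."""
    return 1.0 if age > 100 else 0.0


def logical_probability(truth):
    """Eq. (3): T(Aj) = sum_i P(ei) * T(Aj|ei)."""
    return sum(p * truth(e) for p, e in zip(p_e, ages))


lp_about_60 = logical_probability(truth_about_60)
lp_over_100 = logical_probability(truth_over_100)
print(f"T(about 60) = {lp_about_60:.3f}, T(over 100) = {lp_over_100:.3f}")
```

With these assumed numbers, the LP of "over 100" is far smaller than that of "about 60", because almost no probability mass lies above 100.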
We could treat T(Aj) as the prior LP of hypothesis hj, and the true value T(Aj|ei) as the posterior LP. Because of the improper usage of this term, researchers are often puzzled by the question of whether a larger or a smaller LP of a proposition is better. The author believes that a smaller LP and a larger true value are better (or a smaller prior LP and a larger posterior LP are better), because the smaller the LP is, the more severe the test is; the larger the true value is, the better the hypothesis survives the test.

When hj is selected, the probability of E is P(E|hj); when hj is true, the probability of E should be (Author, 19)

$$P(E|A_j) = \frac{P(E)\, T(A_j|E)}{T(A_j)} \tag{4}$$

This formula may be called the semantic Bayes' formula; it establishes the relationship between SP and LP. In terms of MLE, Aj (or T(Aj|E)) is a predictive model and P(E|Aj) is the likelihood function. Note that the peak of the likelihood P(E|Aj) lies between the peak of T(Aj|E) and the peak of P(E), as illustrated in Figure 2.

Figure 2 The likelihood P(E|Aj) is located between the truth function T(Aj|E) and the source P(E)

In the popular methods, P(E|b) is used as the probability distribution of E for given background knowledge b. In this paper, an equiprobable distribution P(E)≡1/m means no background knowledge; a general P(E) already carries the background knowledge. In other words, P(E) means P(E|b); that is, b is omitted in this paper. The author uses P(E) in a way similar to the way Shannon (1948) uses P(X).

3. Semantic Information Measure for Falsification

According to classical information theory, the relative information formula (Rosie, 1966) is

$$I(e_i; h_j) = \log \frac{P(e_i|h_j)}{P(e_i)} \tag{5}$$

This formula is the core of Shannon's mutual information formula (1948). Averaging I(ei; hj), we get Shannon's mutual information I(E; H). However, Shannon never used this formula.
The reason is that the use of this formula may bring negative information; yet Shannon's formulas for entropy and mutual information measure only mean information, which is always non-negative. The author believes that negative information is possible and meaningful for semantic information, because if we believe lies or wrong predictions, the information will be negative. We replace "hj" in Eq. (5) with "hj is true", so that Eq. (5) becomes (Author, 19):

$$I(e_i; h_j) = \log \frac{P(e_i|h_j \text{ is true})}{P(e_i)} = \log \frac{P(e_i|A_j)}{P(e_i)} \tag{6}$$

This formula is similar to the formula proposed by Popper for the severity of tests (1963/2005, 526) and the formula proposed by Milne (1996) for the DOC. The differences are that 1) hj is replaced by Aj, which clearly means that hj is true rather than that hj is selected; 2) the background knowledge b is omitted here. According to Eqs. (4) and (6), we get the Semantic Information Formula (SIF):

$$I(e_i; h_j) = \log \frac{T(A_j|e_i)}{T(A_j)} \tag{7}$$

which is illustrated in Figure 3.

Figure 3 Illustration of the semantic information formula

Floridi (2004) follows Popper (1963/2005, 526, 534) in emphasizing truthlikeness or verisimilitude for semantic information. It is T(Aj|E) that is used for truthlikeness here. The above semantic information measure has three characteristics:

1. To determine the amount of information, we need to test the prediction against the evidence. When the evidence is exactly consistent with the prediction, that is, ei=ej, the information reaches its maximum. The information decreases as the deviation increases. When the deviation reaches a certain level, the information is negative. This relationship reflects the ordinary error criterion.

2. The smaller the LP T(Aj) (i.e., the lower the horizontal line in Figure 3) and the larger the true value T(Aj|ei), the larger I(ei; hj) is. This exactly reflects Popper's notion: the smaller the logical probability of a hypothesis, the more information there is if it can survive tests.

3. The information is 0 for a tautology or a contradiction. Popper affirms that a tautology contains no information, because it is not testable, i.e., it is logically non-falsifiable. The above formula reaches the same conclusion. For a tautology, T(Aj|E)≡1 and T(Aj)=P(e1)+P(e2)+...+P(em)=1; so, I(ei; hj)=log(1/1)=0. It is also easy to avoid the Bar-Hillel-Carnap Paradox (Floridi, 2004, 2005/2015) with this formula. For a contradiction, T(Aj|E)≡0 and T(Aj)=0, so I(ei; hj)=log(0/0). Since log(0+/0+)=log 1=0 (0+ is an infinitesimal), it is reasonable to take I(ei; hj)=0 for a contradiction.

Now let us consider the information of the estimation "E≈ej" with the truth function T(Aj|E)=exp[-(E-ej)^2/(2d^2)]. The SIF may be written as

$$I(e_i; h_j) = \log\frac{T(A_j|e_i)}{T(A_j)} = \log\frac{1}{T(A_j)} - \frac{(e_i-e_j)^2}{2d^2} \tag{8}$$

The above formula may be understood as Information = Testing severity - Relative deviation. Popper defined Testing severity and Verisimilitude (1963/2005, 526, 534). Since he did not distinguish LP and SP well, his definitions are not satisfactory. The author suggests defining log[1/T(Aj)] as testing severity, and T(Aj|ei)/T(Aj) as verisimilitude. In terms of LM, P(ei|Aj)/P(ei)=T(Aj|ei)/T(Aj) is also called the standard likelihood. So, we may say

Semantic information = log (Standard likelihood) = log (Verisimilitude) = Testing severity - Relative deviation

If negative verisimilitude for lies or wrong predictions is expected, one may also define verisimilitude by log[T(Aj|ei)/T(Aj)].

Averaging I(ei; hj) over different i in Eq. (7), we obtain the Average Semantic Information (ASI) of hypothesis hj:

$$I(E; h_j) = \sum_i P(e_i|h_j) \log\frac{T(A_j|e_i)}{T(A_j)} \tag{9}$$

which is called the Average Semantic Information Formula (ASIF), where P(ei|hj), i=1, 2, ..., m, is the sampling distribution obtained from statistics over a group of evidence.
According to this formula, if there is a counterexample ei for which P(ei|hj)>0 and T(Aj|ei)=0, yet T(Aj)≠0, then the average information is -∞. This coincides with Popper's assertion: one exception is enough to falsify a universal hypothesis. However, this assertion can only be applied to non-fuzzy hypotheses. How can we test fuzzy hypotheses (such as "People with high triglyceride probably also have fatty liver" and "There will be small to moderate rain tomorrow")? Popper did not offer a proper method. In daily life and in the social sciences, most hypotheses are fuzzy. The above formula allows a reasonable evaluation of these hypotheses within the frame of Popper's theory and avoids negative infinite information.

The above formula may also be written as

$$I(E; h_j) = \sum_i P(e_i|h_j) \log\frac{P(e_i|A_j)}{P(e_i)} \tag{10}$$

where P(ei|Aj), i=1, 2, ..., m, is the likelihood function and may be understood as the theoretical prediction; P(ei), i.e., P(ei|b), i=1, 2, ..., m, may be understood as the prior likelihood, background knowledge, or context. This formula is a generalization of the Kullback-Leibler (KL) formula (1951) and may be called the Generalized Kullback-Leibler Formula (GKLF), which is illustrated in Figure 4. The information measured by Eqs. (9) and (10) also has a coding meaning (Lu, 1994, 2012).

Figure 4 Illustration of the generalized Kullback-Leibler formula

I(E; hj) may be called the generalized KL information. It can be written as the difference of two KL divergences:

$$I(E; h_j) = \sum_i P(e_i|h_j) \log\frac{P(e_i|h_j)}{P(e_i)} - \sum_i P(e_i|h_j) \log\frac{P(e_i|h_j)}{P(e_i|A_j)} \tag{11}$$

Since a KL divergence is larger than or equal to 0, the information reaches its maximum when

$$P(e_i|A_j) = P(e_i|h_j), \quad i = 1, 2, \ldots, m \tag{12}$$

so that the second term is 0. The maximum is equal to the KL information. Therefore, the KL information is the upper limit of the generalized KL information.
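The relation between Eq. (9) and its KL upper limit can be checked on a small example. The sketch below is only illustrative: the three-point distributions `p_e` and `q` are assumed numbers, and the function names are hypothetical. It evaluates Eq. (9) for a truth function matched to the sampling distribution, for a non-fuzzy truth function with a sampled counterexample, and for the same hypothesis with the counterexample's true value raised to 0.5.

```python
import math


def logical_probability(p_e, truth):
    """Eq. (3): T(Aj) = sum_i P(ei) * T(Aj|ei)."""
    return sum(p * t for p, t in zip(p_e, truth))


def average_semantic_information(p_e, q, truth):
    """Eq. (9) in bits; q is the sampling distribution P(E|hj)."""
    t_a = logical_probability(p_e, truth)
    total = 0.0
    for qi, ti in zip(q, truth):
        if qi == 0:
            continue                  # unsampled evidence contributes nothing
        if ti == 0:
            return float("-inf")      # a sampled counterexample with true value 0
        total += qi * math.log2(ti / t_a)
    return total


def kl_information(p_e, q):
    """KL information of P(E|hj) relative to P(E), in bits."""
    return sum(qi * math.log2(qi / pi) for qi, pi in zip(q, p_e) if qi > 0)


p_e = [0.5, 0.3, 0.2]   # assumed source P(E)
q = [0.1, 0.3, 0.6]     # assumed sampling distribution P(E|hj)

ratio = [qi / pi for qi, pi in zip(q, p_e)]
matched = [r / max(ratio) for r in ratio]  # T(Aj|E) proportional to P(E|hj)/P(E), maximum 1
crisp = [0.0, 1.0, 1.0]                    # non-fuzzy; the first evidence is a counterexample
softened = [0.5, 1.0, 1.0]                 # counterexample's true value raised to 0.5

for name, truth in [("matched", matched), ("crisp", crisp), ("softened", softened)]:
    print(name, average_semantic_information(p_e, q, truth))
print("KL upper limit", kl_information(p_e, q))
```

Under these assumptions, the matched truth function attains the KL information, the non-fuzzy one gives negative infinity, and the softened one gives a finite value below the upper limit.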
First, the generalized KL information conforms to the ordinary error criterion: consistency is good. Second, it ensures that the more P(E|hj) differs from P(E), the more information is conveyed, provided that P(E|Aj) is close to P(E|hj). This is the very manifestation of Popper's viewpoint: the more unexpected a prediction is, and thus the more severe the tests it undergoes, the more information it conveys if it can survive the tests.

4. Compatibility between Generalized KL Information and Maximum Likelihood Estimation

Akaike (1974) revealed the relationship between the KL formula and MLE. The author explains the relationship between the GKLF and MLE below.

Let Z denote an observed condition, C={z1, z2, ..., zw} be a set of independent conditions, and Z∈C. For given Z=zk, the conditional probability of E is P(E|zk). Assume that some elements in C result in similar P(E|·); we merge these elements into a subset Cj of C. Then we select hj if, and only if, Z∈Cj. Hence P(E|hj)=P(E|Z∈Cj), denoted by P(E|Cj), so that the GKLF becomes

$$I(E; h_j) = \sum_i P(e_i|C_j) \log\frac{P(e_i|A_j)}{P(e_i)} \tag{13}$$

The likelihood (function) of a predictive model θ (or its parameter set) is defined as L(θ|E)=P(E|θ). To train the model, we use w samples e(1), e(2), ..., e(w)∈A. Then the likelihood becomes L(θ|E^w)=P(E^w|θ). If these samples are independent, then

$$P(E^w|\theta) = P(e(1)|\theta)\, P(e(2)|\theta) \cdots P(e(w)|\theta) \tag{14}$$

Assume that wi of the w samples are ei, i=1, 2, ..., m; hence

$$P(E^w|\theta) = \prod_i P(e_i|\theta)^{w_i} \tag{15}$$

Assume that these samples occur under the condition Z∈Cj and that w is large enough. Hence P(ei|Cj)=wi/w, and the logarithm of P(E^w|θ) becomes

$$\log P(E^w|\theta) = w \sum_i P(e_i|C_j) \log P(e_i|\theta) \tag{16}$$

Comparing Eqs. (13) and (16), it is easy to see that a fuzzy set Aj is equivalent to a model θ. Seeking a truth function T(Aj|E) with optimal parameters that maximize I(E; hj) is equivalent to seeking the optimal parameters of the model θ that maximize the likelihood.
The difference is that the semantic information method separates the model (the truth function T(Aj|E)) from the source P(E), and gets the likelihood P(E|Aj) by the semantic Bayes' formula, Eq. (4). The likelihood method, by contrast, does not separate them and directly constructs the likelihood P(E|θ). So, the semantic information method can be used for MLE when the source P(E) is variable.

From Eq. (12), we know that the average information I(E; hj) reaches its maximum when

$$T(A_j|E) = \frac{T(A_j)\, P(h_j|E)}{P(h_j)} \tag{17}$$

This is the inverse of the semantic Bayes' formula (4). Assume that P(hj|E) has its maximum P(hj|ej*) at E=ej*. Letting the maximum of T(Aj|E) be T(Aj|ej*)=1, we get the optimized truth function

$$T(A_j|E) = \frac{P(h_j|E)}{P(h_j|e_j^*)} = \frac{P(E|h_j)/P(E)}{P(e_j^*|h_j)/P(e_j^*)} \tag{18}$$

For MLE, if the number w of samples is large enough, the above equation becomes

$$T(A_j|E) = \frac{P(C_j|E)}{P(C_j|e_j^*)} = \frac{P(E|C_j)/P(E)}{P(e_j^*|C_j)/P(e_j^*)}, \quad j = 1, 2, \ldots, n \tag{19}$$

The above estimation method may be called Maximum Semantic Information Estimation (MSIE). Eq. (18) or (19) may be called the Fuzzy Information Criterion (FIC) of estimations.

Note that the conditional probability function P(H|ej) is normalized; yet P(hj|E) is not normalized, because hj on the left of the bar is not a variable. That means that, generally, P(hj|e1)+P(hj|e2)+...+P(hj|em)≠1 (as may be seen in Table 7). We may call P(hj|E) the Selecting Rule Function of hj. Note that all P(hj|E), j=1, 2, ..., n, form Shannon's channel P(H|E). So, Eq. (18) indicates how the semantic channel matches the Shannon channel to convey the most information.

The philosopher Wittgenstein has a famous standpoint (1958, 80): the meaning of a word lies in its use. Obviously, Eq. (18) supports this standpoint. While the audience or listeners continue to improve their understanding, forecasters or speakers also continue to improve their rules for selecting sentences, including selecting hj according to the observed condition Z. Language evolves in this way.

5. The Optimization of Degree of Disbelief

The Degree of Belief (DOB) is the degree to which one believes a hypothesis. It is subjective and prior. After the hypothesis is tested by a series of samples, one gets the optimized DOB, i.e., the DOC. So, the DOC is the degree of inductive support (Hawthorne, 2005). From now on, let us use b to denote the DOB, b* to denote the optimized b or DOC, b'=1-|b| to denote the DOD, and b'* to denote the optimized DOD or degree of disconfirmation.

In his studies of aesthetics and birds' sexual selection, the author proposed the hypothesis that the colorful plumage of most male birds was selected by female birds, and that the female taste for beauty came from, and indicated, their demand for food or environment. For example, the male peacock mimics the berry tree to attract the female because they like eating berries. First, the demand relationship between the peacock and berries selected the peacock's taste for beauty; later, the female taste for beauty selected the male feathers. When the author tried to use statistical data and the semantic information measure to test and evaluate the hypothesis h1="Birds with yellow feathers like eating nectar or pollen", he found that properly increasing the true value of counterexamples could reduce the information loss and increase the average semantic information I(E; h1). The true value of the counterexamples may be regarded as b', the DOD in h1. This method is introduced below.

In natural language, to reduce the information loss from counterexamples, we may use two methods to increase the fuzziness of a hypothesis's predictions. One is to use words such as "about" or "similar" to increase the fuzziness of a hypothesis by decreasing its precision. The d in Eq. (8) indicates the precision. The larger d is, the lower the precision is, and hence the fuzzier the hypothesis is. The other method is to use words such as "probably" or "plausibly" to decrease the audience's DOB b.
The smaller b is, the fuzzier (or more like a tautology) the hypothesis is. The audience may in turn adjust its DOB in a hypothesis. For example, people generally give a lower DOB to economists' predictions, a slightly higher DOB to weather forecasts and medical diagnoses, and a higher DOB to GPS, watches, and thermometers. For this reason, the DOB b of a predicate hj(E) (instead of a proposition) is defined by

$$T(h_j^b|E) = b' + b\,T(A_j|E), \quad \text{for } b > 0 \tag{20}$$

where hj is the initial hypothesis, which is fuzzy or non-fuzzy; hj^b is hj with DOB b; and b'=1-|b| is the degree of disbelief. This definition actually treats the DOD b' as the proportion of the tautology in hj with b. Figure 5 shows an example of modifying the true value of counterexamples from 0 to b' for the non-fuzzy hypothesis hj="There will be small rain (rainfall between 0 and 5 mm) tomorrow".

Figure 5 Modifying the true value of a non-fuzzy hypothesis from 0 to b' when counterexamples exist.

A non-fuzzy universal hypothesis is defined as:

h1="For all E, if E is in S1, then E is also in S2"

where S1 and S2 are two (non-fuzzy) sets. If we believe h1 to some degree b, h1 becomes a fuzzy hypothesis, denoted by h1^b. Consider these fuzzy inferences (with more or fewer counterexamples): "Birds with yellow feathers probably like eating nectar or pollen", "People with high triglyceride probably have fatty livers", "HIV-positive people are very probably infected by HIV", and "Probably all swans are white". They may be called fuzzy universal hypotheses.

Assume that all evidence in A can be divided into two types (as shown in Table 2): e1 (in S2) and e0 (not in S2), or into four kinds: e11 (in S1 and S2), e00 (not in S1 and not in S2), e10 (in S1 and not in S2), and e01 (not in S1 and in S2).
Table 2 Evidence divided into four types for a universal hypothesis

| E | e1∈S2 | e0∉S2 |
|---|---|---|
| E∈S1 | e11 | e10 |
| E∉S1 | e01 | e00 |

The fuzzy universal hypothesis h1^b is defined as:

h1^b="For all E, if E is in S1, then E is also in A1"

where b is the DOB in h1 and A1 is the fuzzified S2; the elements of A1 make h1^b true. Let S1' and S2' be the complements of S1 and S2, respectively. We define h0^b0 as

h0^b0="For all E, if E is in S1', then E is also in A0"

where b0 is the DOB in h0 and A0 is the fuzzified S2'; the elements of A0 make h0^b0 true. For the universal hypothesis h1, T(h1|e11)=1 and T(h1|e10)=0. For the fuzzy universal hypothesis h1^b, according to Eq. (20), T(A1|e11)=1 and T(A1|e10)=b'. Similarly, T(A0|e00)=1 and T(A0|e01)=b0'. The four truth values of the two fuzzy inferences are shown in Table 3.

Table 3 Four truth values of two fuzzy inferences

| | e1∈S2 | e0∉S2 |
|---|---|---|
| E∈S1, $h_1^b$ | $T(A_1 \mid e_{11}) = 1$ | $T(A_1 \mid e_{10}) = b' = 1 - \lvert b \rvert$ |
| E∉S1, $h_0^{b_0}$ | $T(A_0 \mid e_{01}) = b_0' = 1 - \lvert b_0 \rvert$ | $T(A_0 \mid e_{00}) = 1$ |

Let P1=P(e1), P0=P(e0), Q1=P(e1|h1), and Q0=P(e0|h1). Using Eq. (3), we get T(A1)=b'P0+P1. Using Eq. (9), we get

$$I(E; h_1^b) = Q_1 \log\frac{1}{b'P_0 + P_1} + Q_0 \log\frac{b'}{b'P_0 + P_1} \tag{21}$$

According to the above formula, when b=1 or b'=0, which means that the listeners fully believe h1, the information will be -∞ if there is a counterexample. When b=0 or b'=1, which means that the listeners do not believe h1 at all, the average information is 0. We can seek the b' (for 0≤b'≤1) that makes I(E; h1^b) reach its maximum (see Figure 6).

Figure 6 Information I(E; h1^b) changes with the degree of disbelief b' for P0/P1=0.8/0.2 and Q0/Q1=0.25/0.75

Setting the derivative dI(E; h1^b)/db'=0, we get Q0P1-Q1P0b'=0. Hence

$$b'^{*} = \frac{Q_0/Q_1}{P_0/P_1} \tag{22}$$

Since the second derivative is less than 0 at b'=b'*, b'* is the optimized b' that maximizes I(E; h1^b). The discussion of Eq. (22) below covers only the cases where Q0/Q1<P0/P1. If Q0/Q1>P0/P1, we need another formula to get b'*.
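The closed form in Eq. (22) can be compared with a direct numerical search over Eq. (21). The sketch below uses the ratios quoted in the caption of the figure for this example (P0/P1=0.8/0.2, Q0/Q1=0.25/0.75); the grid search, its step size, and the function name `info` are illustrative choices, not part of the paper's method.

```python
import math


def info(b_prime, p0, p1, q0, q1):
    """Eq. (21) in bits, for 0 < b' <= 1."""
    t_a = b_prime * p0 + p1                      # T(A1) from Eq. (3)
    return q1 * math.log2(1.0 / t_a) + q0 * math.log2(b_prime / t_a)


p0, p1 = 0.8, 0.2      # source P(e0), P(e1)
q0, q1 = 0.25, 0.75    # sampling distribution P(e0|h1), P(e1|h1)

b_closed = (q0 / q1) / (p0 / p1)                 # Eq. (22)

# Brute-force search over a grid of b' values as an independent check.
grid = [k / 10000 for k in range(1, 10001)]
b_grid = max(grid, key=lambda b: info(b, p0, p1, q0, q1))

kl = q1 * math.log2(q1 / p1) + q0 * math.log2(q0 / p0)
print(f"closed form b'* = {b_closed:.4f}, grid search b'* = {b_grid:.4f}")
print(f"I at b'* = {info(b_closed, p0, p1, q0, q1):.4f} bit, KL information = {kl:.4f} bit")
```

At b'=1 the function returns 0, matching the statement that a hypothesis that is not believed at all conveys no information.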
When P0<P1, P0/P1 may be called the prior Absolute Degree of Disbelief (ADOD); when Q0<Q1, Q0/Q1 may be called the posterior ADOD. So, b'* may be regarded as the decrement of the ADOD. If P0=P1=0.5, then P0/P1=1 and b'*=Q0/Q1, which means that without background knowledge P(E), b'* is equal to the posterior ADOD. From the ADOD, we could get an absolute confirmation measure (Huber, 2005). From Eqs. (20) and (22), we get

$$b^{*} = 1 - \frac{Q_0/Q_1}{P_0/P_1} \tag{23}$$

If Q0=0 and P0>0, then b'*=0 and b*=1, which means that the hypothesis h1 is completely confirmed. Eq. (22) may also be written as b'*=(Q0/P0)/(Q1/P1), which means that the DOD decreases as counterexamples decrease and positive examples increase. According to Eq. (18), we may directly get

$$b'^{*} = \frac{Q_0/P_0}{Q_1/P_1} = \frac{P(h_1|e_{10})}{P(h_1|e_{11})} \tag{24}$$

So, there is also

$$b^{*} = 1 - b'^{*} = 1 - \frac{P(h_1|e_{10})}{P(h_1|e_{11})} \tag{25}$$

which means that the DOC of h1 is related only to the selecting rule function P(h1|E) and the truth function T(h1|E), but not to the source P(E). Note that the test or semantic information is related to P(E).

Eq. (23) looks like a formula from Likelihoodism, yet Eq. (25) looks like a formula from Bayesianism. Eq. (24) shows that Likelihoodism and Bayesianism (Fitelson, 2007) can be compatible when a sampling distribution is used to confirm a hypothesis or its truth function (conditional LP function). However, every P in the above equations means SP rather than LP. The conditional LP (function) is what is to be confirmed, and hence does not occur in these formulas; otherwise, it would be supporting itself.

With the semantic Bayes' formula (4), we can use b'* and P(e1) to calculate the predicted probability of e1, denoted by P(e1|h1^b*). That is,

$$P(e_1|h_1^{b^*}) = \frac{P(e_1)}{P(e_1) + b'^{*} P(e_0)} \tag{26}$$

Figure 7 shows how P(e1|h1^b*) and I(E; h1^b*) are positively related to b*.

Figure 7 Relations of b* to the accuracy rate P(e1|h1^b*) and the information I(E; h1^b*) for prior probability P(e1)=0.2.

Now we use both b* and I(E; h1^b*) to evaluate and compare two hypotheses.
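This comparison can be reproduced from a 2x2 table of counts. The following is a minimal sketch, assuming all four counts are positive and that the resulting b'* does not exceed 1 (the case covered by Eq. (22)); the function name `confirm_from_counts` and the dictionary keys are hypothetical. It is applied to the bird counts that Table 4 reports (n11=83, n10=57, n01=17, n00=686).

```python
import math


def confirm_from_counts(n11, n10, n01, n00):
    """Compute b'*, b*, P(e1|h1^b*) and I(E; h1^b*) in bits from a 2x2 table.

    Assumes every count is positive and b'* <= 1.
    """
    n = n11 + n10 + n01 + n00
    p1, p0 = (n11 + n01) / n, (n10 + n00) / n          # source P(e1), P(e0)
    q1, q0 = n11 / (n11 + n10), n10 / (n11 + n10)      # sampling P(e1|h1), P(e0|h1)
    b_dis = (n10 / (n00 + n10)) / (n11 / (n01 + n11))  # Eq. (24): P(h1|e10)/P(h1|e11)
    t_a = b_dis * p0 + p1                              # Eq. (3): T(A1)
    info = q1 * math.log2(1 / t_a) + q0 * math.log2(b_dis / t_a)  # Eq. (21)
    predicted = p1 / (p1 + b_dis * p0)                 # Eq. (26)
    return {"b_dis": b_dis, "b_conf": 1 - b_dis, "predicted": predicted, "info_bits": info}


birds = confirm_from_counts(n11=83, n10=57, n01=17, n00=686)
print({key: round(value, 4) for key, value in birds.items()})
```

Because the optimized truth function makes the predicted distribution equal to the sampling distribution, the predicted P(e1|h1^b*) coincides with n11/(n11+n10) in this sketch.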
There are 843 representative birds in the book The Illustrated Encyclopedia of Birds of the World (Alderton, 2005). Those birds can be divided into four types by whether they have yellow feathers (including orange feathers) and whether they eat nectar (including pollen), as shown in Table 4, where n11 is the number of e11, and so on.

Table 4 The optimization of the degrees of disbelief b' about birds
                                eating nectar (e1)       not eating nectar (e0)
with yellow feathers (h1)       n11=83                   n10=57
without yellow feathers (h0)    n01=17                   n00=686
P(h1|E)                         n11/(n01+n11)=0.830      n10/(n00+n10)=0.0767
T(A1^b|E)                       1                        b'*=0.0924

According to Eq. (25) and Table 4, there is

$$b'^{*} = \frac{n_{10}/(n_{00}+n_{10})}{n_{11}/(n_{01}+n_{11})} = \frac{n_{10}(n_{01}+n_{11})}{n_{11}(n_{00}+n_{10})} \tag{27}$$

Hence, b'*=0.0924, b*=1-b'*=1-0.0924=0.908, and I(E; h1^b*)=0.923 bit. This information is equal to the KL information.

Consider the inference h1="If a person has high triglyceride, then he also has fatty liver". Data from health examinations of 142 people³ are shown in Table 5.

Table 5 The optimization of the degrees of disbelief b' about fatty liver
                          fatty liver (e1)    non-fatty liver (e0)
high triglyceride (h1)    n11=25              n10=16
low triglyceride (h0)     n01=41              n00=60
T(A1|E)                   1                   b'*=0.556

According to Table 5, the DOC b*=0.444 and the information I(E; h1^b*)=0.025 bit. Comparing the two inferences, the inference about birds is clearly more informative and more believable than the inference about fatty liver.

6. Raven Paradox and Rapid HIV Tests

Hempel (1945) describes the paradox in terms of the hypothesis: (1) "All ravens are black". It is equivalent to: (2) "All non-black things are not ravens". A non-black non-raven thing (e00) such as a piece of white chalk supports (2) and hence also supports (1). Yet, according to common knowledge, a piece of white chalk is irrelevant to (1). So there is a paradox. There have been many articles about this paradox.
(Good, 1960; Scheffler and Goodman, 1972; Maher, 1999; Fitelson and Hawthorne, 2010). Now let s1="E is in S1", s2="E is in S2", and s1->s2="If E is in S1 then E is in S2", and so on. There are generally two ways to resolve the paradox. One is to deny the Equivalence Condition (EC) (s1->s2 is equivalent to ¬s2->¬s1); the other is to deny Irrelevance (e00 is irrelevant to s1->s2). Most researchers, like Hempel, affirm the EC and deny Irrelevance, and believe that e11 can support h1 better than e00 (Fitelson and Hawthorne, 2010). Like Scheffler and Goodman (1972), the author also denies the EC and emphasizes falsification. Unlike almost all researchers, the author argues that for the inference h1=s1->s2, in many cases evidence e00 can support h1 better than e11.

³ From Figure 1 on this page: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2829436/. Fatty liver patients include borderline and NASH groups.

For a general universal hypothesis, there are four kinds of inferences, as shown in Table 6. Each inference has a particular pair of positive example and counterexample. For h1=s1->s2, the positive example is e11 and the counterexample is e10. For h2=¬s2->¬s1, the counterexample is also e10, but the positive example is e00. In Mathematical Logic, h1=h2; however, in fuzzy logic or probability logic, h1^b1 and h2^b2 should be different because their proportions of counterexamples to positive examples are different.

Table 6 Four kinds of inferences and their positive examples and counterexamples
                  h3=s2->s1               h2=¬s2->¬s1
h1=s1->s2         positive example e11    counterexample e10
h0=¬s1->¬s2       counterexample e01      positive example e00

According to Eq.
(27), h1 and h2 should be given different degrees of disconfirmation (the optimized DOD):

$$b_1'^{*} = \frac{n_{10}/(n_{00}+n_{10})}{n_{11}/(n_{01}+n_{11})} \tag{28}$$

$$b_2'^{*} = \frac{n_{10}/(n_{11}+n_{10})}{n_{00}/(n_{01}+n_{00})} \tag{29}$$

The partial derivatives of b1*=1-b1'* with respect to n11 and n00 respectively are:

$$\frac{\partial b_1^{*}}{\partial n_{11}} = \frac{n_{10}\,n_{01}}{n_{11}^{2}\,(n_{00}+n_{10})} \tag{30}$$

$$\frac{\partial b_1^{*}}{\partial n_{00}} = \frac{n_{10}\,(n_{01}+n_{11})}{n_{11}\,(n_{00}+n_{10})^{2}} \tag{31}$$

The two partial derivatives tell us how much the two different positive examples e11 and e00 respectively raise the DOC of h1. Assume n=n11+n00+n10+n01 is much greater than 1. Then ∂b1*/∂n11 is the increment of b1* raised by a single new piece of evidence e11, and ∂b1*/∂n00 is the increment of b1* raised by a new e00.

Why do people think that a non-black non-raven thing (such as a piece of white chalk) is irrelevant to "All ravens are black"? Firstly, people have never seen a non-black raven, which means n10=0 and n11>0. So b1*=1, no matter how big n00 is. If A includes only birds and h1="Most swans are white", then most researchers should accept that evidence e00 (a non-white non-swan thing) supports h1. If h1 refers to a positive (+) result of a rapid HIV test and means "This person is probably infected by HIV", evidence e00 (a noninfected person whose rapid HIV test is negative) should also support h1.

Secondly, as some researchers have pointed out (Good, 1960), for "All ravens are black", the number n00 of e00 (non-black non-raven things) is so big that a piece of evidence e00 can hardly affect h1 even if there are some counterexamples. Eq. (31) supports this idea, since when n00 is much bigger than n01, ∂b1*/∂n00 is close to 0.

Almost all researchers believe that evidence e11 can support h1 better than e00 (Fitelson and Hawthorne, 2010). However, according to Eq. (30) and (31), we can draw a different conclusion. When (1+n11/n01)n11-n10>n00, ∂b1*/∂n00 > ∂b1*/∂n11. For example, assuming n11=n01=10n10, then when n11>n00/1.9, ∂b1*/∂n00 > ∂b1*/∂n11.
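The comparison in this example can be made concrete with a short numerical sketch. The counts below are hypothetical and chosen only to satisfy n11=n01=10n10; the function names are likewise assumptions. The sketch evaluates the partial derivatives of Eq. (30) and Eq. (31) on either side of the threshold n00=1.9n11 and, as a sanity check, compares them with the change in b1* (from Eq. (28)) caused by adding one example.

```python
def doc_h1(n11, n10, n01, n00):
    """DOC b1* = 1 - b1'*, with b1'* taken from Eq. (28)."""
    return 1 - (n10 / (n00 + n10)) / (n11 / (n01 + n11))

def gain_from_e11(n11, n10, n01, n00):
    """Partial derivative of b1* with respect to n11, Eq. (30)."""
    return n10 * n01 / (n11 ** 2 * (n00 + n10))

def gain_from_e00(n11, n10, n01, n00):
    """Partial derivative of b1* with respect to n00, Eq. (31)."""
    return n10 * (n01 + n11) / (n11 * (n00 + n10) ** 2)

# Hypothetical counts with n11 = n01 = 10 * n10.
n10, n11, n01 = 10, 100, 100
threshold = 1.9 * n11          # e00 is the stronger evidence while n00 < 1.9 * n11

for n00 in (150, 500):         # one value below and one above the threshold
    g11 = gain_from_e11(n11, n10, n01, n00)
    g00 = gain_from_e00(n11, n10, n01, n00)
    # Change in b1* from one extra e11 or one extra e00 (finite differences).
    step11 = doc_h1(n11 + 1, n10, n01, n00) - doc_h1(n11, n10, n01, n00)
    step00 = doc_h1(n11, n10, n01, n00 + 1) - doc_h1(n11, n10, n01, n00)
    print(n00, n00 < threshold, g00 > g11, step00 > step11)
```

With these counts, a new e00 raises b1* more than a new e11 when n00=150, and less when n00=500, which is the behaviour the inequality predicts.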
This means that in many cases, e00 can support h1 better than e11. This conclusion is unexpected. We may also use Eq. (28) to reach a similar conclusion. Assuming n00 is much greater than n10 and n11=n01, when n00 doubles, b1'* is reduced by almost half; yet, when n11 doubles, b1'* is reduced by only 1/4. So, in many cases, an increase in n00 can reduce the degree of disconfirmation of h1 faster than an increase in n11.

Let's use OraQuick HIV tests⁴ to show the importance of e00 as evidence for the confirmation of h1. The result of an HIV test is either + (positive) or - (negative). Here h1 becomes + and h0 becomes -. For a given patient with HIV (e1), the conditional probability of + is P(+|e1), which is called sensitivity in the medical industry. For HIV-noninfected people (e0), the conditional probability of - is P(-|e0), which is called specificity (as shown in Table 7).

Table 7 P(+|E) and P(-|E) for OraQuick HIV Tests
           with HIV (e1)          without HIV (e0)
P(+|E)     sensitivity=0.917      1-specificity=0.001
P(-|E)     1-sensitivity=0.083    specificity=0.999

According to Eq. (24), b1'*=P(+|e0)/P(+|e1)=(1-specificity)/sensitivity=0.001/0.917≈0.0011; b1*=1-0.0011=0.9989; I(E; +^b1*)=5.52 bits. For the test result -, b0'*=(1-sensitivity)/specificity=0.083/0.999=0.083; b0*=1-0.083=0.917; I(E; -^b0*)=0.04 bit.

It is easy to notice that specificity is more important than sensitivity for raising the DOC of h1=+. For example, even if sensitivity is 0.5 (or 0.1), as long as specificity is 1, b1* will be 1, which means that it is absolutely believable to diagnose AIDS according to +. If specificity is 0.5, then even if sensitivity is 1, b1* will be only 0.5, which means that it is only half believable to diagnose AIDS according to +. Similarly, we can prove that sensitivity is more important than specificity for raising the DOC of -. The reason is that, for us to believe a hypothesis, fewer counterexamples are more important than more positive examples.
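The DOC values quoted for the two test results follow directly from sensitivity and specificity. The sketch below is illustrative only; the function names `doc_positive` and `doc_negative` are assumptions, and the formulas are valid only when the ratio being subtracted does not exceed 1.

```python
def doc_positive(sensitivity, specificity):
    """b1* for the result '+': 1 - (1 - specificity) / sensitivity, from Eq. (24)."""
    return 1 - (1 - specificity) / sensitivity

def doc_negative(sensitivity, specificity):
    """b0* for the result '-': 1 - (1 - sensitivity) / specificity."""
    return 1 - (1 - sensitivity) / specificity

# Values of Table 7.
print(round(doc_positive(0.917, 0.999), 4))   # DOC of '+'
print(round(doc_negative(0.917, 0.999), 3))   # DOC of '-'

# Specificity dominates the DOC of '+': perfect specificity gives b1* = 1
# whatever the sensitivity, while perfect sensitivity with specificity 0.5
# gives only b1* = 0.5.
print(doc_positive(0.5, 1.0), doc_positive(0.1, 1.0), doc_positive(1.0, 0.5))
```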
⁴ http://www.oraquick.com/taking-the-test/understanding-your-results

When n is big enough, specificity=n00/(n00+n10) and sensitivity=n11/(n11+n01). So n00 is related to specificity. That is why e00 can support h1 better than e11 in many cases.

The Likelihood Ratio (LR = sensitivity/(1-specificity) = 1/b'*) is used by the medical community to tell how good a positive test result is. It is easy to see that b* is positively related to LR. The difference is that the upper limit of b* is 1, so that b* is suitable as a DOC. In addition, b'* can be used to predict the probability that the testee has the disease.

Statisticians and some doctors often argue about the reliability of medical tests. If the prior probability P(e1) or P(HIV) of a person having HIV is about 0.004, then according to Bayes' formula, the posterior probability P(e1|+)=0.917*0.004/(0.001*0.996+0.917*0.004)=0.786, which is not high enough. If the prior probability P(e1) is 0.0001 instead of 0.004, the posterior probability P(e1|+) will be 0.08. Can we still believe the test? The author answers "yes", as most doctors do. Actually, the degree b1*=0.9989 is independent of the prior probability P(e1), that is, of which group of people the testee belongs to. For predicting the probability that the testee has AIDS given +, most statisticians are right; yet, for deciding whether to believe the HIV test, most doctors are right.

However, to predict the probability, we may use the semantic Bayes' formula, Eq. (26). For example, for a testee who belongs to a high-risk group of people, P(e1)=0.1. Then P(e1|h1^b*)=0.1/(0.1+0.0011*0.9)=0.990. This result is the same as P(e1|+) obtained by Bayes' formula. Yet Eq. (26) is simpler.

The popular confirmation measures, such as d(H, E) (Earman, 1992) and s(H, E) (Christensen, 1999; Joyce, 1999), use LP and conditional LP. One can get the LP from prior knowledge and the conditional LP from one or two pieces of evidence, without using sensitivity and specificity. Yet it is difficult to apply them to medical tests.
One may argue that they are used for the increments of DOC. Yet, as increments, they are too big. If more evidence comes, how do we deal with the measures? By comparison, this paper uses Eq. (15) and (16) for the increments of DOC. If we replace LP with SP in d(H, E), d(+, E) will decrease as P(e1) increases. This is unreasonable. When specificity is 1 and sensitivity is 0.1, the probability P(e1|+) predicted by Bayes' formula or the semantic Bayes' formula is 1, which means the prediction is 100% correct. Yet both d(+, E) and s(+, E) are less than 0.1. For a prediction by + that is 100% accurate, such low DOCs are unreasonable.

Why has no one else proposed b*? The reason might be that, according to Eq. (23), b* might be -∞, so that its upper limit (1) and lower limit (-∞) are asymmetrical. Yet, using the semantic information method, the negative DOC needs another formula, and its lower limit is -1.

7. Negative DOC with "All swans are white" as Example

First, consider the positive DOC for h1="All swans are white". Assuming that the posterior ADOD Q0/Q1=0.01/0.99 is less than the prior ADOD P0/P1=0.2/0.8 (see Table 8), the DOC b* is positive. According to Eq. (22), we get b'*=(Q0/Q1)/(P0/P1)=(0.01/0.99)/(0.2/0.8)=0.0404; b*=1-b'*=0.9596.

Table 8 Positive DOC for "All swans are white" (b*>0)
               white swan (e1)    non-white swan (e0)    average information
P(E)           0.8                0.2
P(E|h1)        0.99               0.01                   IKL(E; h1)=0.2611 bit
T(h1|E)        1                  0                      I(E; h1)=-∞
T(h1^b*|E)     1                  b'*=0.0404             I(E; h1^b*)=0.2611 bit

However, for a lie or a wrong hypothesis, such as "All swans are not white", or a prediction from a stock commentator who is seen as a contrary indicator, or a hypothesis with excessive affirmation, we may change the DOB to a negative value to get more average semantic information. When a negative DOB b (b<0) is given to a general hypothesis hj, hj becomes hj^b, whose truth function is defined as
T(hj^b|E)=1+bT(hj|E), for b<0 (32)
Now the DOD b'=1-|b|=1+b.
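A small sketch may help to see how a signed DOB reshapes a truth function. The negative branch is Eq. (32); the positive branch uses the form 1-b+bT(hj|E), which is the expression the proof of Theorem 2 writes out for b>0 and is assumed here to be what Eq. (20) states. The function name `truth_with_dob` is hypothetical.

```python
def truth_with_dob(t, b):
    """Truth value T(hj^b|E) from the original truth value t = T(hj|E).

    b is the degree of belief, assumed to lie in [-1, 1].
    """
    if b >= 0:
        return 1 - b + b * t      # positive DOB: b' + b*T(hj|E), with b' = 1 - b
    return 1 + b * t              # negative DOB, Eq. (32)

# Positive DOB (b = 0.9596, as in Table 8): a counterexample (t = 0) keeps
# truth value b' = 0.0404, and a positive example (t = 1) keeps truth value 1.
print(truth_with_dob(0.0, 0.9596), truth_with_dob(1.0, 0.9596))

# Negative DOB (b = -0.808): the roles swap, and the former positive
# example (t = 1) is left with truth value b' = 1 + b = 0.192.
print(truth_with_dob(1.0, -0.808), truth_with_dob(0.0, -0.808))
```

In both branches the smaller of the two truth values equals the DOD b'=1-|b|, so a single number b controls how strongly counterexamples (for b>0) or positive examples (for b<0) are penalized.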
Figure 8 illustrates how positive b and negative b affect T(h1^b|E). Now consider the case where counterexamples increase so that the DOC b* of h1="All swans are white" is negative. Assume that the prior ADOD is 0.01/0.99 and that the posterior ADOD increases to 0.05/0.95 after more black swans occur (as shown in Table 9).

Figure 8 How positive b and negative b affect T(h1^b|E)

Table 9 Negative DOC for "All swans are white" (b*<0)
               white swan (e1)    non-white swan (e0)    average information
P(E)           0.99               0.01
P(E|h1)        0.95               0.05                   IKL(E; h1)=0.060 bit
T(h1|E)        1                  0                      I(E; h1)=-∞
T(h1^b*|E)     b'*=0.192          1                      I(E; h1^b*)=0.060 bit

According to Eq. (3) and (32), we get T(A1)=P0+b'P1. According to Eq. (9) and (32), we get

$$I(E; h_1^b) = Q_1 \log\frac{b'}{P_0+b'P_1} + Q_0 \log\frac{1}{P_0+b'P_1} \tag{33}$$

To optimize b, P0>0 is needed. Assuming P0>0 and setting the derivative dI(E; h1^b)/db'=0, we get
b'*=(P0/P1)/(Q0/Q1) (34)
where (P0/P1)/(Q0/Q1) must be less than 1; otherwise Eq. (22) is needed. According to Eq. (32), we have
b*=b'*-1=(P0/P1)/(Q0/Q1)-1 (35)
Therefore, when the posterior ADOD increases from 0.01/0.99 to 0.05/0.95, b'*=(0.01/0.99)/(0.05/0.95)=0.192; b*=b'*-1=-0.808.

According to Eq. (35), if Q0>Q1, the posterior ADOD should be negative and equal to Q1/Q0-1. Similarly, if P0>P1, the prior ADOD is equal to P1/P0-1<0.

Eq. (34) is not the only formula that yields a negative DOC. Eq. (22) can also be used to get a negative DOC for excessive negation, and Eq. (34) can also be used to get a positive DOC for proper negation (as shown in Table 10).

Theorem 1. A denial hypothesis h0(E)=¬h1(E) with negative DOB b0 is equal to the affirmative hypothesis h1(E) with positive DOB |b0|, i.e.,
T(h0^b0|E)=T(h1^{|b0|}|E), for Q0/Q1≤P0/P1 and b0<0 (36)
Proof: According to fuzzy logic (Zadeh, 1965), T(h0|E)=1-T(h1|E). According to Eq. (32) and the fuzzy logic, T(h0^b0|E)=1+b0T(h0|E)=1+b0(1-T(h1|E))=1+b0-b0T(h1|E). According to Eq. (20), 1+b0-b0T(h1|E)=1-|b0|+|b0|T(h1|E)=T(h1^{|b0|}|E). Hence T(h0^b0|E)=T(h1^{|b0|}|E).
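The numbers in Table 9 and the identity in Theorem 1 can be reproduced with a few lines. This is a sketch under stated assumptions: logarithms are taken to base 2, the function name `info_negative_dob` is hypothetical, and the value b0=-0.6 used for the Theorem 1 check is an arbitrary illustration, not a value from the paper.

```python
import math

def info_negative_dob(b_prime, p1, q1):
    """I(E; h1^b) of Eq. (33), in bits, for a negative DOB b = b' - 1."""
    p0, q0 = 1 - p1, 1 - q1
    t_a1 = p0 + b_prime * p1                  # logical probability T(A1) = P0 + b'P1
    return q1 * math.log2(b_prime / t_a1) + q0 * math.log2(1 / t_a1)

# Table 9: prior P(E) = (0.99, 0.01) and posterior P(E|h1) = (0.95, 0.05).
P1, Q1 = 0.99, 0.95
P0, Q0 = 1 - P1, 1 - Q1

b_prime_star = (P0 / P1) / (Q0 / Q1)          # Eq. (34)
b_star = b_prime_star - 1                     # Eq. (35)
info_star = info_negative_dob(b_prime_star, P1, Q1)
kl_bits = Q1 * math.log2(Q1 / P1) + Q0 * math.log2(Q0 / P0)
print(round(b_prime_star, 3), round(b_star, 3), round(info_star, 3), round(kl_bits, 3))

# Theorem 1, checked pointwise: h0 with negative DOB b0 has the same truth
# function as h1 with positive DOB |b0|.  t stands for T(h1|E).
b0 = -0.6
theorem1_holds = all(
    abs((1 + b0 * (1 - t)) - (1 - abs(b0) + abs(b0) * t)) < 1e-12
    for t in (0.0, 0.25, 0.5, 1.0)
)
print(theorem1_holds)
```

As in the positive case, the optimized information equals the Kullback-Leibler information, since at b'* the ratio b'*/T(A1) equals Q1/P1 and 1/T(A1) equals Q0/P0.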
Table 10 DOC in 4 cases and the meanings of hypotheses with DOC

Initial hypothesis h1(E)="All swans are white"
    The ratio of counterexamples decreases, Q0/Q1≤P0/P1, b'*=(Q0/Q1)/(P0/P1):
        b*=1-(Q0/Q1)/(P0/P1)>0 for proper affirmation; h1^b*≈"There are more white swans than we expect"
    The ratio of counterexamples increases, Q0/Q1>P0/P1, b'*=(P0/P1)/(Q0/Q1):
        b*=(P0/P1)/(Q0/Q1)-1<0 for excessive affirmation; h1^b*≈"There are fewer white swans than we expect"

Initial hypothesis h0(E)="All swans are not white"
    The ratio of counterexamples decreases, Q0/Q1≤P0/P1, b'*=(Q0/Q1)/(P0/P1):
        b0*=(Q0/Q1)/(P0/P1)-1=-b*<0 for excessive negation; h0^b0*≈"There are fewer non-white swans than we expect"="There are more white swans than we expect"≈h1^{|b0*|}=h1^b*
    The ratio of counterexamples increases, Q0/Q1>P0/P1, b'*=(P0/P1)/(Q0/Q1):
        b0*=1-(P0/P1)/(Q0/Q1)=|b*|>0 for proper negation; h0^b0*≈"There are more non-white swans than we expect"="There are fewer white swans than we expect"≈h1^{-b0*}=h1^b*

Theorem 2. A denial hypothesis h0(E)=¬h1(E) with positive DOB b0 is equal to the affirmative hypothesis h1(E) with negative DOB -b0, i.e.,
T(h0^b0|E)=T(h1^{-b0}|E), for Q0/Q1>P0/P1 and b0>0 (37)
Proof: T(h0^b0|E)=1-b0+b0T(h0|E)=1-b0+b0(1-T(h1|E))=1-b0T(h1|E)=1+(-b0)T(h1|E)=T(h1^{-b0}|E).

The adjustment of the DOB to increase average information cannot be applied to all hypotheses or predictions. Most blind guesses do not convey positive average information, no matter how their DOBs are adjusted. For example, if someone always predicts the rises or falls of stock markets by tossing a coin, then the DOC of his predictions can only be 0. Most wrong hypotheses still cannot convey meaningful positive information even if their DOBs are adjusted to negative values. For example, the wrong prediction "Tomorrow is the end of the world" with any DOB can only convey information close to or less than 0.

8. The Degree of Confirmation of General Hypotheses

In Section 5, a fuzzy universal hypothesis is formalized as h1^b="For all E, if E∈S1, then E∈A1". Now consider weather forecasts, GPS, and various fuzzy hypotheses, including fuzzy inferences and predictions.
A general fuzzy hypothesis is formalized as:
hj or hj^b = "For all E and Z, if Z∈Cj, then E∈Aj"
GPS is a good example of general hypotheses. Now E denotes the real position of a GPS device, and H denotes the position pointed to by the GPS arrow. If E=ei and H=hj=êj, then êj is the estimate of ei by GPS according to condition Z (the distances to three or more satellites and other factors).

The initial hypothesis hj may be fuzzy or non-fuzzy. By making some simplifying assumptions so that the initial hypothesis is non-fuzzy, we can also calculate the DOC b* of êj provided by GPS as above.

Circular Error Probability (CEP)⁵ is often used to express the accuracy of GPS. CEP=10 meters means that, for a given position ei of the GPS device, the probability that the estimate hj=êj is not farther than 10 meters from ei is 0.5. Let Si denote the circle with a 10-meter radius centered at ei. Then CEP=10 meters means that, for a given ei, the probability of correct estimates (êj in Si) is Σ_{êj∈Si} P(êj|ei)=0.5, and the probability of wrong estimates is 1-0.5=0.5. Let us simply assume that P(êj|ei) is the same for all correct estimates and equal to p1, and that P(êj|ei) is the same for all wrong estimates and equal to p0; that the number of all positions in a circle with a 10-meter radius is n; and that the number of all possible positions is 1000n. Then p1=0.5/n and p0=0.5/(999n). Hence, according to Eq. (24) and (25), b'*=p0/p1=0.5/(999n)/(0.5/n)=1/999; b*=1-b'*=998/999.

If the initial hypothesis hj is fuzzy, we may seek the DOB b* that makes the average semantic information reach its maximum:

$$b^{*} = \arg\max_{b} I(E; h_j^b) = \arg\max_{b} \sum_i P(e_i|h_j) \log\frac{T(h_j^b|e_i)}{T(h_j^b)} \tag{38}$$

which is the definition of the DOC of a general hypothesis.

Another popular measure of GPS accuracy is Distance Root Mean Square (DRMS)⁶. DRMS=10 means that the standard deviation between Ê and ei is 10 meters.
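The CEP calculation above can be retraced with exact fractions. The sketch below is illustrative: the function name `cep_doc` and the value n=100 are assumptions (the result does not depend on n), while the ratio of 1000 possible positions per position inside the circle and the hit probability 0.5 are the ones assumed in the text.

```python
from fractions import Fraction

def cep_doc(inside_count, total_count, hit_probability=Fraction(1, 2)):
    """Return (b'*, b*) for a non-fuzzy position estimate under the CEP assumptions.

    inside_count: number of positions within the CEP circle (n in the text).
    total_count: number of all possible positions (1000n in the text).
    """
    p1 = hit_probability / inside_count                        # P(ej|ei) for each correct estimate
    p0 = (1 - hit_probability) / (total_count - inside_count)  # P(ej|ei) for each wrong estimate
    b_prime = p0 / p1                                          # Eq. (24)
    return b_prime, 1 - b_prime                                # Eq. (25)

n = 100                                  # hypothetical number of positions inside the circle
b_prime_gps, b_gps = cep_doc(n, 1000 * n)
print(b_prime_gps, b_gps)                # 1/999 998/999
```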
Generally we consider P(Ê|ei) to be a normal distribution:

$$P(\hat{E}|e_i) = k \exp\left[-\frac{|\hat{E}-e_i|^2}{2d^2}\right] \tag{39}$$

where k is the normalizing coefficient; ei is the real position of the GPS device; Ê is the estimate of ei; and d is the DRMS or standard deviation, which indicates the precision of GPS. DRMS=10 means that the probability that the deviation is within 10 meters is 65%. In this case, the initial hypothesis hj is fuzzy. Assuming that Eq. (39) holds for different E, we have

$$P(\hat{E}|E) = k \exp\left[-\frac{|\hat{E}-E|^2}{2d^2}\right] \tag{40}$$

However, the above distribution holds only for ideal estimation. The actual estimate may be a function of the condition Z, i.e., Ê=f(Z). The deviation distribution may be

$$P(\hat{E}|E) = k \exp\left[-\frac{|\hat{E}-\Delta e-E|^2}{2d^2}\right] + c \tag{41}$$

where Δe is the systematic deviation (Δe=0 means the highest accuracy), and c means that there are more long-distance deviations, which come from wrong maps or systematic failures. It is c that determines the DOC b* of Ê.

⁵ http://www.igage.com/mp/GPSAccuracy.htm
⁶ See http://www.radio-electronics.com/info/satellite/gps/accuracy-errors-precision.php

To confirm H=Ê, the average semantic information becomes semantic mutual information:

$$I(E; H^b) = \sum_j \sum_i P(e_i)\, P(\hat{e}_j|e_i) \log\frac{T(h_j^b|e_i)}{T(h_j^b)} \tag{42}$$

Assuming the optimized truth function is

$$T(\hat{e}_k^{b^*}|E) = b^{*} \exp\left[-\frac{|\hat{e}_k-E|^2}{2d^{*2}}\right] + 1 - b^{*}, \quad k=1, 2, \ldots \tag{43}$$

by using Eq. (34) and (18), we can derive that when êk=êj-Δe (j=1, 2, ...), d*=d, and b*=1-c/(k+c), the semantic mutual information I(E; H^b*) reaches its maximum.

From this example, it is easy to see that we need to consider three factors when selecting a hypothesis from many with an information criterion: accuracy, precision, and DOC. If E is not equally probable, the LP T(Aj) is a better measure of precision than d, because T(Aj) is also related to P(E) and determines the testing severity. Actually, optimizing T(H|E) with P(H|E) is optimizing the semantic channel so that it matches Shannon's channel.
If a GPS user wants to confirm or optimize the estimate êj (or Ê), he needs P(êj|E), which cannot be obtained directly. When E is equally probable, i.e., P(E)=a (a is a constant), we can derive P(Ê)=a and P(êj|E)=P(E|êj)=P(E, êj)/a. So the user may put the GPS device at different positions with equal probability to record E and Ê. Then he can get P(Ê, E) and P(êj|E), j=1, 2, ..., n. If the probability distribution P(E|êj) of samples is obtained when P(E) is not constant, one could, according to Eq. (19), use P(E|êj)/P(E) in place of P(êj|E) to confirm êj.

In comparison with MLE, MSIE has two advantages: 1) The MSIE can be used in cases where the source P(E) is variable; 2) Generally, P(hj|E) is more regular than P(E|hj), so that it is easier to construct T(Aj|E) proportional to P(hj|E) than to construct P(E|Aj) close to P(E|hj). For example, P(E|Aj) predicted by GPS will not be a normal distribution and changes from place to place because roads are irregular; yet P(hj|E) is approximately a normal distribution and almost unchanging.

9. Conclusions and Discussions

In this paper, the semantic information measure is used as the main tool for falsification and confirmation. This measure is compatible with Shannon's theory, Popper's theory, and Fisher's likelihood method. With the generalized Kullback-Leibler formula, we can use an objective sampling distribution to test a subjective probability prediction (likelihood function), and use an objective selecting rule function (or Shannon channel) to confirm a subjective truth function (or semantic channel).

The basic conclusions about falsification and confirmation are: 1) For the falsification (including test, selection, and optimization) of hypotheses, we need semantic information as the criterion, which means that the more precise and unexpected a hypothesis is a priori, and the more accurate and believable it is a posteriori, the more information it conveys and hence the better it is.
2) Confirmation, for which fewer counterexamples are more important than more positive examples, is to get the optimized degrees of belief in hypotheses so as to increase average semantic information, and hence is a helper of falsification.

MSIE and MLE use the same semantic information criterion in essence and are compatible with Shannon's information theory. But the MSIE is more suitable for cases where the source P(E) is variable, the channel P(H|E) is stable and regular, and the number of samples is huge.

The MSIE requires that samples are independent and that their distribution is stable. If there are only a few samples, the distribution P(E|hj) may be unstable and will result in overfitting. To resolve this problem, we may decrease the degree of confirmation obtained from the samples. Can we set up a relationship between the number of samples and the degree of confirmation, or limit in advance the extents of precision and degree of belief in cases with few samples? This is a question that needs further study.

The Maximum A Posteriori (MAP) estimation (DeGroot, 1970) may get better results when there are few samples. Besides likelihood, MAP also uses the prior probability P(θ) or P(hj). Yet MAP is not compatible with Shannon's information theory. Is the prior probability P(θ) statistical or logical? This could be a question. Similarly, the MSIE uses the prior probability distribution P(E) and a similar Bayes formula (P(hj|E) =...). In cases with few samples, how can we make better use of prior knowledge? This deserves further study as well.

References

Adriaans, Pieter. 2010. "A critical analysis of Floridi's theory of semantic information." Knowledge, Technology & Policy 23:41-56.
Akaike, Hirotugu. 1974. "A New Look at the Statistical Model Identification." IEEE Transactions on Automatic Control 19:716–723.
Alderton, David. 2005. The Illustrated Encyclopedia of Birds of the World. Wigston: Anness Publishing Ltd.
Aldrich, John. 1997. "R. A. Fisher and the Making of Maximum Likelihood 1912–1922." Statistical Science 12:162–176.
Bar-Hillel, Yehoshua, and Rudolf Carnap. 1952. "An Outline of a Theory of Semantic Information." Tech. Rep. No. 247, Research Lab. of Electronics, MIT.
Carnap, Rudolf. 1952. The Continuum of Inductive Methods. Chicago: University of Chicago Press.
Christensen, David. 1999. "Measuring Confirmation." The Journal of Philosophy 96:437–461.
D'Alfonso, Simon. 2011. "On Quantifying Semantic Information." Information 2:61-101.
DeGroot, Morris H. 1970. Optimal Statistical Decisions. New York: McGraw-Hill.
Earman, John. 1992. Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory. Cambridge, MA: MIT Press.
Eells, Ellery, and Branden Fitelson. 2000. "Measuring Confirmation and Evidence." Journal of Philosophy 97:663–672.
Fitelson, Branden. 1999. "The Plurality of Bayesian Measures of Confirmation and the Problem of Measure Sensitivity." Philosophy of Science 66:S362–378.
---. 2007. "Likelihoodism, Bayesianism, and Relational Confirmation." Synthese 156:473–489.
Fitelson, Branden, and James Hawthorne. 2010. "How Bayesian Confirmation Theory Handles the Paradox of the Ravens." In Eells and Fetzer (eds.), The Place of Probability in Science. Open Court: 247-275.
Floridi, Luciano. 2004. "Outline of a theory of strongly semantic information." Minds and Machines 14:197-221.
---. 2005/2015. "Semantic conceptions of information." In Stanford Encyclopedia of Philosophy, ed. Edward N. Zalta. http://plato.stanford.edu/entries/information-semantic/
Good, Irving John. 1960. "The Paradox of Confirmation." The British Journal for the Philosophy of Science 11:145-149.
Hawthorne, James. 2004/2012. "Inductive Logic." In Stanford Encyclopedia of Philosophy, ed. Edward N. Zalta. http://plato.stanford.edu/entries/logic-inductive/
---. 2005. "Degree-of-Belief and Degree-of-Support: Why Bayesians Need Both Notions." Mind 114:277–320.
Hempel, Carl G. 1945. "Studies in the Logic of Confirmation."
Mind 54:1–26 and 97–121.
Huber, Franz. 2005. "What Is the Point of Confirmation?" Philosophy of Science 72:1146–1159.
Joyce, James. 1999. The Foundations of Causal Decision Theory. New York: Cambridge University Press.
Klir, George J. 2005. Uncertainty and Information: Foundations of Generalized Information Theory. Hoboken: John Wiley.
Kullback, Solomon, and Richard Leibler. 1951. "On Information and Sufficiency." Annals of Mathematical Statistics 22:79–86.
Lu, Chenguang. 1991. "B-fuzzy set algebra and a generalized cross-information equation." Fuzzy Systems and Mathematics (in Chinese) 5:76-80.
---. 1993. A Generalized Information Theory (in Chinese). Hefei: China Science and Technology University Press.
---. 1994. "Meanings of generalized entropy and generalized mutual information for coding." J. of China Institute of Communication (in Chinese) 15:37-44.
---. 1999. "A generalization of Shannon's information theory." Int. J. of General Systems 28(6):453-490.
---. 2012. "GPS Information and Rate-Tolerance and its Relationships with Rate Distortion and Complexity Distortions." Journal of Chengdu University of Information Technology (in Chinese) 6:27-32.
Maher, Patrick. 1999. "Inductive Logic and the Ravens Paradox." Philosophy of Science 66:50–70.
Milne, Peter. 1996. "log[P(h/eb)/P(h/b)] Is the One True Measure of Confirmation." Philosophy of Science 63:21-26.
Popper, Karl. 1935/1959. Logik Der Forschung: Zur Erkenntnistheorie Der Modernen Naturwissenschaft. Wien: J. Springer. English translation: The Logic of Scientific Discovery. London: Hutchinson.
---. 1963/2005. Conjectures and Refutations. Repr. London and New York: Routledge.
Rosie, Aeneas M. 1966. Information and Communication Theory. New York: Gordon and Breach.
Scheffler, Israel, and Nelson Goodman. 1972. "Selective Confirmation and the Ravens." Journal of Philosophy 69:78-83.
Shannon, Claude E. 1948. "A mathematical theory of communication." Bell System Technical Journal 27:379–429, 623–656.
Tentori, Katya, Vincenzo Crupi, Nicolao Bonini, and Daniel Osherson. 2007. "Comparison of Confirmation Measures." Cognition 103:107-119.
Wang, P. Zhuang, and Elie Sanchez. 1982. "Treating a Fuzzy Subset as a Projectable Random Set." In Fuzzy Information and Decision, ed. Madan M. Gupta and Elie Sanchez. Oxford: Pergamon Press: 212-19.
Wittgenstein, Ludwig. 1958. Philosophical Investigations. Oxford: Basil Blackwell Ltd.
Zadeh, Lotfi A. 1965. "Fuzzy Sets." Information and Control 8:338–53.
---. 1986. "Probability Measures of Fuzzy Events." J. of Mathematical Analysis and Applications 23:421-27.

Appendix: List of Symbols

A={e1, e2...} is a set of evidences; ei is one element in A; E∈A is a variable
Aj is a fuzzy subset of A; the elements in Aj make hypothesis hj true
B={h1, h2...} is a set of hypotheses; hj is one element in B; H∈B is a variable
C={z1, z2...} is a set of conditions; Cj is a subset of C. When Z∈Cj, hj is selected.
hj(ei): a proposition, such as hj(ei)="ei≈ej"; hj(E): a predicate, such as hj(E)="E≈ej"
hj^b: hj whose degree of belief is b
T: logical probability or true value; T(hj)=T(Aj): logical probability of hj=hj(E)
T(hj|ei)=T(Aj|ei): true value of a proposition hj(ei)
T(hj|E)=T(Aj|E): truth function of a predicate hj(E)
P: statistical probability; P(E): prior probability (function), or source; P(ei): the prior probability of ei
P(hj): statistical or selective probability of hj
T(hj)=T(Aj): logical probability or average true value of hj(E)
P(E|Aj)=P(E)T(Aj|E)/T(Aj): probability, or theoretical prediction
P(E|hj)=P(E)P(hj|E)/P(hj): sampling distribution, inverse conditional probability
P(hj|E): conditional probability, selective probability (function) of hj
e(t): the evidence under the t-th condition z(t), t=1, 2, ..., w; e(t)∈A
Assume the number of ei in {e(1), e(2), ..., e(w)} is wi and w is big enough; then P(ei|Cj)=wi/w.
P(E|Cj)=P(E|Z∈Cj)=P(E|hj) is the sampling distribution on A under conditions in Cj.
P(E|Aj): likelihood
b: Degree of Belief (DOB); b'=1-|b|: Degree of Disbelief (DOD)
b*: Degree of Confirmation (DOC), i.e., optimized DOB
b'*: optimized degree of disbelief
ADOD: Absolute Degree of Disbelief
I(ei; hj): semantic information conveyed by hj about ei
I(E; hj): average information conveyed by hj about E
KL: Kullback-Leibler
GKLF: generalized Kullback-Leibler Formula
LP: logical probability; SP: statistical probability or selected probability
SIM: Semantic Information Measure
MLE: Maximum likelihood estimation
MSIE: Maximum semantic information estimation