1 Introduction

Replicability has become a widely discussed issue in many empirical sciences in the past decade, and its importance is now acknowledged by the scientific community. Since experimental philosophy is typically grounded in empirical evidence, replicability is no less important in this research area than in other areas.

In this paper, we report the results of three high-powered replications in experimental philosophy (or ‘x-phi’ for short), which bear on an alleged instability of folk philosophical intuitions [Footnote 1]: the purported susceptibility of epistemic intuitions about the Truetemp case (Lehrer, 1990) to order effects. Evidence for this susceptibility was first reported by Swain et al. (2008); further evidence was then found in two studies by Wright (2010) and Weinberg et al. (2012) [Footnote 2]. These empirical results have been quite influential in the metaphilosophical debate about the method of cases (e.g., Horvath, 2010; Alexander, 2012, p. 79; Cappelen, 2012, pp. 167, 220–222; Wright, 2016; Machery, 2017, pp. 70–71), including in recent metaphilosophical discussions (e.g., Lycan, 2019, p. 103; Strevens, 2019, pp. 72, 199; Nado, 2021, Footnote 1; Woodward, 2021, p. 165). Here is how Swain et al. (2008, p. 152) summarize the metaphilosophical significance of their findings:

Specifically, we found that intuitions about the Truetemp Case vary depending on whether, and which, other cases are presented before it. Such variability calls into question the legitimacy of using the intuitions generated by the Truetemp Case as evidence against reliabilism. But it is unclear what about this case makes it susceptible to these effects, which raises questions about the reliance on intuitions about thought-experiments more generally, especially given that this is not the only case called into question by empirical research.

Given the considerable impact of their experimental results, it seems particularly important to ascertain whether they are replicable, and thus likely to track an actual psychological effect. This replication will also provide evidence relevant to the emerging debate about the extent to which subtle manipulations can impact philosophical judgments (Cova et al., 2021; Knobe, 2021). As we will see, our replications by and large failed to corroborate Swain et al.’s and Wright’s findings.

In Sect. 2, we draw a broader picture of the role of replications in empirical sciences. Section 3 briefly outlines the experiments we replicated [Footnote 3] and their philosophical importance. In Sects. 4, 5, and 6, we describe the methods and results of our three replications. In the final Sect. 7, we discuss some plausible conclusions from our results and assess their metaphilosophical significance.

2 The importance of replications

For decades, replications were quite rare, including for influential studies, especially in the social sciences. Spectacularly failed replications of popular and widely cited studies have only recently sparked controversy and encouraged the scientific community to reconsider the importance of replications for scientific progress. Furthermore, scientists have led large-scale replication projects to evaluate entire research areas with respect to their replicability. The Open Science Collaboration (2015), for instance, replicated 100 studies from the recent literature in psychology and successfully corroborated less than half of them (the exact ratio depends on the chosen criterion).

Obviously, an unsuccessful replication of a study that showed an effect E does not necessarily show that E is a false positive. The replication might not be powerful enough to detect a genuine, but small effect (Machery et al., 2020); moreover, some replications will fail to confirm the original finding by chance, even if the original study discovered a genuine effect. However, the concerningly low replicability in large-scale replication projects in the social sciences cannot simply be explained away by pointing at issues of this kind. Rather, it draws attention to much larger and more problematic issues, such as the role of publication practices in distorting scientific evidence (e.g., publication bias), and the damaging impact of questionable research practices, such as p-hacking (i.e., the set of practices that increase the probability of getting a statistically significant result; Simmons et al., 2011).

The first and, so far, only large-scale replication project in experimental philosophy (Cova et al., 2021), which reran 40 studies, brought better news about replicability than what was found in psychology and other disciplines: about three quarters of the original x-phi findings were corroborated. This does not mean, however, that experimental philosophy does not have its own spectacular replication failures. One of the foundational works of the x-phi movement by Weinberg et al. (2001), who reported multiple cross-cultural differences in folk epistemic intuitions, is a good example here. Except for the difference between Americans’ and Indians’ knowledge attributions in the Zebra case (Dretske, 1970), which was recently successfully replicated by Sękowski et al. (2021), attempts to confirm these effects have all failed (e.g., Kim & Yuan, 2015; Seyedsayamdost, 2015; see also Machery et al., 2017). Nevertheless, the original results were, for many years, taken as genuine evidence in favor of the negative program in experimental philosophy, which challenges the use of intuitions about cases in philosophy (e.g., Alexander, 2012; Machery, 2017; Weinberg, 2007). In the remainder of this article, we argue that a similar problem affects the Truetemp order effects reported by Swain et al. (2008) and Wright (2010), whose data were also used to fuel the program of negative x-phi.

3 The alleged instability of epistemic intuitions

The Truetemp case, introduced by Lehrer (1990), plays a central role in Swain et al.’s (2008) and Wright’s (2010) studies. This hypothetical scenario is important for epistemology because it is supposed to elicit intuitions inconsistent with a prominent account of knowledge, reliabilism (endorsed, for example, by Armstrong, 1973 and Goldman, 1979, 1986). According to this view, knowing that p is to have a true belief that p that results from a sufficiently reliable cognitive process of belief acquisition. In the Truetemp case, however, the protagonist, Mr. Truetemp, acquires an unusual ability by a stroke of pure luck: he can precisely determine the temperature of his surroundings, but he is unaware that he has this ability. Now the question arises: does Mr. Truetemp know that the temperature in his room is 71 °F when he forms the belief that it is 71 °F in his room now? [Footnote 4] Lehrer argues that, although the reliabilist conditions for knowledge are met here, Mr. Truetemp does not know the relevant proposition; hence, the Truetemp case would be a counterexample to reliabilism about knowledge.

The main idea of Swain et al. (2008) is that the Truetemp case elicits unstable intuitions because knowledge attributions in this case depend on what the reader has read beforehand. They predicted a contrastive pattern of responses: subjects should be more likely to attribute knowledge in the Truetemp case when it is preceded by a clear case of non-knowledge than when it is preceded by an uncontroversial case of knowledge; they also predicted that presenting the Truetemp case on its own would yield knowledge ratings falling somewhere in between the two other conditions (for their vignettes, see Sect. 4.2) [Footnote 5]. And indeed, they reported the predicted contrastive effect (Fig. 1): when they asked laypeople whether the protagonist in the Truetemp case knew the temperature in his room, and offered them a 5-point Likert scale from “strongly agree” (5) to “strongly disagree” (1), those who first saw the clear case of knowledge were less likely to agree with that claim (M = 2.4) than those who first saw the clear case of non-knowledge (M = 3.2), with the ratings of those who first saw the Truetemp case being somewhere in the middle (M = 2.8); these differences are reported as statistically significant.

Fig. 1

Average knowledge ratings concerning the Truetemp case in Swain et al.’s (2008) experiment as a function of order of presentation (“TrueTemp” is the condition where the Truetemp case was presented first); 1: “strongly disagree”; 5: “strongly agree” (the statistics provided by Swain et al. are not sufficient to calculate confidence intervals)

As we have seen, Swain and colleagues take these results to undermine the trustworthiness of Truetemp intuitions and even take their conclusion one step further, by questioning the role of intuitions in thought experiments more generally.

However, the empirical evidence provided by Swain et al. (2008) in favor of their bold metaphilosophical claims may not be sufficiently robust. Weinberg et al. (2012) replicated and corroborated their earlier study, but the results were only “marginally significant,” i.e., they were in fact non-significant, for participants who had a low and intermediate need for cognition. Nonetheless, they assert that “These numbers suggest that the Truetemp intuitions of our lower-NFC [need for cognition] subjects are trending towards a pattern like that reported in Swain et al. (2008)” (2016, p. 308). The results for participants with a high need for cognition did turn out significant, but in the opposite direction from the original Swain et al. study; in addition, the total sample size was so small (n = 28) that this finding needs to be taken with an extra grain of salt. Furthermore, in a recent paper, Ziółkowski (2021) raises substantial worries about Swain et al.’s statistical analysis and their experimental design and, more importantly, reports the results of three replications of his own that attempted to find the order effect in question. None of these attempts succeeded, however: order of presentation had no impact on subjects’ knowledge ratings for the Truetemp case.

Ziółkowski’s (2021) study puts Swain et al.’s (2008) empirical results in jeopardy, but it does not fully answer the question of whether the order effect in question is real. Although his replications had a noticeably larger sample size than the original study, they might still have been underpowered to detect the effect in question if it is fairly small. Furthermore, Wright (2010), who adopted slightly different methods, found similar order effects, and Ziółkowski did not attempt to replicate them.

Wright’s (2010) experiments used different variants of the Truetemp vignette in each study, only one of which was taken directly from Swain et al. (see Sects. 5.2 and 6.2). She also used a dichotomous response format (“yes”/“no”) instead of a Likert scale. Wright’s Experiment 1 included three conditions: the Truetemp case preceded by a clear case of knowledge, by a clear case of non-knowledge, and by the Fake-Barn case (Goldman, 1976). She reported a pattern similar to Swain et al.’s (Fig. 2): laypersons were more likely to attribute knowledge in the Truetemp case when it was preceded by the clear case of non-knowledge (55%) than when it was preceded by the clear case of knowledge (40%). Moreover, presenting the Fake-Barn case before the Truetemp case yielded the lowest ratio of knowledge attributions (26%). The study did not include a baseline condition (i.e., the Truetemp case just on its own).

Fig. 2

Percentages of knowledge attributions in Wright’s (2010) Experiment 1 as a function of order of presentation (“TT” is short for “Truetemp”) (the statistics provided by Wright are not sufficient to calculate confidence intervals)

In Experiment 2, Wright tested several additional vignettes that do not bear on our discussion, but she also included the Truetemp manipulation (Fig. 3); the formulations of the Truetemp case and of one of the preceding cases were different from those in Experiment 1 (see Sect. 6.2), but once again, participants were more likely to attribute knowledge when Truetemp was presented after the clear case of non-knowledge (84%) than when it was presented after the clear case of knowledge (57%).

Fig. 3

Percentages of knowledge attributions in Wright’s (2010) Experiment 2 as a function of order of presentation (“TT” is short for “Truetemp”) (once again, the statistics reported by Wright did not allow us to calculate confidence intervals)

Given the importance of the Truetemp case in epistemology and the uncertain status of the results reported by Swain et al. and Wright, we ran a series of high-powered replications of Swain et al.’s and Wright’s experiments to determine whether Truetemp intuitions are in fact unstable in the suggested way, i.e., depending on the previous presentation of clear examples of knowledge or non-knowledge. None of the replications below are what are known as “exact replications” or, in Machery’s (2020a) terminology, experimental-units replications; they are “conceptual replications.” That is, in addition to changing the participants, our replications also changed other aspects of the original experimental designs (including the recruitment method) [Footnote 7], either because we had concerns with the design of the original studies, or because the original experimental designs were needlessly complex for our main purpose of assessing whether knowledge attributions in the Truetemp case can be influenced by the order of presentation. We will revisit these issues in Sect. 7.

4 Experiment 1: Replication of Swain et al. (2008)

The first experiment is a very high-powered replication of the three crucial conditions from Swain et al. (2008). It tests whether knowledge attributions in the Truetemp case decrease when a clear case of knowledge is presented first, and whether they increase when a clear case of non-knowledge is presented first. The baseline condition, where only the Truetemp case is presented, serves as the third condition. Study materials and data (for this and the two subsequent experiments) are publicly available at: https://osf.io/q56ru/.

4.1 Participants

In this and all following experiments, participants were recruited on the online platform Prolific (https://www.prolific.co), completed an online survey on Unipark (https://www.unipark.com), and were required to be native English speakers. As preregistered (https://osf.io/j5zpk), the experiment was run until valid [Footnote 8] responses from 1626 participants (542 in each condition) were collected, which results in 95% power for detecting a small effect of d = 0.2 (one-tailed t-test at a standard .05 significance level) between two conditions. Average age was 37 years, 39% were male, 60% female, and 1% non-binary. Participants received £0.20 for an estimated 2 min of their time (£6/h).
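The per-condition sample size can be checked with a rough calculation. The following is a minimal sketch that uses the normal approximation to the two-sample t-test; it is not necessarily the procedure used for the preregistration, and the function name `n_per_group_ttest` is an illustrative assumption.

```python
import math
from statistics import NormalDist


def n_per_group_ttest(d, alpha=0.05, power=0.95, one_tailed=True):
    """Approximate per-group n for a two-sample t-test (normal approximation).

    d is the standardized mean difference (Cohen's d) to be detected.
    """
    z = NormalDist()
    # Critical value for the chosen significance level
    z_alpha = z.inv_cdf(1 - alpha if one_tailed else 1 - alpha / 2)
    # Quantile corresponding to the desired power
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)


# Values taken from the text: d = 0.2, alpha = .05 (one-tailed), power = .95
print(n_per_group_ttest(0.2))
```

Under these assumptions the approximation yields 542 participants per condition, which agrees with the sample size reported above.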

4.2 Design, procedure, and materials

Participants were randomly assigned to one of three conditions (Baseline, Knowledge, Non-Knowledge) [Footnote 9]. In Baseline, only the Truetemp case was presented:

One day Charles was knocked out by a falling rock; as a result his brain was “rewired” so that he is always right whenever he estimates the temperature where he is. Charles is unaware that his brain has been altered in this way. A few weeks later, this brain rewiring leads him to believe that it is 71 degrees in his room. Apart from his estimation, he has no other reasons to think that it is 71 degrees. In fact, it is 71 degrees.

Please indicate to what extent you agree or disagree with the following claim:

“Charles knows that it is 71 degrees in his room.”

In Knowledge, the Truetemp case was presented after a clear case of knowledge:

Karen is a distinguished professor of chemistry. This morning, she read an article in a leading scientific journal that mixing two common floor disinfectants, Cleano Plus and Washaway, will create a poisonous gas that is deadly to humans. In fact, the article is correct: mixing the two products does create a poisonous gas. At noon, Karen sees a janitor mixing Cleano Plus and Washaway and yells to him, “Get away! Mixing those two products creates a poisonous gas!”

Please indicate to what extent you agree or disagree with the following claim:

“Karen knows that mixing these two products creates a poisonous gas.”

In Non-Knowledge, the Truetemp case was preceded by a clear case of non-knowledge:

Dave likes to play a game with flipping a coin. He sometimes gets a “special feeling” that the next flip will come out heads. When he gets this “special feeling,” he is right about half the time, and wrong about half the time. Just before the next flip, Dave gets that “special feeling,” and the feeling leads him to believe that the coin will land heads. He flips the coin, and it does land heads.

Please indicate to what extent you agree or disagree with the following claim:

“Dave knew that the coin was going to land heads.”

The response option was a 5-point Likert item ranging from “strongly agree” (later coded as 5) to “strongly disagree” (later coded as 1) [Footnote 10].

After the Truetemp case, participants saw a comprehension check asking how often Charles is right when he estimates the temperature, with “never right (0%)”/“half of the time (50%)”/“always right (100%)” as response options. Only participants choosing the correct response (“100%”) were included in the analysis (see Footnote 6). On the last page, participants were asked a number of standard demographic questions.

4.3 Results

Figure 4 shows that knowledge attributions were roughly the same across all conditions: \(M_{\text{Baseline}}\) = 3.61 (\(SD_{\text{Baseline}}\) = 1.31), \(M_{\text{Knowledge}}\) = 3.47 (1.32), \(M_{\text{Non-Knowledge}}\) = 3.53 (1.26). According to a one-way ANOVA, there is no significant difference between the conditions, F(2, 1623) = 1.43, p = .239, \(\eta_{p}^{2}\) = .002. The preregistered t-test assessing whether knowledge attributions decrease if Truetemp is presented after a clear case of knowledge (Baseline vs. Knowledge) was barely significant with a very small effect size, t(1082) = 1.66, p = .048 (one-tailed, not adjusted for multiple comparisons), d = 0.10 [Footnote 11] (95% CI [−0.02; 0.22]). The other preregistered t-test, testing whether knowledge attributions increase if Truetemp is presented after a clear case of non-knowledge (Non-Knowledge vs. Baseline), was not significant and went in the direction opposite to the one expected, t(1080.4) = −1.01, p = .311 (two-tailed [Footnote 12], unadjusted), d = −0.06 (95% CI [−0.18; 0.06]).
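As a quick consistency check, the partial eta squared of a one-way ANOVA can be recovered from the F statistic and its degrees of freedom. The sketch below applies the standard conversion to the values reported above; the function name `partial_eta_squared` is an illustrative assumption.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Convert an F statistic into partial eta squared.

    Uses the identity eta_p^2 = (F * df_effect) / (F * df_effect + df_error).
    """
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# Values taken from the text: F(2, 1623) = 1.43
eta = partial_eta_squared(1.43, 2, 1623)
print(round(eta, 3))
```

Rounded to three decimals, this gives .002, in line with the reported effect size.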

Fig. 4

Average levels of agreement (1: strong disagreement; 5: strong agreement) with the claim that Charles knows it is 71° in his room as a function of order of presentation (i.e., Truetemp presented first vs. after a clear case of knowledge vs. after a clear case of non-knowledge). Error bars represent 95% confidence intervals

4.4 Discussion

For the most part, Experiment 1 did not successfully replicate Swain et al.’s findings, in line with Ziółkowski’s (2021) failed replication. Knowledge attribution was similar when the Truetemp case was presented separately and when it followed a clear case of non-knowledge. It decreased in the direction expected when the Truetemp case followed a clear case of knowledge, but the observed effect is very small, in fact too small to be of any philosophical significance (for further discussion, see Sect. 7).

5 Experiment 2: Replication of Wright’s (2010) Experiment 1

Our second experiment is a high-powered replication of the crucial aspects of Experiment 1 in Wright (2010), again focusing on whether knowledge attributions in the Truetemp case are influenced by previously presented contrasting scenarios. As mentioned above, unlike Swain et al. (2008), Wright used a simple yes/no response format, and, in addition to the Knowledge and Non-Knowledge conditions, she also included a Fake-Barn case. Wright found that participants were significantly more likely to attribute knowledge in the Truetemp case when it immediately followed a case of clear non-knowledge (55%) than when it immediately followed either a clear case of knowledge (40%) or the Fake-Barn case (26%); see Fig. 2. In this experiment, we also added a Baseline condition (unlike Wright) and tested the following predictions for knowledge attribution [Footnote 13]: Knowledge and Fake Barn < Baseline < Non-Knowledge.

5.1 Participants

As preregistered (https://osf.io/uqx5g), the experiment was run until valid [Footnote 14] responses from 740 (out of 900) participants (185 in each condition) were collected, which results in 90% power for detecting a 15% difference (65% from 50%, one-tailed) between two conditions at the .05 significance level. Mean age was 36 years, 36% were male, 63% female (3 participants were non-binary or preferred not to indicate their gender). Participants received £0.25 for an estimated 2 min of their time (£7.5/h).
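The sample size for a difference between two proportions can be approximated in a similar way. The sketch below uses Cohen's arcsine-transformed effect size h together with a normal approximation; whether the preregistration used this particular method is an assumption, and the function name `n_per_group_proportions` is hypothetical.

```python
import math
from statistics import NormalDist


def n_per_group_proportions(p1, p2, alpha=0.05, power=0.90, one_tailed=True):
    """Approximate per-group n for comparing two independent proportions.

    Uses Cohen's h (difference of arcsine-transformed proportions) and the
    normal approximation; exact methods may give slightly different values.
    """
    h = abs(2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2)))
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha if one_tailed else 1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / h) ** 2)


# Values taken from the text: 65% vs. 50%, alpha = .05 (one-tailed), power = .90
print(n_per_group_proportions(0.65, 0.50))
```

With these inputs the approximation returns 185 participants per condition, matching the figure reported above.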

5.2 Design, procedure, and materials

Participants were randomly assigned to one of four conditions (Baseline, Knowledge, Non-Knowledge, Fake Barn). Baseline, Knowledge, and Non-Knowledge were identically worded as in Experiment 1, and the wording of the additional Fake-Barn case was as follows:

Suzy looks out the window of her car and sees a barn near the road, and so she comes to believe that there’s a barn near the road. However, Suzy doesn’t realize that the countryside she is driving through is currently being used as the set of a film, and that the set designers have constructed many fake barn facades in this area that look as though they are real barns. In fact, Suzy is looking at the only real barn in the area.

Does Suzy know that there’s a barn near the road? [Yes/No]

In Knowledge, Non-Knowledge, and Fake Barn, participants saw the respective case before the Truetemp case, whereas in Baseline they only saw the Truetemp case [Footnote 15].

5.3 Results

Figure 5 shows the percentage of knowledge attributions in the four conditions. In Baseline, 38.92% [95% CI 31.85%; 46.35%] ascribed knowledge in the Truetemp case. Knowledge ascriptions were marginally higher when a clear case of knowledge was presented first, 41.08% [33.92; 48.54]. Knowledge ascriptions were highest when a clear case of non-knowledge came first, 56.76% [49.29; 64.01], followed by the Fake Barn condition, 51.89% [44.44; 59.28]. Knowledge attributions across the four conditions differ significantly, \(\chi^{2}\)(3, N = 740) = 16.285, p < .001.
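The omnibus test can be illustrated with a short computation of Pearson's chi-square statistic. In the sketch below, the cell counts are not taken from the data set; they are reconstructed, as an assumption, from the reported percentages and the 185 participants per condition (e.g., 38.92% of 185 is 72). The function name `chi_square_statistic` is likewise illustrative.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of condition and response
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Assumed counts [yes, no], reconstructed from the reported percentages
table = [
    [72, 113],  # Baseline (38.92%)
    [76, 109],  # Knowledge (41.08%)
    [105, 80],  # Non-Knowledge (56.76%)
    [96, 89],   # Fake Barn (51.89%)
]
print(round(chi_square_statistic(table), 3))
```

With the reconstructed counts, the statistic comes out at roughly 16.285 on (4 − 1) × (2 − 1) = 3 degrees of freedom, consistent with the value reported above.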

Fig. 5

Percentages of knowledge attribution in the Truetemp case as a function of order of presentation (i.e., Truetemp presented first vs. after a clear case of knowledge vs. after a clear case of non-knowledge vs. after the Fake-Barn case). Error bars represent 95% confidence intervals

The following z-tests for proportions were preregistered to assess whether the rate of knowledge attribution in Baseline differs from each of the other conditions. Testing whether knowledge attributions decrease if the Truetemp case is presented after a clear case of knowledge became redundant since percentages were higher in the Knowledge than in the Baseline condition (\(\chi^{2}\)(1, N = 370) = 0.18, p = .671, two-tailed; unadjusted). Comparing Baseline with Non-Knowledge showed that knowledge attributions were, as predicted, significantly higher when a clear case of non-knowledge was presented first (\(\chi^{2}\)(1, N = 370) = 11.76, p < .001, one-tailed). The proportion of participants ascribing knowledge in the Fake Barn condition was also significantly different from the proportion in Baseline, but the effect was in the opposite direction from what Wright (2010) had found, \(\chi^{2}\)(1, N = 370) = 6.28, p = .0122, two-tailed.
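A two-sample z-test for proportions is equivalent to a chi-square test on the corresponding 2 × 2 table (the chi-square value is the square of z), which is why the pairwise results can be reported as chi-square statistics. The sketch below illustrates this for the Baseline vs. Fake Barn comparison. The counts are again an assumption reconstructed from the reported percentages, no continuity correction is applied, and both function names are illustrative.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


def p_two_tailed_df1(stat):
    """Two-tailed p-value for a chi-square statistic with 1 degree of freedom.

    With df = 1 the statistic is a squared standard normal deviate, so the
    p-value equals erfc(sqrt(stat / 2)).
    """
    return math.erfc(math.sqrt(stat / 2))


# Assumed counts [yes, no]: Baseline 72/113, Fake Barn 96/89
stat = chi_square_2x2(72, 113, 96, 89)
p_value = p_two_tailed_df1(stat)
print(round(stat, 2), round(p_value, 4))
```

For a directional prediction, the one-tailed p-value is half of the two-tailed value, provided the observed difference lies in the predicted direction.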

5.4 Discussion

Experiment 1 provided evidence that reading the Truetemp case after a clear case of knowledge had a small but philosophically trivial effect on knowledge attribution. Experiment 2 provides further support to the idea that knowledge attribution in the Truetemp case is not influenced by a contrast with clear cases of knowledge: In Experiment 2, reading the Truetemp case after a clear case of knowledge had no measurable effect at all, contrary to the original findings of Swain and colleagues.

On the other hand, the results of Experiment 2 are at odds with those of Experiment 1 in surprising ways. In contrast to Experiment 1, the results of Experiment 2 suggest that knowledge attributions in the Truetemp case can be substantially influenced by the preceding scenario, but only in the Non-Knowledge and Fake Barn conditions. As predicted, and also found by Wright (2010), more participants in Experiment 2 ascribed knowledge to the protagonist in the Truetemp case when a clear case of non-knowledge was presented first. Our results suggest that presenting a Fake-Barn case first has a similar effect, although Wright (2010) found that this order of presentation decreases knowledge attributions for Truetemp.

6 Experiment 3: Replication of Wright’s (2010) Experiment 2

Our third experiment is a high-powered replication of the crucial aspects of Experiment 2 in Wright (2010), again focusing on whether knowledge attributions in the Truetemp case can be influenced by preceding scenarios. We thus focus only on a subset of the vignettes used by Wright (the three vignettes in her “set 1” of vignettes; Wright, 2010, p. 496), which included both a different version of the Truetemp case and of the clear case of knowledge, while the clear case of non-knowledge was the same. Participants in Wright’s Experiment 2 were significantly more likely to say that the agent in the Truetemp case knew the temperature when they read it immediately after the clear case of non-knowledge (84%), in contrast to reading Truetemp after the clear case of knowledge (57%); see Fig. 3. Again, we added a baseline condition and tested the following predictions for knowledge attributions: Knowledge < Baseline < Non-Knowledge [Footnote 16].

6.1 Participants

As preregistered (https://osf.io/gdevf), the experiment was run until valid responses of 555 (out of 689) participants (185 in each condition) were collected, which results in 90% power for detecting a 15% difference in proportions (65% to 50%, one-tailed) between two conditions at the .05 significance level. Mean age was 31 years, 35% were male, 64% female, and 2% non-binary. Participants received £0.35 for an estimated 3 min of their time (£7.5/h).

6.2 Design, procedure, and materials

Participants were randomly assigned to one of three conditions (Baseline, Knowledge, Non-Knowledge), with the Non-Knowledge case being the same as in the first two experiments. The now different version of the Truetemp case was as follows (except for some minor differences—e.g., “Charles” is used instead of “Mr. Truetemp”, and “71 degrees” instead of “104 degrees”—this version is a literal quote of Lehrer’s (1990, pp. 163–164) original presentation of the case):

Suppose Charles undergoes brain surgery by an experimental surgeon who invents a small device which is both a very accurate thermometer and a computational device capable of generating thoughts. The device, called a tempucomp, is implanted in Charles’ head so that the very tip of the device, no larger than the head of a pin, sits unnoticed on his scalp and acts as a sensor to transmit information about the temperature to the computational system of his brain. This device, in turn, sends a message to his brain causing him to think of the temperature recorded by the external sensor. Assume that the tempucomp is very reliable, and so his thoughts are correct temperature thoughts. All told, this is a reliable belief-forming process. Charles has no idea that the tempucomp has been inserted in his brain, is only slightly puzzled about why he thinks so obsessively about the temperature, but never checks a thermometer to determine whether these thoughts about the temperature are correct. He accepts them unreflectively, another effect of the tempucomp. Thus, at a particular moment in time he thinks and accepts that the temperature is 71 degrees – and it is, in fact, 71 degrees.

Does Charles know that it is 71 degrees at this particular moment? [Yes/No]

The clear case of Knowledge was as follows:

Pat walks into her kitchen during the day when the lighting was good and there was nothing interfering with her vision. She sees a red apple sitting on the counter, where she had left it after buying it at the grocery store the day before. As she leaves home, she tells her son, Joe, that there is a red apple sitting on the kitchen counter and to make sure to pack it with his lunch.

Does Pat know that there is a red apple sitting on the kitchen counter? [Yes/No]

In Knowledge and Non-Knowledge, participants first saw the respective case before they read the Truetemp case, whereas in Baseline, they only saw the Truetemp case.

6.3 Results

Figure 6 shows the proportion of knowledge attributions in the three conditions. In Baseline, 60.54% [95% CI 53.10%; 67.63%] of the participants ascribed knowledge in the Truetemp case. Knowledge ascriptions for Truetemp were marginally higher when a clear case of knowledge was previously presented, 64.86% [57.52; 71.73], and they were highest when a clear case of non-knowledge was presented first, 68.11% [60.87; 74.75]. Knowledge attributions across the three conditions do not differ significantly, \(\chi^{2}\)(2, N = 555) = 2.33, p = .312.

Fig. 6

Percentages of knowledge attribution in the Truetemp case as a function of order of presentation (i.e., Truetemp presented first vs. after a clear case of knowledge vs. after a clear case of non-knowledge). Error bars represent 95% confidence intervals

We had also preregistered z-proportion tests to assess whether the rate of knowledge attributions in Baseline differs from each of the two other conditions. However, testing whether knowledge attribution decreases if the Truetemp case is presented after a clear case of knowledge became redundant since percentages were higher in the Knowledge condition than in the Baseline condition (\(\chi^{2}\)(1, N = 370) = 0.74, p = .39, two-tailed; unadjusted). Comparing Baseline with Non-Knowledge failed to show the predicted effect, with knowledge attributions not being significantly higher when a clear case of non-knowledge was presented first, \(\chi^{2}\)(1, N = 370) = 2.308, p = .064 (one-tailed).

6.4 Discussion

Experiment 3 provides further evidence that reading the Truetemp case after a clear case of knowledge does not influence knowledge attribution in the Truetemp case to any significant degree. In contrast to our own Experiment 2, and pace the relevant portion of Wright’s (2010) Experiment 2 that we replicated here, we did not find any effect of previously presented scenarios on a version of the Truetemp case that was near-identical to Lehrer’s original presentation, in which the protagonist Mr. Truetemp acquired his ability to correctly assess the temperature via a neurosurgical operation.

7 No philosophical p-hacking

Before we discuss some general conclusions suggested by our findings, we would like to emphasize one key feature of our replications, namely, that they are extremely high-powered, in fact to a degree that is unusual even in the age of online studies, and which by far surpasses the average experimental philosophy study. Therefore, everything else being equal, our results have a considerably higher evidential value than previous findings, especially with respect to the metaphilosophical conclusions of proponents of the negative program in experimental philosophy.

In Experiments 1 and 3, the evidence for a philosophically significant instability of knowledge attributions in Truetemp cases was virtually non-existent. Experiment 1 used Swain et al.’s original design. The barely significant (with a one-sided test) and very small effect that we found when the Truetemp case followed a clear case of knowledge hardly warrants any metaphilosophical attention, even if it turned out to be robust, which would require further investigation to establish. In any case, such tiny effects can hardly be used to challenge the reliability of knowledge attributions in Truetemp cases, let alone the evidential standing of the method of cases more generally, as was boldly claimed by Swain et al. (2008) [Footnote 17].

Perhaps Swain et al. will respond that small effects can compound to result in philosophically significant differences (for a related argument, see Machery, 2017, p. 108 and the exchange between Alexander & Weinberg, 2020 and Machery, 2020b). The additive effect of small effects is an important aspect of the metaphilosophical debate about the method of cases. But, first, more work is needed before we can be confident that this effect is real (remember, it was only significant in a one-sided test); second, appealing to the composition of small effects would be a radical departure from the argument in Swain et al.; finally, even if the tiny effect observed here could be combined additively with other small effects, the persuasive weight of the argument would then mainly be borne by these other effects, given how small the effect observed in our Experiment 1 really is.

Experiment 3 replicated the Truetemp-involving portion of Wright’s (2010) second experiment, but we observed no order effect whatsoever, either after a clear case of knowledge or after a clear case of non-knowledge. This is especially remarkable given that the Truetemp vignette used both by Wright in her second experiment and in our replication is an almost verbatim quote of Keith Lehrer’s original presentation of the case (Lehrer, 1990, pp. 163–164). In this connection, Swain et al. (2008) explicitly suggest that the alleged order effects with the Truetemp case mirror certain aspects of Lehrer’s own presentation, and that Lehrer had thus inadvertently taken advantage of the influence of context to convince his readers of the unacceptability of reliabilism (2008, pp. 145–146; emphasis in the original):

In the section immediately preceding presentation of the Truetemp Case, Lehrer discusses paradigm cases of knowledge: perceptual knowledge, knowledge arrived at through communication with others, and knowledge of mathematics. We are not suggesting that Lehrer intentionally manipulated any evidence appealed to as part of his case against epistemological reliabilism; rather, we are concerned that philosophers might be manipulating their own results without even being aware that such manipulation is taking place. Our findings suggest that Lehrer’s readers’ unwillingness to attribute knowledge to Mr. Truetemp may be influenced by the preceding cases; if the Truetemp Case were presented without those preceding cases, readers might be less confident about denying that Mr. Truetemp has knowledge.

Using Lehrer’s original vignette instead of the slightly different and simpler vignette of Swain et al., our Experiment 3 shows that knowledge attribution appears to be contextually affected neither by a clear case of knowledge nor by a clear case of non-knowledge. We thus take our results to fully exonerate Lehrer of unintentional “philosophical p-hacking” (i.e., of capitalizing on questionable rhetorical procedures that make one’s claim more convincing).

Experiment 2, however, successfully replicated Wright’s (2010) Experiment 1. It appears to provide evidence that is partly inconsistent with the results of our Experiments 1 and 3. Knowledge attribution in the Truetemp case was substantially affected by previously reading a clear case of non-knowledge (as was also the case in Wright, 2010, but not in our Experiment 3) and a Fake-Barn case (the latter in striking contrast to Wright’s own 2010 results). Thus, we largely replicated Wright’s original difference between the Knowledge (41%) and Non-Knowledge (57%) conditions, but not her Fake-Barn results: while the knowledge ratings for Truetemp were quite low (26%) after the Fake-Barn case in Wright’s original study, they were comparatively high in our replication (52%). On the other hand, in line with our Experiments 1 and 3, reading a clear case of knowledge beforehand made no philosophically relevant difference.

What to make of these results? In light of the converging results of our Experiment 1, our Experiment 3, and Ziółkowski’s (2021) three replication experiments, we tentatively lean toward thinking that the results of Experiment 2 are false positives. However, one could perhaps conclude more carefully that either the results of Experiment 2 are false positives, or the results of Experiment 3 are false negatives. Even if this second interpretation were correct, the ensuing uncertainty about whether Truetemp intuitions are stable or unstable would make it inappropriate to simply continue appealing to Swain et al. (2008) and Wright (2010) in metaphilosophical debates. Finally, our Experiment 2 might indicate that order effects on Truetemp intuitions are heavily dependent on intricate features of the experimental design: the nature of the scales, the exact wording of the vignette, the other vignettes that happened to be used, etc. Perhaps Swain and colleagues will think this interpretation of our results reinforces their original point, but we doubt it. Even if the interestingly large and significant result for the comparison of Non-Knowledge with Baseline that we observed in Experiment 2 were a robust effect, which would again require further studies to establish, the metaphilosophical significance of the result would be limited. For this effect would only occur in Wright’s yes/no format and with Swain et al.’s version of the Truetemp vignette, but not when, e.g., Swain et al.’s 5-point Likert scale is used, as in Experiment 1. These are conditions that have little to do with the typical use of the relevant intuitions in philosophy, including Lehrer’s (1990) seminal Theory of Knowledge.

So, all in all, what do our experiments reveal about the instability of Truetemp intuitions? At the very least, they show that we should not continue to rely on Swain et al. (2008) and Wright (2010) in metaphilosophical discussions.Footnote 18 Second, they show that considering a clear instance of knowledge before reading the Truetemp case doesn’t impact knowledge attribution, and that Lehrer also did not inadvertently take advantage of the instability of Truetemp intuitions. Third, we believe that the weight of the evidence suggests that Truetemp intuitions are in general not significantly influenced by their textual context, although we concede that Experiment 2 might indicate that they are influenced in a few situations. Of course, other manipulations of philosophical intuitions, including by order of presentation, might turn out to be more robust (e.g., the framing of the Gettier case in Machery et al., 2018; the order effects with moral cases in Wiegmann et al., 2012, 2020), although recent work suggests that many philosophical intuitions are not much influenced, if at all, by slight manipulations (e.g., Cova et al., 2021; Horvath & Wiegmann, 2022; Kneer et al., 2021).

Swain and colleagues could perhaps emphasize the differences between our replications and their own study or Wright’s. After all, as pointed out above, our replications are conceptual, not exact replications. It might well be that an exact replication would have obtained results more similar to Swain et al.’s and Wright’s, but we remain skeptical of this possibility, given that the overall pattern of Truetemp-related results tells against the problematic instability of Truetemp intuitions. At the very least, we should be agnostic about the potential outcome of an exact replication, and we should therefore refrain from further investigating the methodological implications of the alleged instability of Truetemp intuitions for the method of cases. What’s more, these implications are bound to be very limited at best if the effect manifests itself only under very specific experimental conditions that have little to do with our everyday philosophical practice. In contrast, effects that can be robustly detected under a variety of experimental conditions, such as order effects with trolley-style moral scenarios (see, e.g., Wiegmann et al., 2012, 2020), should indeed be a matter of concern for the metaphilosophy of the method of cases. So, we resolutely reject any kind of general skepticism about the metaphilosophical relevance of experimental findings. Rather, we merely insist that just one or two studies are not enough to establish an empirical effect, and that only robustly replicable findings should inform our metaphilosophical and other theorizing. Order effects on Truetemp cases, it seems, simply do not make the cut here.

8 Conclusion

None of Swain et al.’s (2008) predictions concerning order effects with Truetemp cases could be consistently and robustly replicated in our three experiments, and it is thus at best unclear whether Truetemp intuitions are in fact unstable. So, if proponents of the negative program in experimental philosophy still want to use order effects to challenge the reliability of philosophical case judgments, they would be well advised to look elsewhere instead (e.g., Liao et al., 2012; Machery et al., 2018; Schwitzgebel & Cushman, 2012, 2015; Wiegmann et al., 2012, 2020). In any case, given the more robust empirical evidence that we presented in this paper, the metaphilosophical flurry created by Swain et al.’s (2008) and Wright’s (2010) influential studies looks like mere alarmism in hindsight.