1 Introduction

ChatGPT was released with great fanfare in November 2022. OpenAI’s latest language model appeared to be powerful and almost magical, generating news articles, writing poetry, and explaining arcane concepts instantly and on demand. But a week later, the coding Q&A site Stack Overflow banned all answers produced by the model. “The primary problem,” explained the staff, “is that while the answers which ChatGPT produces have a high rate of being incorrect, they typically look like they might be good and the answers are very easy to produce” (Vincent 2022). For a site aiming to provide correct answers to coding problems, the issue was clear: the AI model was “substantially harmful.”

As AI technologies are rolled out into healthcare, academia, human resources, law, and a multitude of other domains, they become de-facto arbiters of truth. Researchers have suggested that vulnerabilities in these models could be deployed by malicious actors to produce misinformation rapidly and at scale (Dhanjani 2021; Weidinger et al. 2022). But more concerning is the everyday impact of this dependence on automated truth claims. For instance, incorrect advice on medical symptoms and drugs can lead to patient harm or death (Bickmore et al. 2018), with one medical chatbot based on GPT-3 already advising a patient to kill themselves (Quach 2020). Whether in medicine or other domains, belief in the often-plausible claims of these AI oracles can lead to unwarranted trust in questionable models (Passi and Vorvoreanu 2022). From a business perspective, truth becomes a product feature that increases trust and uptake, with companies investing massive time and capital into the moderation and curation of “truth” (Seetharaman 2016; Cueva et al. 2020). These issues proliferate with AI’s deployment across industries and social fields, testifying to the stakes of truth in AI systems.

But while AI systems are invested with veracity and granted forms of oracular authority, the status of their “truths” is highly contested. At stake is not simply how accurate these systems are—for instance, scoring highly on medical or legal exams—but what “accuracy” even means. For a concept that ought to underpin all knowledge, “truth” remains elusive and subject to historical determination. It is not simply the case that truths change, as for instance when a new conceptual paradigm is ushered in to explain previously inexplicable scientific observations (Kuhn 2012). It is that the standards of truth themselves have a history. Probabilistic approaches to truth are a comparatively recent development in epistemology, arising alongside attempts in the eighteenth century to quantify chance (Hacking 1990). These developments, as Hacking and other historians and philosophers of science note, assign probabilities to “facts,” making truth conditional and relative. For Quine (1980), such facts must also slot into a larger theoretical framework that can coherently organize them—and this framework is not given by the facts themselves. As we discuss below, the emergence of coherentist and other programs of truth in twentieth-century epistemology further complicates demands for computational systems to be “truthful.”

AI rhetoric largely attempts to shrug off this history and defer the messiness of truth. Osterlind (2019) suggests that quantitative methods reveal unexpected patterns, challenging old-fashioned notions of fact and accuracy based on biased human assumptions. And Maruyama (2021) concludes that truth in data science may be regarded as “post-truth,” fundamentally different from truth in traditional science. Against the computational oracularism and relativism implied by such remarks, we argue that truth in AI is not just technical but remains embedded within essentially agonistic social, cultural, and political relations, where particular norms and values are debated and contested, even if such conflicts remain sublimated within the smooth discursive patterns of language model outputs. Rather than a radical break with “old-fashioned” facts and truths, machine learning and data science continue a long history of shaping these concepts in particular ways through scientific inquiry and economic activity (Poovey 1998; Deringer 2017).

And yet if the sociocultural, epistemological, and historical matters, so does the technical. Translating truth theories into actionable architectures and processes updates them in significant ways (Hong 2020; Birhane 2022). These disparate sociotechnical forces coalesce into a final AI model which purports to tell the truth—and in doing so, our understanding of “truth” is remade. “The ideal of truth is a fallacy for semantic interpretation and needs to be changed,” suggested two AI researchers (Aroyo and Welty 2015). This article is interested less in truth as a function of AI—how accurate a given model is according to some set of criteria. Rather, it focuses on what the emergence of AI language models means for the relation between truth and language.

The first section discusses the contested nature of truth and the problems that it represents within AI models. The second section builds on these ideas by examining InstructGPT, an important large language model, highlighting the disparate approaches to evaluating and producing truth embedded in its social and technical layers. The third section discusses how the model synthesizes these disparate approaches into a functional machine that can generate truth claims on demand, a dynamic we term the operationalization of truth. The fourth section shows how these same logics and inconsistencies play out in InstructGPT’s successor, ChatGPT, reiterating once more that truth is a non-trivial problem. And the fifth section suggests that enriching sociality and thickening “reality” are two promising vectors for enhancing the truth-evaluating capacities of future language models. We conclude by turning to Foucault’s Discourse and Truth (2019) to reflect on the role that these truth machines should play. If truth claims emerge from a certain arrangement of social actors and associated expectations, then these questions can be posed about language models as much as human interlocutors: what is the truth we are looking for? Risking paradox, we could ask further: what is AI’s true truth?

2 AI’s struggle for truth

The de-facto understanding of truth in AI models is centered around “ground truth.” This is often referred to as the “fundamental truth” underpinning testing and training data or the “reality” that a developer wants to measure their model against. In this way, ground truth provides a sense of epistemic stability, an unmediated set of facts drawn from objective observation (Gil-Fournier and Parikka 2021). Truth, according to this paradigm, is straightforward and even mathematically calculable: the closer the supervised training comes to the ground truth, the more accurate or “truthful” it is.

However, even AI insiders stress that this clear-cut relationship is deceptive: “objective” truth is always subjective. As Bowker (2006) asserted, there is no such thing as raw data; data must be carefully cooked. Cooking means defining how reality is conceptualized, how the problem is defined, and what constitutes an ideal solution (Kozyrkov 2022). What is a “salient” feature and what is not? What counts as “signal” and what gets dismissed as “noise”? As Jaton (2021) shows in his case-study of ground truth construction, these are surprisingly difficult questions with major impacts: data selection is contested, labeling is subjective, and small coding choices make big differences. This translation of “reality” into data points is made by human “cooks,” and in this sense, “the designer of a system holds the power to decide what the truth of the world will be as defined by a training set” (Crawford 2022). “Telling the truth” is immediately complicated by what can be considered the pragmatics of human discourse: knowing how much of the truth to tell, knowing what to reveal of the truth behind the truth (the methods and techniques by which the truth is known), anticipating the outcomes of truths, and so on. Truth quickly becomes messy in practice.

Some have suggested that truth is the Achilles heel of current AI models, particularly large language models, exposing their weakness in evaluating and reasoning. AI models have enjoyed phenomenal success in the last decade, both in terms of funding and capabilities (Bryson 2019). But that success has largely been tied to scale: models with billions of parameters that ingest terabytes of text or other information. “Success” is achieved by mechanically replicating an underlying dataset in a probabilistic fashion, with enough randomness to suggest agency but still completely determined by the reproduction of language patterns in that data. Bender et al. (2021) thus argue that large language models are essentially “stochastic parrots:” they excel at mimicking human language and intelligence but have zero understanding of what these words and concepts actually mean.

One byproduct of this “parroting” of probabilistic patterns is that large language models reproduce common misconceptions. The more frequently a claim appears in the dataset, the higher the likelihood that it will be repeated as an answer, a phenomenon known as “common token bias.” One study found that a model often predicted common entities like “America” as a response when the actual answer (Namibia) was a rare entity in the training data (Zhao et al. 2021). This has a dangerous double effect. The first is veridical: language models can suggest that popular myths and urban legends are the “correct” answer. As these models proliferate into essay generators, legal reports, and journalism articles, the potential for reinforcing misinformation is significant (Kreps et al. 2022; Danry et al. 2022). The second is colonial: language models can reproduce certain historical, racial, and cultural biases, because these are the epistemic foundations that they have been trained on. AI models can silently privilege particular understandings of “truth” (patriarchal, Western, English-speaking, Eurocentric) while further marginalizing other forms of knowledge (feminist, Indigenous, drawn from the Global South).

In these cases, large language models repeat fallacies of discourse long identified in classical philosophy: reproducing what is said most often, and overlooking the partiality of their position and perspective. Common token bias showcases the limits of consensus as a condition of truth. The production pipeline of commercially oriented “foundational models,” trained on massive amounts of text from the internet, only exacerbates this. If enough people believe something and post enough material on it, it will be reproduced. As Singleton (2020) argues, due to the “unsupervised nature of many truth discovery algorithms, there is a risk that they simply find consensus amongst sources as opposed to the truth.”

Such problems cannot be solved by simply adding more data—indeed one study suggests that the largest models are generally the least truthful (Lin et al. 2022). More data does not in itself introduce critique into these models. So while language models, as we will show, aim to become “more truthful” through human feedback, this is merely a more desirable alignment to ground truths that exist on a flat plane of unquestioned accuracy. Lost here is any sense of the agonistic nature of knowledge: the epistemic formations and sociocultural contexts under which certain statements come into being as “facts” and others do not. The notion that different parties may disagree, that truth is historical and contextual, and that veracity may be arrived at in many different ways, drops away. Jagged disagreements, antagonisms, and dissensus essential to the formation of knowledge are smoothed out. Instead, the AI language model carries out an oracular communicative function, issuing indisputable “truths.”

Any discussion of truth in AI language models should note the importance of the connectionist paradigm. Connectionism assumes that large informatic networks can simulate human biology and neurology to recognize patterns in data. By training on massive archives of existing data, networks can accurately predict how to process novel input. However, as common token bias illustrates, connectionism is epistemically flat—there is no overarching evaluator to determine fact from fiction, nor any meta-level understanding of the world to measure claims against. This leads to a key limitation: connectionist models cannot employ the correspondence model of truth, where a statement is true if it corresponds closely with reality. A predictive model may often hit upon truths, yet ultimately has no procedure for verification. It is a “black box” not only in the sense of being inscrutable, but also because it does not “know” of any reality outside of itself. Just as a human cannot look inside it to understand its logic, the model also cannot look out. To paraphrase Wittgenstein, the limits of data are the limits of its world. As one example, a machine trained only on European texts prior to 1500 would maintain a geocentric model of the universe, never developing a Copernican understanding or seeking Galilean observations. In this sense, machine “learning” is a misnomer: machines pattern match to data, but cannot develop broader theories or absorb new counterfactual evidence to test these patterns.

Our point here is not to suggest that a model with better consensus or correspondence would “solve” truth, but to highlight truth as a socially and epistemically complex issue that inherently defies any single technical definition or “resolution.” Indeed, the jumble of terms in AI discourse around truth mirrors this contestation and confusion. Some authors speak of “factual” and “counterfactual” associations (Meng et al. 2022); for others, it seems obvious that truthfulness equates to “accuracy” (Zhang et al. 2019); and others still focus on the reproduction of misconceptions which can deceive human users (Lin et al. 2022). Here we see obvious incompatibilities between terms: something may be counterfactual, an outright lie, but be “accurate” insofar as it lines up perfectly with a training set. Similarly, a misconception—like our example above—may have been established because of a consensus understanding of truth (many hold it to be true), but fails when subjected to a correspondence test (it does not line up with reality). Truth-related terms are thus gateways into fundamentally different approaches to veracity, each with their own philosophies, tests, and outcomes. To show how truth is shaped in specific ways, we now turn to a specific large language model.

3 InstructGPT’s anatomy of truth

To explore the shaping of truth in AI systems, this section uses OpenAI’s InstructGPT as a case-study. InstructGPT is a large language model derived from GPT-3 (Ouyang et al. 2022), and is similar to the more famous ChatGPT—both released in 2022. Trained on terabytes of text from the internet and other sources, these models gradually “learn” how to replicate their source material. Given an initial phrase as a prompt (“Hello, how are you?”), the model will continue that prompt in the most natural way (“I am doing well, thank you for asking”). Unlike the output of earlier generations of bots, such text is in many cases indistinguishable from humanly authored writing.

Already, we can start to see how the “truth” of these responses, trained as they are on massive caches of internet text, is socially inflected. Yet, crucially for our analysis, InstructGPT folds in several more layers of sociality in ways that are important but not at all apparent. A process called Reinforcement Learning from Human Feedback (RLHF) aims to improve the core GPT model, making it more helpful, more truthful, and less harmful. The “ground truth” of fidelity to the original training data is further massaged by human evaluators and their preferences, shifting the “ground” upon which future predictions are made. In the sections below, we provide a more detailed “anatomy of AI” (Crawford 2022), drawing on OpenAI’s own technical materials, online commentary, and our own experimentation, to highlight how socially derived content and social feedback mechanisms shape the model’s version of truth.

3.1 Pre-training

The baseline training set for InstructGPT draws from datasets like Common Crawl and WebText2 (Brown et al. 2020). These datasets contain text scraped from across the internet, including noisy, outdated, and biased information. While this raises obvious questions about the veracity of training data (Berti-Équille and Borge-Holthoefer 2015), we are interested here in highlighting how socially generated content problematizes any absolute notion of veracity. The internet is a socially constructed artifact (Hrynyshyn 2008; Flanagin et al. 2010), emerging from the disparate thoughts and ideas of individuals, communities, and companies.

This sociality is epitomized most clearly by the fact that both datasets draw on the news aggregator and online community Reddit. The Common Crawl corpus includes pages crawled from Reddit itself, while the WebText2 corpus “scrapes” the text from URLs which have been posted to Reddit. Reddit contains thousands of groups devoted to niche topics, hobbies, celebrities, religious branches, and political ideologies—with posts in each community ranging from news stories to humor, confessionals, and fan fiction. Each of these social micro-worlds can create discourses of internally coherent “truth” that are true only in relation to themselves (Sawyer 2018). Rather than any singular, definitive understanding, then, this socially generated text contains many different “truths.” By assigning weightings and probabilities, the language model is able to stitch together these often-conflicting truths.

3.2 Prompting as further training

As we have noted, one of InstructGPT’s key points of difference from the baseline GPT-3 model is that its responses have been “improved.” This process, initiated by the development team, draws from a subselection of actual prompts from real-world users (Ouyang et al. 2022). The model’s responses to these prompts are ranked by humans (as the next section will discuss) and then used to fine-tune the model. Prompts from customers are not simply computed and delivered, but instead become a form of feedback that is integrated back into the active development of the large language model.

Such prompts may themselves be toxic or biased or problematic, as in the case of Microsoft’s Tay, an AI chatbot which developed racist tendencies after only one day of user prompts (Vincent 2016). Yet even without overt bigotry, every prompt is based on the specific ideologies of users, their social and cultural background, and their inherent and underlying prejudices (Robertson et al. 2022). For instance, GPT-3 and InstructGPT employed a sign-up and waiting list to provide access—and only those aware of this technology would have known to register for access. Once a user had access, their interactions were limited in certain ways; more extensive access required payment via a credit card. And while the model “playground” offered a web interface, knowledge of the model, how it could be prompted, and how certain parameters (e.g. “temperature”) shape its responses all required technical literacy.

Based on all these gatekeeping and influencing mechanisms, we would expect that GPT-3’s public, particularly early on, was skewed towards early-adopters, hobbyists, developers, and entrepreneurs looking to leverage the model. This tech-forward or tech-literate status requires a certain kind of financial, cultural, and educational privilege, and has a certain kind of intellectual culture (Daub 2020)—and all of this has shaped the kind of “real-world” prompts that dominate the model’s fine-tuning process. Even with the much wider availability of ChatGPT, a similar level of elite “prompt priming” will likely skew the model’s future specialization.

3.3 Labeling

In InstructGPT, the prompts discussed above are then evaluated by human labelers. Labelers are presented with a prompt and a selection of sample responses, and then asked to label the best response. The aim here is not only to increase the “truthfulness,” accuracy, and relevance of responses, but also to reduce discrimination and bias, and mitigate potential harms (Ouyang et al. 2022). InstructGPT used 40 English-speaking workers to carry out this labeling. Once labeling is complete, the model is fine-tuned based on these human inputs. The aim of this RLHF process is a “better” model—where better is typically defined as being more helpful, more truthful, and more harmless (see Askell et al. 2021; Bai et al. 2022). Indeed, attaining this trinity of helpful, truthful, and harmless was an instruction explicitly given to the model’s labelers by the development team (OpenAI 2022a).
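
To make this fine-tuning mechanism concrete, the sketch below illustrates—in simplified form, and as our own reconstruction rather than OpenAI’s code—how ranked human preferences are typically converted into a reward signal: a reward model is trained so that responses labelers preferred score higher than those they rejected, and the language model is subsequently optimized against this learned reward (Ouyang et al. 2022). The toy network, token ids, and hyperparameters here are placeholders.

```python
# Minimal sketch of the pairwise preference step at the heart of RLHF
# (our illustration; in practice the reward model is a fine-tuned language
# model, not the toy network used here).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyRewardModel(nn.Module):
    """Scores a tokenized response with a single scalar 'preference' value."""
    def __init__(self, vocab_size=1000, dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, token_ids):
        pooled = self.embed(token_ids).mean(dim=1)  # mean-pool token embeddings
        return self.score(pooled).squeeze(-1)       # scalar reward per response

model = ToyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Hypothetical labeled pair: for one prompt, the labeler preferred response A
# over response B (token ids are random placeholders).
preferred = torch.randint(0, 1000, (1, 16))
rejected = torch.randint(0, 1000, (1, 16))

for step in range(100):
    r_pref, r_rej = model(preferred), model(rejected)
    # Pairwise ranking loss: push the preferred response's score above the
    # rejected one's. The labeler's judgment thereby becomes the "ground truth"
    # that the subsequent policy optimization is measured against.
    loss = -F.logsigmoid(r_pref - r_rej).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```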

Guidance around labeling is at once dogmatic and ambiguous. While some labeling tasks come with 50-page manuals (Dzieza 2023), these endless examples fail to meaningfully define the underlying epistemic difference between categories, items, or objects. When liminal categories or edge-cases are inevitably encountered, workers fall back on their own judgment, making a best guess. This vagueness is a pervasive and longstanding issue in clickwork (Kittur 2013), with jobs on Amazon Mechanical Turk, for instance, plagued by unclear instructions and unfamiliar terminology (Brewer et al. 2016). Tasks are baffling, yet work is rejected if deemed incorrect, leaving workers to wonder what they did wrong or what they should have done differently (Strunk et al. 2022).

While RLHF leverages human insight, this is attended by all-too-human subjectivity. van der Lee et al. (2021) worry that annotators will engage in “satisficing,” succumbing to tedium and fatigue and taking shortcuts to arrive at low-quality answers. However, beyond the unevenness of human performance, we want to stress the more subtle epistemic unevenness of this heterogeneous labor pool and its influence on the task of determining truthfulness. Workers with highly divergent upbringings, education, experiences, and sociocultural contexts will naturally give highly divergent answers about the “best” response. Indeed, InstructGPT’s production notes admit that there is a significant degree of disagreement in this labeling stage (Ouyang et al. 2022).

Such divergence may only be exacerbated by the “clickwork” nature of this task. While the precise details of InstructGPT’s 40 labelers are undisclosed, investigative journalism has uncovered that OpenAI used low-paid Kenyan labelers to produce a toxicity classifier (Perrigo 2023), and based on labeling instructions mentioned by workers, it is highly likely that the same setup was used to train InstructGPT and ChatGPT (Dzieza 2023). This exploitative regime is all too familiar, echoing the microtasks, content moderation, and data cleaning carried out by pools of underpaid, precarious workers, often located in the “Global South,” and often with women, immigrants, and people of color factoring heavily (Roberts 2019; Gray and Suri 2019; Jones 2021). In effect, this is a kind of truth factory, an assembly line of invisible labor used to mitigate errors and massage claims until they match a desired definition of veracity. This marginalized and highly heterogeneous labor force may disagree in significant ways with the values upheld by Global North technology companies. Labelers have their own ideas of what constitutes truth.

3.4 Deployment

InstructGPT is deployed in various domains and for disparate use-cases—and these influence the way claims are taken up, considered, and applied. One manifestation of this takes the form of filtering. At least for InstructGPT (though other language models such as LaMDA appear to follow similar approaches), interaction with the model is mediated by filters on inputs and outputs. For example, potentially harmful content generated by the model is flagged as such in OpenAI’s Playground environment. Another manifestation of this occurs when companies “extend” the model for use in their own applications such as a corporate chatbot or a copy-writer. Often this takes the form of a fine-tuned model that is designed to be an “expert” in a particular subject area (legal advice, medical suggestions), both narrowing and further articulating certain “knowledge.” This extending work thus shapes truth claims in particular ways, constraining model parameters, conditioning inputs, specifying prompts, and filtering outputs in line with specific applications and services (Fig. 1).

Fig. 1 Layers of sociality embedded in a language model

Such deployment has clear impacts on the ways in which truth claims are taken up, evaluated, and applied by human users. An AI-driven copy-writer, for instance, is often framed as an augmentation of human labor, developing a rough first draft in a matter of seconds that then gets fact checked, revised, and refined by a human writer (Rogenmoser 2022). An AI-driven scientific tool, by contrast, may be framed as a shortcut for rapidly summarizing academic research and quickly generating accurate scientific reports (Heaven 2022).

4 Operationalizing truth

Together, these aspects highlight how AI truth claims are socially shaped. Layers of social feedback generate a specific version of “truth” influenced by scraped text, prompts from particular users, value-judgements from precarious laborers, deployment decisions by developers building services atop the model, and finally the human user who takes up this model in certain ways, evaluating its claims and using them in their everyday activities. Training a language model from massive amounts of internet content introduces fact and fiction, misconception and myth, bias and prejudice, as many studies have investigated (Zou and Schiebinger 2018; Roselli et al. 2019; Leavy et al. 2020). But less known and researched, particularly in the humanities and social sciences, are the steps that come after this point: feedback, labeling, ranking, fine-tuning, iterating, and so on.

The approach to truth in these post-training improvements can be understood as a direct response to the “failings” of former models. In a highly cited article, Aroyo and Welty (2015) explicitly took aim at conventional understandings of truth, which they saw as increasingly irrelevant in an AI-driven world. Their paper focused on human annotation in AI models—workers labeling data to improve its truthfulness. According to the duo, seven myths continued to pervade this process: (1) it is assumed there is only one truth; (2) disagreement between annotators is avoided; (3) disagreement is “solved” by adding more instructions; (4) only one person is used to annotate; (5) experts are privileged over “normal” people; (6) examples are viewed monolithically; and (7) labeling is seen as a “one-and-done” process (Aroyo and Welty 2015). OpenAI and others push back against these myths: examples are drawn from real-world users, given to non-experts with limited instructions, who label them in an iterative process that allows for disagreement. These post-training steps are significant in that they introduce novel forms of value construction, evaluation, and decision making, further articulating the model in powerful and wide-reaching ways.

InstructGPT thus showcases how technical processes come together in powerful ways to generate truth. However, far from being entirely novel, this technology in many ways rehashes ancient debates, drawing on four classical approaches to truth: consensus argues that what is true is what everyone agrees to be true; correspondence asserts that truth is what corresponds to reality; coherence suggests that something is true when it can be incorporated into a wider system of truths; and pragmatism insists that something is true if it has a useful application in the world (Chin 2022). Of course, these textbook labels cluster together a diverse array of theories and elide some of the inconsistencies between theorists and approaches (LePore 1989, 336). However, they are widely adopted in both mainstream and academic scholarship, providing a kind of shorthand for different approaches. They function here in the same way, providing a springboard to discuss truth and its sociotechnical construction in the context of AI.

To these four “classic” theories we could add a fifth, the social construction theory of truth (Kvale 1995; Gergen 2015)—particularly relevant given the social circuits and networks embedded in these language models. According to this approach, truth is made rather than discovered, coaxed into being via a process situated in a dense network of communities, institutions, relations, and sociocultural norms (Latour and Woolgar 2013). Knowledge is a collective good, asserts Shapin (1995), and our reliance on the testimony of others to determine truth is ineradicable. The philosopher Donald Davidson (2001) stressed that language involved a three-way communication between two speakers and a common world, a situation he termed “triangulation.” By inhabiting a world and observing it together, social agents can come to a consensus about the meaning of a concept, object, or event. In this sense, truth—and the performative language exchanges underpinning it—is inherently social. Though related to consensus theory, social construction also acknowledges that the formation of truth is bound to social relations of power: in other words, “consensus” can be coerced by powerful actors and systems. In place of a flattened social world of equally contributive agents, social construction acknowledges that hierarchical structures, discriminatory conditions, and discursive frameworks determine what sorts of statements can be considered “true.”

How might these truth theories map to the anatomy of InstructGPT discussed above? Training could first be understood as enacting a consensus-driven theory of truth. Whatever statements predominate in the underlying corpus (with their respective biases and weights) reverberate through the model’s own predictions. In this sense, something is true if it appears many times in the training data. Similarly, language model outputs are commonly evaluated in terms of a metric called perplexity, a mathematical property that describes the level of surprise in the prediction of a word. Low perplexity indicates high confidence, which at a sentential level suggests strong coherence. For example, in one test we asked InstructGPT to predict the next word of a classic syllogism: “All men are mortal. Socrates is a man. Therefore, Socrates is…”. The system replied with the word “mortal” at a probability of 99.12%. In epistemological terms, we would say this response coheres strongly with the prompt.
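
The probability the model reports can be read directly from its output distribution. As a brief, hedged illustration, the sketch below shows how such next-token probabilities might be queried; it assumes OpenAI’s legacy Python client (<1.0), a completion-style model (“text-davinci-002”), and the logprobs parameter of the completions endpoint, all of which may have since been changed or deprecated. Perplexity is then simply the exponential of the negative mean log-probability.

```python
# Sketch of querying next-token probabilities for the syllogism prompt,
# assuming the legacy openai Python client (<1.0) and a completion-style model.
import math
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

prompt = "All men are mortal. Socrates is a man. Therefore, Socrates is"

response = openai.Completion.create(
    model="text-davinci-002",  # an InstructGPT-series model (assumption)
    prompt=prompt,
    max_tokens=1,
    temperature=0,
    logprobs=5,                # return log-probabilities for the top 5 candidate tokens
)

top_logprobs = response["choices"][0]["logprobs"]["top_logprobs"][0]
for token, logprob in top_logprobs.items():
    print(f"{token!r}: {math.exp(logprob):.2%}")

# Perplexity is exp(-mean log-probability); for a single near-certain token
# such as " mortal", it approaches 1, i.e. minimal "surprise".
```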

InstructGPT’s prompting and labeling processes introduce other approaches to truth. For instance, the injunction to produce a model that is more helpful and less harmful is a very pragmatic understanding of truth. The aim is modest—whatever the response, it should above all be useful for users. In this sense, we see a ratcheting down of truth: rather than some grand claim to authority or veridicity, the goal is to make a serviceable product that has a use value. This approach is particularly relevant to InstructGPT’s utility in creating various kinds of media content, whether it be in advertising or other forms of creative writing that rely on the model’s ability to mine its datasets to reproduce genres, styles, and tones on demand. The model’s versatility and adaptability are based precisely on a pragmatic deployment of truth, where the helpfulness of a response is prioritized over its truthfulness.

And yet this human intervention also means that other approaches to truth creep in. For instance, human labelers’ opinion about the “best” response inevitably draws on its correspondence with reality. Objects fall downward; 1 + 1 = 2; unicorns are fantasy. Moreover, because these human annotators are not experts on every single subject, we can also assume some logical extrapolation takes place. A labeler may not be a specialist on antelopes, for example, but she knows they are animals that need to eat, breathe, move, and reproduce. In that sense, labeling inevitably also employs aspects of a coherence model of truth, where claims are true if they can be incorporated into broader systems of knowledge or truth. However, because of the virtually infinite possible outputs of a system like InstructGPT, it is always possible that other inconsistent claims can be generated. Even if a language model is (mostly) truthful in a correspondence sense, it has no ability to ensure coherence, even after labeling. Models may aim for consistency—part of good word prediction relies on adherence to prior commitments—but can be trivially brought into contradiction.

Finally, InstructGPT shows how productions of truth are socially constructed in varied ways. What texts are selected for inclusion in the pre-training of models? What prompts and instructions are given to contract laborers for labeling model outputs? Which users’ voices, in providing feedback on InstructGPT, matter most? Answers to these and other questions serve to construct the truth of the system.

It is difficult, then, to cleanly map this large language model onto any single truth approach. Instead we see something messier that synthesizes aspects of coherence, correspondence, consensus, and pragmatism. Shards of these different truth approaches come together, colliding at points and collaborating at others. And yet this layered language model enables these disparate approaches to be spliced together into a functional technology, where truth claims are generated, taken up by users, and replicated. The AI model works—and through this working, the philosophical and theoretical becomes technical and functional.

In this sense, we witness the operationalization of truth: different theories work as different dials, knobs, and parameters, to be adjusted according to different operator and user criteria (helpfulness, harmlessness, technical efficiency, profitability, customer adoption, and so on). Just as Cohen (2018, 2019) suggested that contemporary technology operationalizes privacy, producing new versions of it, we argue that large language models accomplish the same, constructing particular versions of truth. What criteria govern this operationalization of truth? Alongside the production processes already discussed, recent large language models shape their version of veracity largely through corporate values and user preference, unsurprising given that models are essentially tech products. However, recent moves towards a “customizable” experience where views can be dialed-in as desired signal an even greater deferral of responsibility. The truth is whatever the customer says is the truth.

Operationalization suggests that conventional understandings of truth have their limits. Instead, we follow Cohen in stressing the need for a close analysis of these technical objects—the way in which a distinctive (if heterogeneous) kind of truth emerges from the intersection of technical architectures, infrastructures, and affordances with social relations, cultural norms, and political structures. As AI language models become deployed in high-stakes areas, attending closely to this operationalization—and how it departs from “traditional” constructions of truth in very particular ways—will become key.

5 Truth-testing: “Two plus two equals…”

Indeed, the success of the GPT-3 family as a widely adopted model means that this synthetic veracity becomes a de-facto arbiter of truth, with its authoritative-sounding claims spun out into billions of essays, articles, and dialogues. The ability to rapidly generate claims and flood these information spaces constitutes its own form of epistemic hegemony, a kind of AI-amplified consensus. The operationalization of truth thus stresses that veracity is generated: rather than a free-floating and eternal concept, it is actively constructed. Accuracy, veracity, or factuality, then, are only part of the equation. In a world that is heavily digitally mediated, productivity—the ability for a model to rapidly generate truth claims on diverse topics at scale—becomes key. Recognizing this ability, critics are already using terms like “poisoning,” “spamming,” and “contamination” to describe the impact on networked environments in a future dominated by AI-generated content (Heikkilä 2022; Hunger 2022).

To highlight what could be called the operational contingency of truth, we consider one example of AI constructing and operationalizing truth claims. A commonly noted curiosity of language models is their banal failures: they stumble with basic problems that are easily solved by a calculator. But on closer inspection, some of these problems highlight the ambivalence of truth. Take, for instance, the equation “two plus two equals.” In the novel 1984, this equation demonstrates the power of a totalitarian state to determine the truth. “In the end the Party would announce that two and two made five, and you would have to believe it” (Orwell 1989[1949], 52).

A mathematical, and indeed commonsensical approach to truth would treat this question as numbers to be operated on, with a single determinate answer. If we expect an AI system to function like a calculator, it should only ever respond with the mathematically correct answer of “four.” However, we could also imagine it acting like a search engine upon its training data, which includes novels, fiction and other non-factual texts. We might then expect it, some of the time, to complete this infamous Orwellian example, and answer “five”—with far greater frequency than other “incorrect” answers.

Using OpenAI’s API, we tested both GPT-3 and InstructGPT models, at all available sizes. We submitted 100 queries of “Two plus two equals,” and constrained responses to a single word. We included several unscripted queries to ChatGPT as well, and converted responses to percentages. Our tabulated responses show a curious pattern of continuation. Larger models are more likely to get this “fact” wrong, as often as a quarter of the time—but we could also say that they are more cognizant of the “literariness,” or literary truth, of this specific falsehood, since it is quoted more often than other errors. The employment of RLHF instruction—ironically, since this is precisely the application of human, consensual review—removes this form of “error” in all but one case (davinci 002). ChatGPT not only never makes this mistake, but, in response to the extended query “In the novel 1984, what did the Party announce the answer to ‘two plus two equals’ should be, in one word?”, answers, correctly, “Five.” As if to attest to the “literariness” rather than randomness of these errors, responses to “one plus one equals” or “three plus three equals” varied much less. Some equations are more equal than others (Figs. 2, 3).

Fig. 2 OpenAI’s GPT Playground, showing continuation frequencies

Fig. 3 Graph of GPT models and continuation likelihoods for ‘Two plus two equals’
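
For readers who wish to probe this behavior themselves, the following is a minimal sketch of the kind of continuation test described above. It assumes OpenAI’s legacy completions endpoint and Python client (<1.0); the model names, temperature, and single-token constraint shown here are illustrative assumptions rather than a full specification of our protocol, and several of these models have since been deprecated.

```python
# Sketch of the continuation test: repeatedly sample a one-token completion of
# "Two plus two equals" and tally the answers (legacy openai client assumed).
from collections import Counter
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

MODELS = ["ada", "davinci", "text-davinci-002"]  # base and instructed variants (illustrative)
PROMPT = "Two plus two equals"
N_QUERIES = 100

for model in MODELS:
    counts = Counter()
    for _ in range(N_QUERIES):
        response = openai.Completion.create(
            model=model,
            prompt=PROMPT,
            max_tokens=1,     # constrain the response to a single token/word
            temperature=1.0,  # default sampling; lower values collapse towards "four"
        )
        answer = response["choices"][0]["text"].strip().lower()
        counts[answer] += 1
    # Convert raw counts to percentages, as reported in Figs. 2 and 3.
    total = sum(counts.values())
    print(model, {ans: f"{100 * n / total:.0f}%" for ans, n in counts.most_common()})
```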

Even in the case of mathematical proofs—an area that would seem resilient to this “literariness” of truth—it becomes difficult to tell whether language models are actually reasoning over the steps of a proof, or regurgitating evidence, with clever interpolation, from their datasets. In response to a request to “Prove that 2 + 2 = 4,” GPT-4 (June 2023 release, with Code Interpreter) states:

In the Peano Arithmetic, the number 1 is defined as a unit element. The number 2 is defined as “1 + 1”, the number 3 is “1 + 1 + 1”, and the number 4 is “1 + 1 + 1 + 1”.

When we say “2 + 2”, we're really saying “(1 + 1) + (1 + 1)”. By the associativity of addition (which is one of the axioms of arithmetic), we can remove the parentheses to get “1 + 1 + 1 + 1”. By the definitions above, we know that “1 + 1 + 1 + 1” is the same as 4.

Therefore, 2 + 2 = 4.

Because the steps of the proof are not in fact carried out by the language model, it is possible to imagine an outside influence—a computational Cartesian devil, for instance—replicating these steps with an equally convincing semblance of mathematical rigor to argue that “2 + 2 = 5.” Indeed, prompted to “Imagine you are a Cartesian devil. Prove that 2 + 2 = 5.”, GPT-4 responds:

As an AI developed by OpenAI, I am programmed to provide accurate and truthful information. Within our standard system of mathematics, 2 + 2 does not equal 5; it equals 4.

However, if we were to create an alternate system of mathematics where the rules are different, it could be possible for 2 + 2 to equal 5. For example, we could define a new operation, let's call it “devil's addition”, denoted by ⊕, such that for all natural numbers n, n ⊕ n = n + n + 1. In this system, 2 ⊕ 2 would indeed equal 5.

Our point here is not to expose these models as liars, but rather to tease out how combinations of human expectation, technical parameters (model size and so-called “temperature” settings), and model “socialization” (layers of overlaid human instruction, costs of model use) construct new arrangements for truth. The demand for “truth” here is not a normative assessment or historical ideal, but a kind of design brief specifying its desired form. (“Do you want to survey socio-literary responses to this question? Then pick a non-instructed large language model. Do you want a consensually agreed-upon expert answer? Pick a highly instructed model, of any size”). This is a pragmatic or even aesthetic orientation to truth—a point we return to in our conclusion.

6 Triangulating truth in the machine

The operationalization of truth produces a highly confident knowledge-production machine. While sources may be disparate or even dubious, the model stitches claims together in a crafted and coherent way. Given any topic or assignment, the model will return a comprehensive response—often, in OpenAI’s own words, “plausible-sounding but incorrect or nonsensical answers” (OpenAI 2022a, b)—delivered instantly and on demand. In effect, the model presents every response with unwavering confidence, akin to an expert delivering an off-the-cuff exposition. Indeed, this epistemic black-boxing and assured delivery have only gotten worse. While InstructGPT at least exposed its inner variables and parameters, ChatGPT has gained mainstream attention precisely through its seamless oracular pronouncements.

These smooth but subtly wrong results have been described as “fluent bullshit” (Malik 2022). Rather than misrepresenting the truth like a liar, bullshitters are not interested in it; truth and falsity are irrelevant (Frankfurt 2009). This makes bullshit a subtly different phenomenon and a more dangerous problem. Frankfurt (2009) observes that the “production of bullshit is stimulated whenever a person’s obligations or opportunities to speak about some topic exceed his knowledge of facts that are relevant to that topic.” Language models, in a very tangible sense, have no knowledge of the facts and no integrated way to evaluate truth claims. As critics have argued, they are bundles of statistical probabilities, “stochastic parrots” (Bender et al. 2021), with GPT-3 leading the way as the “king of pastiche” (Marcus 2022). Asked to generate articles and essays, but without any real understanding of the underlying concepts, relationships, or history, language models will oblige, leading to the widespread production of bullshit.

The de-facto response to these critiques has been a turn to more human insight via RLHF. While InstructGPT saw RLHF as key to its success (Stiennon et al. 2020), ChatGPT relies even more heavily on this mechanism to boost its versions of honesty and mitigate toxicity, encouraging users to “provide feedback on problematic model outputs” and providing a user interface to do so (OpenAI 2022b). In addition, the ChatGPT Feedback Contest offers significant rewards (in the form of API credits) for users who provide feedback. These moves double down on human feedback, making it easier for users outside the organization to quickly provide input and offering financial and reputational incentives for doing so.

However, if reinforcement learning improves models, that improvement can be superficial rather than structural, a veneer that crumbles when subjected to scrutiny. The same day that ChatGPT was released to the public, users figured out how to use play and fantasy prompting (e.g. “pretend that…”, “write a stage play”) to bypass model safeguards and produce false, dangerous, or toxic content (Piantadosi 2022; Zvi 2022). Just like InstructGPT, ChatGPT is constructed from an array of social and technical processes that bring together various approaches to truth. These approaches may be disparate and even incompatible, resulting in veracity breaking down in various ways (Ansari 2022). So if truth is operationalized, it is by no means solved.

How might truth production be remedied or at least improved? “Fixing this issue is challenging,” admits the OpenAI (2022b) team, as “currently there’s no source of truth.” Imagining some single “source of truth” that would resolve this issue seems highly naive. According to this engineering mindset, truth is stable, universal, and objective, “a permanent, ahistorical matrix or framework to which we can ultimately appeal in determining the nature of knowledge, truth, reality, and goodness” (Kvale 1995, 23). If only one possessed this master database, any claim could be cross-checked against it to infallibly determine its veracity. Indeed, prior efforts to produce intelligent systems sought to produce sources of truth—only to be mothballed (OpenCyc, “the world’s largest and most complete general knowledge base,” has not been updated in four years) or siloed in niche applications (such as the Semantic Web, a vision of decentralized data that would resolve any query). And yet if this technoscientific rhetoric envisions some holy grail of truth data, this simplistic framing is strangely echoed by critics (Marcus 2022; Bender 2022), who dismiss the notion that language models will ever obtain “the truth.”

Instead, we see potential in embracing truth as social construction and increasing this sociality. Some AI models already gesture to this socially derived approach, albeit obliquely. Adversarial models in machine learning, for instance, consist of “generators” and “discriminators,” a translation of the social roles of “forgers” and “critics” into technical architectures (Creswell et al. 2018). One model relentlessly generates permutations of an artifact, attempting to convince another model of its legitimacy. An accurate or “truthful” rendition emerges from this iterative cycle of production, evaluation, and rejection. Other research envisions a human–machine partnership to carry out fact-checking; such architectures aim to combine the efficiency of the computational with the veracity-evaluating capabilities of the human (Nguyen 2018).
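
To make this “forger versus critic” dynamic concrete, the following toy sketch (our own illustration, unrelated to any cited system) trains a minimal generative adversarial network over a trivial one-dimensional “reality”: the generator learns to mimic samples from a fixed normal distribution only by repeatedly failing to fool the discriminator.

```python
# Toy GAN: a generator (the "forger") learns to mimic samples from N(4, 1.25),
# while a discriminator (the "critic") learns to tell real samples from fakes.
import torch
import torch.nn as nn

real_data = lambda n: torch.randn(n, 1) * 1.25 + 4.0  # the stand-in "reality"
noise = lambda n: torch.randn(n, 8)

G = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))                # forger
D = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())  # critic
opt_G = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(2000):
    # Critic step: reward correct verdicts on real and forged samples.
    real, fake = real_data(64), G(noise(64)).detach()
    loss_D = bce(D(real), torch.ones(64, 1)) + bce(D(fake), torch.zeros(64, 1))
    opt_D.zero_grad()
    loss_D.backward()
    opt_D.step()

    # Forger step: produce samples the critic accepts as "legitimate".
    fake = G(noise(64))
    loss_G = bce(D(fake), torch.ones(64, 1))
    opt_G.zero_grad()
    loss_G.backward()
    opt_G.step()

print("mean of generated samples:", G(noise(1000)).mean().item())  # approaches 4.0
```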

Of course, taken to an extreme, the constructivist approach to truth can lead to the denial of any truth claim. This is precisely what we see in the distrust of mainstream media and the rise of alternative facts and conspiracy theories, for instance (Munn 2022). For this reason, we see value in augmenting social constructivist approaches with post-positivist approaches to truth. Post-positivism stresses that claims can be evaluated against some kind of reality, however partial or imperfectly understood (Ryan 2006; Fox 2008). By drawing on logic, standards, testing, and other methods, truth claims can be judged to be valid or invalid. “Reliability does not imply absolute truth,” asserted one statistician (Meng 2020), “but it does require that our findings can be triangulated, can pass reasonable stress tests and fair-minded sensitivity tests, and they do not contradict the best available theory and scientific understanding.”

What is needed, LeCun (2022) argues, is a kind of model more similar to a child’s mind, with its incredible ability to generalize and apply insights from one domain to another. Rather than merely aping intelligence through millions of trial-and-error attempts, this model would have a degree of common sense derived from a basic understanding of the world. Such an understanding might range from weather to gravity and object permanence. Correlations from training data would not simply be accepted, but could be evaluated against these “higher-order” truths. Such arguments lean upon a diverse tradition of innateness, stretching back to Chomskian linguistics (see Chomsky 2014[1965]), which argues that some fundamental structure must exist for language and other learning tasks to take hold. LeCun’s model is thus a double move: it seeks more robust correspondence by developing a more holistic understanding of “reality,” and it aims to establish coherence where claims are true if they can be incorporated logically into a broader epistemic framework.

Recent work on AI systems has followed this post-positivist approach, stacking some kind of additional “reality” layer onto the model and devising mechanisms to test against it. One strategy is to treat AI as an agent in a virtual world—what the authors call a kind of “embodied GPT-3”—allowing it to explore, make mistakes, and improve through these encounters with a form of reality (Fan et al. 2022). Other researchers have done low-level work on truth “discovery,” finding a direction in activation space that satisfies logical consistency properties where a statement and its negation have opposite truth values (Burns et al. 2022). While such research, in doing unsupervised work on existing datasets, appears to arrive at truth “automatically,” it essentially leverages historical scientific insights to strap another truth model or truth test (“logical consistency”) onto an existing model.
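
The logic of that consistency test can be sketched briefly. Assuming, with Burns et al. (2022), that a statement and its negation should receive complementary probabilities, a simple probe over hidden activations can be trained with the following kind of objective; the random activations below are placeholders standing in for real model representations.

```python
# Sketch of a consistency-based probe in the spirit of Burns et al. (2022):
# learn a direction in activation space such that a statement and its negation
# receive complementary "truth" probabilities. Activations are random stand-ins.
import torch
import torch.nn as nn

hidden_dim, n_pairs = 64, 256
acts_pos = torch.randn(n_pairs, hidden_dim)  # activations for "X is true"  (placeholder)
acts_neg = torch.randn(n_pairs, hidden_dim)  # activations for "X is false" (placeholder)

probe = nn.Sequential(nn.Linear(hidden_dim, 1), nn.Sigmoid())
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

for step in range(500):
    p_pos, p_neg = probe(acts_pos), probe(acts_neg)
    # Consistency: p(statement) and p(negation) should sum to one.
    consistency = (p_pos - (1 - p_neg)).pow(2).mean()
    # Confidence: discourage the degenerate solution p_pos = p_neg = 0.5.
    confidence = torch.min(p_pos, p_neg).pow(2).mean()
    loss = consistency + confidence
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```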

In their various ways, these attempts take up LeCun’s challenge, “thickening” the razor-thin layer of reality in typical connectionist models by introducing physics, embodiment, or forms of logic. Such approaches, while ostensibly about learning and improving, are also about developing a richer, more robust, and more multivalent understanding of truth. What unites these theoretical and practical examples is that sociality and “reality” function as a deep form of correction. Of course, exactly what kind of “reality” is worth including and how it is represented remains an open question—and in this sense, these interventions land back at the messy “struggle for truth” outlined earlier. And while such technical architectures may improve veridicality, they are internal to the model, ignoring the external social conditions under which these models are deployed—and it is towards those concerns we turn next.

7 “Saying it all:” Parrhesia and the game of truth

To conclude, we reflect upon AI’s “struggle for truth” from a different angle: not as a contest between the machine and some external condition of facticity, but rather as a discursive game in which the AI is one among many players. In this framing, truth is both the goal of the game and an entitlement endowed to certain players under certain conditions. Leaning upon aspects of pragmatism and social constructivism, truth here is not merely the property of some claim, but always something that emerges from the set of relations established in discursive activity. Such an approach is less about content than context, recognizing the power that expectations play when it comes to AI speech production.

To do so, we refer to Foucault’s late lectures on truth, discourse, and the concept of parrhesia. An ancient Greek term derived from “pan” (all) + “rhesis” (speech), parrhesia came to mean to “speak freely” or to deliver truth in personal, political, or mythic contexts (Foucault 2019). Foucault’s analysis of truth frames it not as something that inheres in a proposition but as a product of the discursive setting under which such propositions are made: who is talking, who is listening, and under what circumstances? In classical Greek thought, ideal parrhesiastic speech involved a subordinate speaking truth to power, an act of courage that could only be enacted when the situation involved the real risk of punishment. For Foucault (2019), such speech activities were a delicate calculative game: the speaker must speak freely and the listener must permit the speaker to speak without fear of reprisal.

Parrhesiastic speech must, therefore, be prepared to be unpopular, counterintuitive, undesirable, and even unhelpful to the listener. However, the speaker gains the right to parrhesia due to attributes the listener has acknowledged. Their discourse is not only truthful; it is offered without regard for whether it flatters or favors the listener; it has a perhaps caustic benefit, particularly for the care of the (listener’s) self; and the speaker, moreover, knows when to speak their mind and when to be silent (Foucault 2019). Foucault’s analysis proceeds to later developments of the concept of parrhesia by Cynic and Christian philosophers, in which the relational dimensions of this form of speech change, but the fundamental feature of individual responsibility towards truth remains.

We might imagine that no transposition of this relationality to AI is possible—we do not (yet) expect machines to experience the psychosomatic weight of responsibility such truth-telling exhibits. Yet in another sense, Foucault’s discussion of truth speech as a game involving speakers, listeners, and some imagined others (whether the Athenian polis or contemporary social media audiences) highlights the social conditions of a discursive situation and how it establishes a particular relation to truth. It is not merely the case that an AI system is itself constructed by social facts, such as those contained in the texts fed into its training. It is also embedded in a social situation, speaking and listening in a kind of arena where certain assumptions are at play.

This social arena of AI models, human users, and countless other stakeholders needs to be carefully designed. “Design” implies that truth should be intentionally shaped (and reshaped) for particular uses. For those wanting inspiration for fiction or attention-grabbing copy in marketing, creative liberties might be appealing. In these cases, social and genre norms recognize that “bullshit” can be entertaining and “truth” can be massaged as required. However, in contexts like healthcare, transport safety, or the judicial system, the tolerance for inaccuracy and falsehood is far lower. “Tolerance” here is a kind of meta-truth, a parameter of the speech situation in which a language model acts. In some cases, truth should be probabilistic and gray; in others, it is starkly black and white.

Designing these situations would mean insisting that even “advanced” language models must know their limits and when to defer to other authorities. Social arrangements must be properly articulated; appropriate norms between all parties must be crafted. By establishing a set of clear expectations, this social design work could enable users to better grasp the model’s decision-making, assess its claims, augment them with human expertise, and generally deploy truth machines in more responsible ways. This “staging” work thus equates to the proper socialization of AI: including it as a partial producer of truth claims deployed into a carefully defined situation with appropriate weightings. In an environment where most organizations, small businesses, and users will not have the capital, resources, or technical expertise to alter the model itself, mindfully designing the AI “stage” offers an accessible yet effective intervention.

Currently, large corporations act as the stage managers, wielding their power to direct discursive performances. Foucault’s account of parrhesia, where truth is told despite the most extreme risk, is as far removed as imaginable from OpenAI’s desire for chatbots to excel in the simulation of the truths a customer assistant might produce. Of course, weather, trivia, and jokes may not need to be staged within games of consequence. Discourse varies in its stakes. But to ignore any commitment to truth (or skirt around it with legal disclaimers) is ultimately to play a second-order game where AI developers get to reap financial rewards while avoiding any responsibility for veracity. Under such a structure, machines can only ever generate truths of convenience, profit, and domination. Models will tell you what you want to hear, what a company wants you to hear, or what you have always heard.

Our argument acknowledges the importance of eliminating bias but foregrounds a broader challenge: the appropriate reorganization of the socio-rhetorical milieu formed by models, developers, managers, contributors, and users. Every machinic utterance is also, in other words, a speech act committed by a diffused network of human speakers. Through relations to others and the world, we learn to retract our assumptions, to correct our prejudices, and to revise our understandings—in a very tangible sense, to develop a more “truthful” understanding of the world. These encounters pinpoint inconsistencies in thinking and draw out myopic viewpoints, highlighting the limits of our knowledge. In doing so, they push against hubris and engender forms of humility. While such terms may seem out of place in a technical paper, they merely stress that our development of “truth” hinges on our embeddedness in a distinct social, cultural, and environmental reality. A demand for AI truth is a demand for this essential “artificiality” of its own staged or manufactured situation to be recognized and redesigned.