Joint attention is one of the hallmarks of human sociality. Developing the skill of joint attention sets the stage for learning about the world and for understanding the minds of those we share it with. Joint attention’s anthropological value and significance for human development stand in contrast to the ease and effortlessness with which most of us accomplish it on a daily basis: joint attention is easily overlooked. Few parents pause in awe when their one-year-old flashes a smile at them the moment their favorite tune erupts or when their infant shouts “look” while pointing to the moon and glancing over at them. But these instances are truly remarkable. Humans are the only creatures that show this behavior by which experiences become interpersonally shared. Today, we know that if an infant rarely engages in this sort of behavior (if she lacks the drive to turn individual experiences into shared ones), she misses out on key experiences that ground her in the cultural world and provide the soil for her further social, cognitive, and emotional growth. It is in joint attention that infants come to adopt their parents’ stance toward the world (Hobson 2002) and to handle objects and confront situations in the ways their community handles and confronts them. Recognizing these vast implications of joint attention for a person’s individual development and for human society at large, Carpenter et al. (1998, p. 2) compared joint attention to the “crossroads where human infants meet the world of collective cognition in which they will remain for the rest of their lives”.

After almost 50 years of research, we know a great deal about the development of joint attention, its significance, and the importance of early treatment if joint attentional abilities are lacking or impoverished. We have also gained more conceptual clarity on what joint attention is and have come to separate it more thoroughly from instances in which two or more individuals are focused on the same thing without jointly attending to it. Joint attention remains an intensively studied area of research: a search for publications with “joint attention” as keyword yields over 35,000 hits for the current year of 2023 alone (it is June of that year as I finish this manuscript). One reason for the continued interest is that after Jerome Bruner and colleagues (Bruner 1981; Scaife & Bruner 1975) first shone the spotlight on joint attention for an audience of developmental psychologists and linguists, researchers from other fields, including philosophy, clinical psychology, anthropology, educational psychology, and, more recently, robotics, have come to realize how crucial joint attention is for language development, an understanding of other minds, and the coordination of interpersonal interaction. In line with Bruner’s pragmatist orientation and his urge to think of joint attention within joint activity, much of today’s empirical and conceptual work on joint attention discusses its role and function in the context of cooperative action (Tomasello 2019; Wilby 2023). Without the ability to put our heads together over objects or topics of common interest, it would be very difficult, if not impossible, to cooperate even on a small scale, let alone create and maintain social institutions and societies, because all of these things require that we knowingly direct our minds to the same problem or state of affairs and address it together.
In a (difficult-to-imagine) counterfactual world with no capacity for joint attention, human perception and intentional action would look much more like their analogs in non-human apes than like our actual, joint-attention-informed, ways of seeing and acting on the world.

In this paper, I focus on the ontogenetic beginnings of joint attention. In the first part, I will attempt to define joint attention, and I will survey our current understanding of the development of joint attention in infancy, its human-uniqueness, and its fundamental importance for the child’s cognitive development more generally. In the second part, I will point out empirical and conceptual issues that remain unresolved. These include questions about the ontogenetic roots of joint attention and its relation to the even more basic skill of engaging in “protoconversations”, which is already in place in early infancy. Another issue concerns the evolution of joint attention. I will introduce two origin stories that give very different explanations of how joint attention came into being, and I will discuss the difficulties of evaluating these narratives. Identifying areas of ignorance and confusion serves to channel our research effort in what I take to be promising directions.

1 What We Know

1.1 What Is and What Is Not Joint Attention

We have come to better delineate what are and what aren’t cases of joint attention. Malinda Carpenter (e.g., Carpenter & Liebal 2011; Siposova & Carpenter 2019) and Tomasello (1995, 1999) in developmental psychology, and Eilan (2005, in press), John Campbell (2005), and others in philosophy, have helped us gain more clarity on what joint attention is and how it is achieved. Working definitions put forth by empirical researchers had falsely suggested that for two people to be in joint attention, they only need to focus on the same object at the same time (e.g., Nagai et al. 2003). Take as an example Gernsbacher et al. (2008, p. 38), who claim that “when one person directs his attention to another person’s focus of attention, the two people are said to be in joint attention”. This is too loose a definition (see Kaplan & Hafner 2006), as it erroneously includes cases in which minds don’t meet but pass each other like ships in the night. In fact, definitions like these are much too wide, as they would capture not only cases in which agents who are aware of one another’s attention as being directed at the same object are not sharing their individual experiences, but also instances in which one or both attender(s) remain unaware that there is another agent concentrating on the same object. For an act of attention to count as joint attention, two (or more) persons need not only to focus on the same object at the same time, but to do so together, as a unit or a “plural subject” (Gilbert, 2007) rather than each individually or for themselves.

We have also learned how this plural subject of joint attention is achieved. Crucially, it does not emerge simply from the co-attenders’ knowledge that the other’s attention is on the same object as their own. Joint attention is not a problem of mutual ascription of epistemic or attentional states (see Kietzmann 2018, for a critique of this model in the context of joint action). If I know that you are looking at this thing, and you know that I am looking at it, too, this does not put us into joint attention. All that this would mean is that, on top of attending to the same thing, we are each—individually or for ourselves—going through a mentalizing exercise whereby we ascribe an attentional state to the other. The problem with this picture of symmetrical attention ascription (see Eilan, in press) is not that it lacks layers of recursive embedding. Take a scenario in which both attenders simultaneously think the thought “I know that you know that I’m looking at this thing”. Even with this thought going through the two people’s minds in parallel, they would not thereby wind up in joint attention. Imagine two people window-shopping at a shoe store. They are standing next to each other in front of the shop, peeking through the glass and eyeballing the same pair of shoes. One of the attenders might not even be aware of the other’s being focused on the same shoes. This is obviously not a case of joint attention. But what also does not put the two attenders in joint attention is knowledge that the other is looking at the same pair of shoes (perhaps fearing that the other will go in and purchase the last pair before they can). This does not change by adding layers of iteration, such that each window-shopper knows that the other knows that they are attending to the same shoes; all this gives us are two individual attenders symmetrically thinking about how they figure as a subject of attention in the other’s mind. 
We would still lack the plural subject, “we”, or the “being-in-this-together” that characterizes joint attention.

Something else, something other than the symmetrical ascription of nested mental representations, is needed for the two to be in joint attention. As Naomi Eilan (in press) has persuasively shown, this “something else” can only be a communicative act. The persons would need to turn to one another and engage in some sort of exchange by which the object becomes shared. In the shoe example, one attender could glance over to the other and say “Pretty!”, with the other nodding, or both of them could smile at one another and raise their eyebrows. In her emphasis on the need for an exchange of this sort in joint attention, Gilbert (2007) makes recourse to Charles Taylor’s (1985) notion of “entre nous”: an object only becomes an object of joint attention for us when we engage in some sort of verbal or non-verbal exchange about it. Taylor gives the example of a train rider who makes a sweat-wiping gesture across her forehead for a co-present rider to see while uttering “Phew!” to remark on the heat. This referential gesture opens up a public space between the two agents and fills it with the shared object. The sharedness of the attenders’ attention is now “out in the open” (Campbell, 2005) between them. Before the exchange, the object was not “entre eux” or “out in the open” between them. The heat may have been unignorable and on both passengers’ minds, but it wasn’t shared. In short, joint attention requires not only that agents are aware that the other is attending to the same thing they are attending to, but also that there is a communicative exchange in which the agents acknowledge the shared nature of the experience.

Some might think that this notion of joint attention is overly strict. But without the criterion of an exchange, there seems to be no way of knowing that the object is shared. In a sense, I take myself to be simply spelling out what seems to be implied in the widespread depiction of joint attention as a triangle with a double-headed arrow connecting two agents: the only thing that can make this arrow bidirectional is an exchange. Note that the communicative act can be ever so small or subtle. Campbell (2011) gives the example of a “coordinated attack”, in which a dyad of agents decides to aim for a target after one of them responded to the other’s pointing gesture with a grin. Invoking the referential-exchange criterion rules out cases that psychologists have problematically considered to be joint attention. For example, the behavior of gaze-following, defined as “looking where someone else is looking” (Butterworth et al., 1991, p. 25), is commonly referred to as “visual joint attention” (e.g., Corkum & Moore 1995). Tracking another’s gaze with one’s eyes, with the result that one sees what the other is seeing, has obvious adaptive value: another’s shift of focus might signal the presence of a salient object or event, such as food or a threat. Various animals, including apes (Okamoto-Barth et al., 2007), dogs (Met et al. 2014), ravens (Bugnyar et al., 2004), turtles (Wilkinson et al. 2010), and cats (Pongracz et al., 2019) can follow the gaze of a conspecific or a human being to objects in their environment. But according to our definition, unless the attenders “close the triangle” by some communicative act, their gaze shifts fall short of joint attention.

The same holds for point-following. Gernsbacher et al. (2008, p. 42) claim that by “turning their heads in the direction of another person’s point”, infants wind up in joint attention with the person who pointed. Although less widespread than gaze-following, following another’s deictic gestures is also not human-unique (see McCreary et al. 2023, for a meta-analytic review). Dogs, for example, look to where humans point with some reliability (Kaminski & Nitzschner, 2013), as do enculturated apes (Moore, 2013). But as we made clear in the preceding paragraphs, two agents looking at the same thing, regardless of whether their attention converges on a target because one agent looked or pointed there first, is not enough for their being in joint attention. There needs to be some sort of exchange or mutual acknowledgement that renders the object shared.

A note on the production of pointing is due here. Bates and colleagues (1975) introduced a famous distinction based on the motive with which points are produced. “(Proto)imperative” points are a request to act instrumentally on the pointed-to object, such as to hand it to the one who pointed. “(Proto)declarative” points are a request to share attention to the pointed-to object with the addressee (see Footnote 1). Despite this distinction, pointing is often generally considered a joint attentional behavior, regardless of the motive. Corkum & Moore (1995, p. 61), for example, write that “joint attention plays an integral part in both […] protodeclarative and protoimperative gestures.” But this classification is at odds with our definition of joint attention. By pointing imperatively, infants (and enculturated apes) might do nothing more than request that the addressee make the object available to them. This can be accomplished with or without the two parties sharing their attention to the object in the communicative way laid out above. The agent who points might simply have learned that extending the index finger is likely to make the addressee hand her the object, and the addressee might simply know that that is what the agent wants, without the two making eye contact or the like.

What this suggests is that declarative pointing, but not necessarily imperative pointing, kickstarts bouts of joint attention between pointer and recipient. In support of this assessment and of the conceptual separation of two motives for pointing, studies indicate that declarative points have distinct motor profiles (Cochet et al. 2014), involve different brain activity patterns (Committeri et al. 2015), and follow different ontogenetic pathways (Tomasello, 2008, 2019) compared with imperative ones. Furthermore, declarative, but not imperative, points are predictive of language proficiency (Colonnesi et al., 2010), theory of mind skills (Camaioni et al. 2004; Tomasello 1995), and an absence of autism spectrum disorder (ASD; see Camaioni et al., 2003). Although these differences between imperative and declarative gestures have been known for a while, this knowledge has not led to the recognition that only some uses of pointing gestures are coupled with joint attention.

1.2 Joint Attention Is Human-Unique

Today, we know that joint attention is a behavior that is routinely and reliably shown only by humans. Some animal researchers, most prominently David Leavens and Kim Bard (e.g., Leavens 2011; Leavens & Bard 2011), have tried to argue the opposite and to deliver (mostly anecdotal) evidence that non-human apes are capable of joint attention. They claim that captive apes show and hold up objects for shared attention to human caregivers. An example given by Leavens & Racine (2009) is one in which “Gua”, a chimpanzee raised and studied by the Kelloggs (Kellogg & Kellogg 1933), allegedly produced a declarative gesture by pointing to her nose in response to her caregiver’s request to “Show me your nose”. Apes who point have typically undergone an extensive enculturation process (see Moore, 2013), but there are a few reports of apes using indicative gestures in the wild. Pika & Mitani (2006) observed that adult male chimpanzees from the Ngogo group in Uganda often perform a so-called “directed scratch” during grooming: the ape scratches a particular location on its body to indicate to its addressee, the groomer, where it wants to be scratched. Although impressive, these cases seem to be imperative gestures or responses to such gestures by others. The Gua example can be interpreted as an associative response to another’s request to show the referent of the label “nose”. Similarly, the apes that direct their groomer’s activity to a specific locus on their bodies request an instrumental activity. In neither case need we assume that signaler and addressee engaged in attention sharing.

So far, no non-human animal has unequivocally proven to satisfy the communicative condition that we identified as key in joint attention. Apes don’t seem to use eye-to-eye contact or equivalent means to enter a meeting of minds (see Kano et al., 2012, for indirect support of this claim; see also Kano et al., 2015, for species differences between apes, with bonobos looking to faces more than chimpanzees). The reason why enculturated, rather than wild apes’ social responsiveness might resemble joint attention is that they experience a species-atypical ontogeny: one in which a cooperative (human) partner draws their attention to objects, offers them valued resources (food, etc.), and rewards their effort to coordinate their attention with the human. It is only within this culturally framed environment that apes show joint attention-like behaviors with some regularity (Tomasello & Call, 2004). But even if they are enculturated, there is no convincing evidence to show that apes use or respond to gestures to share experiences with others.

Social phenomena that come closer to the “real deal” of joint attention can be observed in encounters between humans and dogs. Here is a personal anecdote that, I am sure, resembles encounters with dogs many others have had. I took a walk with a friend and her fetching dog. The dog and I played ball. I’d throw the ball and the dog would collect it and drop it on my foot so I would throw it again. After a while, I got sick of the muddy ball soiling my boots, so I stopped picking it up. I acted like the recalcitrant partner in Warneken and colleagues’ (2006) study on the cooperative skills of infants: after engaging in a cooperatively structured activity of throw-and-fetch, I ceased to play my part. Similar to the infants’ reaction in that study, the dog tried to reanimate me, dropping the ball repeatedly on my boots, making noises of impatience and looking up to my face as if to plead or check my face for signs that I was giving in to its bids.

As with human-reared apes, the cultural environment in which dogs are raised is key: from the beginning of their ontogeny, dogs interact with humans who look at, pet, provide food, and engage in play activities with them (see Footnote 2). But in addition to this ontogenetic background, dogs, unlike apes, have also been selectively bred to be friendly and socially attuned to humans (Hare et al. 2002; Hare & Tomasello, 2005). In contrast to enculturated apes then, dogs are not only ontogenetically but also phylogenetically prepared to engage in human social interaction, with the effect that human-dog encounters involving objects can have the flavor of joint attention. More research with dogs interacting with humans is needed to be able to decide with greater certainty whether these interactions only resemble or might qualify as genuine acts of joint attention.

1.3 Joint Attention Is the Birthplace of Perspectival Knowledge and a “Theory of Mind”

We now know that joint attention is the birthplace of a great many social-cognitive abilities that children come to develop in their toddler and preschool years. Elsewhere, I argued that joint attention is not “packed” with social cognition, as the ascriptive model of joint attention mentioned above suggests, but instead is the birthplace of infants’ sensitivity to others’ mental lives (Moll & Meltzoff, 2011). It is within joint attention that toddlers come to consider alternative perspectives and develop what has (controversially) been called a “theory of mind”: the ability to discern what others believe, know, want, etc. Longitudinal studies that tracked the development from infancy to childhood have revealed positive associations between children’s joint attentional skills at age 1 and their theory-of-mind capacities two to three years later. Charman and colleagues (2000) found in a small sample (n = 13) of typically developing children that those who tended to shift their gaze between an agent and an object in episodes of joint attention at 20 months scored higher on tests from a theory of mind battery (including, e.g., the classic change-of-location task) at 44 months than those who rarely engaged in triadic joint attention in infancy.

In a study with a larger sample and longer longitudinal reach, Sodian & Kristen-Antonow (2015) confirmed the relation between joint attention and theory of mind. Infants who, at 12 months, pointed to objects with the motive to share attention (declarative pointing) showed greater understanding of false beliefs at 50 months than did children with less-developed joint attentional skills at 1 year of age. The predictive relation was shown to hold after influences of gender and language abilities were accounted for. A number of other longitudinal projects with children with ASD have furthermore revealed that joint attentional capacities in infancy or the preschool years are positively related to language proficiency and to the child’s social skills during the school years (Mundy et al. 1990; Sigman et al. 1999; Stone & Yoder 2001).

Experiments in which joint attention was directly manipulated also suggest that sharing experiences is key for infants to get a grip on the minds of others, confirming Tomasello’s (1995) claim that joint attention is the bedrock of an understanding of other minds. In a series of experiments with children in the second year of life, my colleagues and I set out to discern under what social conditions infants are able to track what other agents have and have not witnessed or experienced (Moll & Tomasello 2007; Moll et al. 2007, 2008). The experiments relied on a test procedure by Tomasello & Haberl (2003), who had found that 1-year-olds distinguish between what other agents do and do not know, with “know” signifying “being familiar or acquainted with” something (in the way one knows places, people, or objects), not endorsing a true proposition. In that study, 1-year-olds played together with an adult with two novel objects in turn. The adult then left the room and the infants explored a third novel object together with a research assistant. When, moments later, the adult who had been absent for the third toy returned, looked at the cluster of three toys, and ambiguously and excitedly requested from the infant “that one right there”, the infants reliably selected the toy that was new for the adult, suggesting that they tracked what their interaction partner was and was not familiar with. Expanding on this finding, we varied the conditions under which the mutually known objects (toys 1 and 2) were explored, and we observed that joint attentional engagement is key for infants’ ability to register other agents as being familiar with objects. More specifically, when 14-month-olds shared attention with the adult to toy 1 and toy 2, they were later able to determine that the adult requested the third, unshared, object.
However, when the infants explored toys 1 and 2 individually, with the adult in the role of an unengaged onlooker, infants failed to disambiguate the adult’s request (Moll & Tomasello 2007; see Footnote 3). The same was true in another experimental condition in which children were not active participants but third-party observers of a joint attentional scene between two other agents (Moll et al. 2007).

This group of experiments testifies that participation in joint attention is an entry gate into other persons’ minds. By joining others in attention, infants become aware of what others are and are not experiencing. Importantly, the effect is not reducible to differences in the amount of visual attention paid to the co-attender. In the condition in which the child onlooked as the adult individually engaged with the toys, infants looked much longer (about 30s) at the adult interacting with the toys than they did when they jointly attended to the toys with the adult, in which case they only spent a small fraction (approx. 3 to 5s) of the one-minute-long episode of joint attention looking to the adult’s face. This captures the fact that the child’s co-attender is not to be thought of as a second object, with the child dividing her attention between the thing and the other person. Brief moments of eye contact are sufficient to ensure that the intervals between these looks are part of one extended episode of joint attention.

As infants get older, they outgrow their strict dependence on joint attention for determining what others have and have not experienced. In Moll & Tomasello’s (2007) study, infants of 18 months of age were equally able to discern which of the toys was new for the adult when they had onlooked as the adult engaged individually with the toys, as when they had shared these toys with her in joint attentional engagement.

Overall, these experiments demonstrate the primacy of second-personal relations in the young child’s life. Eilan (in press) and others have convincingly demonstrated that partners in joint attention necessarily stand in a second-personal relation: one in which they “adopt an attitude of mutual address” (ibid., p. 7). As she makes clear, there is a mutual dependency, such that I can only be in a you-relation to you if you stand in the same relation to me. Initially, human children learn about others, their minds, and experiences strictly within the cocoon of joint attention. As they approach their second birthday, children become better able to understand the actions and minds of third persons, i.e., of agents with whom they are not currently performing joint actions or sharing attention to objects but with whom they could potentially engage in such a relation of mutual address.

1.4 Joint Attention and Mental Well-Being: Autism Spectrum Disorder (ASD) and Williams Syndrome (WS)

How important it is to build the capacity for joint attention by around 1 year is also revealed by neurodevelopmental conditions in which this ability is notably impacted: autism spectrum disorder (ASD) and Williams Syndrome (WS).

It has long been known that many of the social-relational problems associated with autism begin with impaired joint attention. Retrospective (post-hoc parental reports and video analyses of past patterns of social interaction) and prospective (screening) measures have revealed that the first tell-tale sign of autism is infrequent participation in joint attention (e.g., Charman, 2003). Mundy and colleagues (e.g., Mundy & Newell 2007) famously distinguish between initiating joint attention (IJA), such as when a child points to or holds up an object for another person, and responding to bids for joint attention (RJA; see Footnote 4), which is typically measured by the child’s shifting her head orientation toward where another is looking (gaze following). The authors state that these two modes of joint attention rely on separate cognitive processes, and that autism is mainly a disorder of IJA, not RJA. But note that the operationalization of RJA as shifting one’s focus to the target of another’s visual attention does not meet our criteria for joint attention. As we observed earlier, joint attention is more than one person attending to the same thing another is attending to. Although gaze following might turn into joint attention, it is by no means guaranteed that gazer and gaze follower wind up in joint attention: the key question is whether their eyes interlock or some other form of social contact-making follows after they each focused on the target. For this reason, I am skeptical of the idea that there are fundamentally different kinds of joint attention, only one of which (IJA) is markedly impaired in autism. The dissociation that Mundy and colleagues believe to have found most likely reflects the fact that holding up, showing, and gesturing toward things is more often done with a motivation to share attention than is gaze or point following.
My hunch is that children with ASD find it similarly difficult to make their way into joint attention by responding to another’s bid for it as they do by inviting others to share attention with them.

Setting these incongruencies aside, it is widely agreed that autism shows us how foundational joint attention is for healthy development. Longitudinal studies comparing autistic samples and samples of typically developing children have confirmed that low engagement in joint attention in infancy has negative cascading effects, including delayed language abilities, a reduced capacity for effective social interactions, and limited theory of mind skills (Charman et al. 2000; Mundy et al. 1990; Sodian & Kristen-Antonow 2015).

We ought to remind ourselves that joint attention is not reducible to a desire for social contact, which brings us to another, rarer, neurodevelopmental disorder in which joint attention is also impaired: Williams Syndrome (WS). Unlike ASD, for which no genetic marker has been detected, WS is known to result from a specific genetic abnormality (the deletion of 25 or so genes on Chromosome 7, see Järvinen-Pasley et al. 2008). WS provides a contrast foil to autism in that individuals with this disorder tend to overindulge in social interaction (Jones et al. 2000; Järvinen-Pasley et al., 2013). Children with WS take delight in eye-to-eye contact and dialog. They thrive on the back-and-forth of second-personal interaction and go to great lengths to make and maintain social contact. They enjoy interacting with others so much that they might neglect the difference between friends and strangers and share similar information with both or seek their proximity to similar degrees (Doyle et al. 2004; Riby et al. 2017).

With this brief introduction to WS’s social phenotype, one might think that joint attention comes easily to those afflicted by it because they are keen to elicit social exchange. But this view is shortsighted as it overlooks the world-orientation that is part and parcel of joint attention. In their engagements with others, children with WS tend to prioritize dyadic interaction but express relatively little interest in or curiosity about the shared physical world. Although their language skills have been described as relatively intact or even impressive compared to their challenges in other domains (e.g., spatial cognition, see Gray et al. 2006), children with WS strongly lean toward using language to establish “phatic communion” (Malinowski 1936). In Karl Bühler’s (1934) terms, they draw on language for its “appealing” (or conative) function, and less for its “representational” function. Fitting this description, research has shown that individuals with WS have similar difficulties with triadic joint attention as do those with autism (Vivanti et al. 2017). What differs between the two groups is why they struggle with triangulation. Risking oversimplification, we might say that persons with ASD find it hard to share the world as they experience it with other persons (by closing the triangle of joint attention), whereas those with WS, on the flipside, tend to satiate their drive for intersubjectivity in dyadic encounters instead of jointly opening up to the world. As a study by Liszkowski et al. (2004) nicely shows, neurotypical infants strike a balance between the two extremes. When pointing out an interesting event for others, they tend to be neither satisfied when the addressee only concentrates on the object that they pointed out for her, nor are they content when the addressee devotes all of her attention to just them. What they instead expect, as shown by persistent points if things turn out differently, is for their addressee to identify the referent and then “bring the case home” by looking back to the child to share the experience.

2 What Is Not Yet Known or Poorly Understood

In the remainder of this paper, I will briefly discuss what I take to be important issues of joint attention that have not been resolved. There are, of course, further unresolved issues (such as the role joint attention plays in language acquisition), but I have mainly wondered about two sets of problems or questions, and I think that future research on joint attention would do well to take a closer look at them. The first concerns the relationship in which triadic joint attention stands to the earlier-emerging capacity for “primary intersubjectivity” (e.g., Trevarthen & Aitken 2001), which lets infants, soon after birth, make social contact with others of their kind. Although triadic joint attention goes beyond the dyadic exchange of emotional displays in primary intersubjectivity in important ways (because it forms the basis for the idea of an objective world and of agents’ subjective perspectives on that world), the key, interpersonal, ingredient of joint attention that has been at the center of this article seems to already be in place in early infancy. The second set of questions pertains to the evolution of joint attention: very few scholars have speculated on the phylogeny of this unique skill, and their proposals are quite different from one another. Finding evidence in support of any of these proposals, or of alternative origin stories, would help us to better understand how humans’ unique sociality came into being.

2.1 The Relation Between Primary Intersubjectivity and Joint Attention

In a recent paper, my colleagues and I tried to show that the development of joint attention has antecedents that might be no less foundational and transformative than joint attention itself is (Moll et al. 2021). Shared intentionality, which provided the conceptual framework of many of the studies reported in this paper, famously considers joint attention a major anthropological difference-maker and first landmark of human-unique cognition and sociality (e.g., Tomasello et al. 2005). In this view, joint attention emerges rather abruptly at 9 to 12 months, when infants are said to first engage in genuinely intersubjective relations. The emergence of triadic joint attention has, in the shared intentionality account, been likened to a revolution, because the child now, for the first time, views others as persons like herself: individuals who have attentional states that can be directed and shared (e.g., through pointing and exchanged looks).

A problem is that shared intentionality theory seems to downplay what happens prior to the emergence of joint attention. By the age of only two months, infants already step into a dialog with others in which they exchange affective expressions. These young infants maintain eye contact and coo and smile at others in what are known as episodes of primary intersubjectivity (Trevarthen & Aitken 2001). Although these exchanges are strictly dyadic, with no external object “entre nous” that mediates the mutual engagement, it is not clear why, as the shared intentionality thesis claims, they should fall short of being intersubjective: there are two subjects who, for all we know, experience themselves as separate individuals rather than as an undifferentiated bundle (otherwise, why would they express themselves to each other?) and who encounter one another in face-to-face, I-thou dialogs. There seems to be no good reason to assume that this is not already intersubjectivity (see Moll et al. 2021, for more).

What might be implied by this is that the 9-month-old who starts to share her interest in the world with others is not quite the revolutionary that shared intentionality theory makes her out to be. Instead, this infant seems to be continuing a journey that was species-unique from the start. This is suggested by the fact that infants’ behavior in primary intersubjectivity already looks different from what has been observed in other apes. As mentioned above, it is fairly uncommon for apes to be face-to-face and make extended eye contact in dialogical fashion (Kano et al. 2012). A unique form of sociality, with a pronounced drive to meet other minds in second-personal relations, is present from the beginning of human life and seems absent in any other kind of life.

The shared intentionality thesis has tried to distance joint attention from its earlier, dyadic, analog not just conceptually but also temporally, by assuming that joint attention emerges suddenly as the child nears her first birthday. But this, too, needs reconsideration. Primary intersubjectivity seems to pave the way for a more gradual emergence of the capacity for triangulation, with some research suggesting that joint attention is already starting to take shape at the half-year mark. In a longitudinal study following infants from 5 to 9 months, Striano & Bertin (2005) found that many infants between 5 and 7 months showed joint attentional looks to their interaction partners. These looks increased over time and, by 9 months of age, were often accompanied by smiles. Other work by Striano and colleagues (Striano, 2004; Striano & Stahl, 2005) suggests that by 6 months, infants prefer to interact with adults who encourage joint attention by shifting their gaze between the infant and an object over adults who exclusively have their focus on them. Longitudinal studies by Adamson and colleagues likewise suggest continuous changes in the form, complexity, and frequency of joint attention (Adamson et al. 2004; Adamson et al. 2014), in which parents play a greater supportive function early in the process and symbols are increasingly infused in the shared activities. These findings point to an earlier and more continuous development of triadic joint attention than has been suggested.

More evidence that a budding capacity for joint attention is present before the 9-month mark comes from a recent longitudinal study with 6- to 10-month-olds (Salter & Carpenter, submitted [in press]). The authors set up a test situation in which interesting sights and sounds emerged in bursts behind an experimenter’s back but in front of the infant. In this scenario, infants as young as 6 months made active attempts to engage the adult in joint attention, as shown by sharing looks and smiles (but not pointing gestures, which do not emerge until a few months later). Sharing behaviors became more numerous with age, but the increases were not significant between any two consecutive months, again suggesting a gradual emergence of joint attention across the second half of the first year of life.

What this suggests is that human-unique sociality does not suddenly pop up in mid-infancy in the form of joint attention. Rather, joint attention can be regarded as an extension, albeit a significant one, in which an object of shared interest gets incorporated into an infant-adult dialog that has been alive and well since shortly after birth. Questions that remain to be addressed pertain to the process by which infants move from dyadic to triadic interactions. There is also the big, lingering question of whether experiences can only be shared if there is some third object over which the two parties unite as one body or unit, or whether the dialogical back-and-forth of affective displays in primary intersubjectivity qualifies as the sharing of an experience—rather than just the mutual expression of individual experiences. Empirical and conceptual work remains to be done to illuminate humans’ first steps toward joint attention, before one-year-olds skillfully and routinely triangulate through gestures and words.

2.2 The Evolution of Joint Attention

Complementing the ontogenetic story of when and how joint attention emerges, Tomasello and colleagues proposed an evolutionary narrative which they call the “interdependence hypothesis” (Tomasello et al. 2012; Tomasello & Gonzalez-Cabrera 2017). The hypothesis claims that there were two big evolutionary events whereby modern humans’ ability for shared intentionality came into being. Following the assumption that ontogeny mirrors phylogeny (Tomasello 2019, p. 8), the two postulated evolutionary steps map onto major milestones in the early ontogeny of social cognition. Because joint attention, in this conception, is viewed as the first ontogenetic manifestation of shared intentionality, its evolutionary emergence likewise marks the first step toward modern humans’ sociality. Roughly, the theory states that around 400KYA, when Homo heidelbergensis emerged, ecological conditions made it increasingly difficult to individually secure food resources and instead demanded techniques of collaborative foraging. For hominins to survive, they had to become able to form joint goals (e.g., killing a mammoth) and collaboratively act toward these goals from their respective, interrelated roles and perspectives (e.g., I attack from this side and you from the other side). This represents the dual-level infrastructure of joint attention, with the sharing of the target object at the top level (there is “our” mammoth) and the differentiation into distinct roles and perspectives (me from here and you from over there) at the ground level. There was variation in individuals’ propensity to engage in these sorts of joint attentional and cooperative endeavors, and those with more advanced skills were selected for.

The second step in the evolutionary emergence of shared intentionality is less relevant here, because joint attention is in full bloom by the time modern Homo enters the scene and rises to a new level of togetherness. Suffice it to say that this step is said to have occurred around 150KYA, when ecologically driven selection pressures forced Homo sapiens sapiens to unite into larger bands and form cohesive cultural groups, with members who acknowledged and preserved the in-group/out-group difference. Psychologically, a “we-mindset” with a new sense of collective belonging was born. Ontogenetically, this second step is said to map onto 3-year-olds’ newly gained understanding of themselves as members of larger social entities or groups with which they identify and whose norms they begin to enforce (Tomasello 2019).

This origin story of joint attention has been challenged by anthropologists. Taking a life history approach, Hawkes (2014) argued that if joint attention really emerged due to selection pressures on adults and their modes of sustenance, then it is mysterious why joint attention develops at an age when hunting activities are arguably still in the distant future. Hawkes objects that the interdependence hypothesis cannot explain why humans’ abilities for joint attention shape up in infancy and not, as the proposal should instead predict, in youth or emerging adulthood. Because of this mismatch, it is questionable, Hawkes argues, whether joint attention was the evolutionary solution to problems related to foraging.

But the story that Hawkes (2014) herself offers, following Hrdy (2007, 2011), might not entirely convince either, at least not those whose explanandum is triadic joint attention specifically, rather than the evolution of human social-cognitive or social-emotional competence more generally. The authors defend the cooperative breeding hypothesis, according to which early hominins, before they developed big brains or language, started to rely on a child-rearing technique that splits the burden of childcare between the mother and other adults from the group (alloparenting). Cooperative breeding allows mothers, among other things, to maintain short intervals between the births of their offspring. As a result, hominin infants in the Pleistocene faced new selection pressures, as they needed to compete with other infants for the limited attention and care of potential alloparents. The evolutionary solution to this problem was to speed up infants’ social-cognitive and social-emotional maturation. To ensure nonmaternal care, infants had to captivate others’ attention and become skilled at reading their intentions. This seemed necessary because infants found themselves passed back and forth between different persons, each with a mind of their own: hence, the infant mind-reader was born.

The cooperative breeding hypothesis, it seems, cannot straightforwardly explain why joint attention, in particular, is an important skill for infants to have. Hawkes (2014) and Hrdy (2011) both emphasize the adaptive value of being able to mindread as an infant who is cooperatively bred by parents and alloparents. But this ability develops later in ontogeny than does joint attention: numerous studies have shown that the development of a theory of mind lags several years behind the capacity for joint attention (e.g., Moll et al. 2022). The cooperative breeding story thus does not seem to explain the emergence of object-directed joint attention, specifically. It also is not obvious why it should be relevant to be able to mindread to secure others’ affection and care as a cooperatively bred infant. More essential traits than both triadic joint attention and mindreading (and therefore more suitable explananda for the cooperative breeding hypothesis) seem to be infants’ “babyness” features (“Kindchenschema”) described by Lorenz (1943) or their capacity to draw other agents to them by cooing, smiling, and holding eye-to-eye contact in primary intersubjectivity.

In response to Hawkes’ (2014) critique, Tomasello & Gonzalez-Cabrera (2017) proffered a composite account of the evolution of shared intentionality that incorporates ideas from both the interdependence and the cooperative breeding hypotheses. This composite account takes a perspective of evolutionary developmental biology by acknowledging that evolutionary changes are, to a large extent, changes to a species’ ontogenetic processes, including changes to the onset and pace with which traits emerge. The authors argue that adaptations for shared intentionality that were beneficial for cooperative breeding emerged in infancy, and adaptations that benefitted collaborative foraging arose in youth. Ultimately, however, infantile adaptations “migrated up” and juvenile ones “migrated down” the ontogenetic ladder because the skills that these adaptations supported proved useful at developmental stages other than those they originally addressed. Toddlers, for example, engage in cooperative play with shared goals and differentiated roles that resemble the division of labor of later hunting techniques, and juveniles and adults expand on the mindreading skills they have developed in infancy and toddlerhood.

What makes it difficult to agree or disagree with these origin stories is the lack of a good measure against which they could be scientifically evaluated. Plausibility seems to be their main constraint, but obviously, narratives can be plausible or believable without being true. (And there is so little established knowledge to clash with in evolutionary storytelling that the space for “wild guesses” is wide-open.) By no means do I discourage narrative construction about how we became a species with such an extraordinary sociality. But what is needed is a reflection on the fact that when we switch from human ontogeny, which can be and has been observed and studied time and time again, to (past) phylogeny, we seem to cross the boundary of scientific inquiry and enter the vast field of (scientifically informed) speculation.