
Cognition

Volume 108, Issue 1, July 2008, Pages 155-184

Does language guide event perception? Evidence from eye movements

https://doi.org/10.1016/j.cognition.2008.02.007

Abstract

Languages differ in how they encode motion. When describing bounded motion, English speakers typically use verbs that convey information about manner (e.g., slide, skip, walk) rather than path (e.g., approach, ascend), whereas Greek speakers do the opposite. We investigated whether this strong cross-language difference influences how people allocate attention during motion perception. We compared eye movements from Greek and English speakers as they viewed motion events while (a) preparing verbal descriptions or (b) memorizing the events. During the verbal description task, speakers’ eyes rapidly focused on the event components typically encoded in their native language, generating significant cross-language differences even during the first second of motion onset. However, when freely inspecting ongoing events, as in the memorization task, people allocated attention similarly regardless of the language they speak. Differences between language groups arose only after the motion stopped, such that participants spontaneously studied those aspects of the scene that their language does not routinely encode in verbs. These findings offer a novel perspective on the relation between language and perceptual/cognitive processes. They indicate that attention allocation during event perception is not affected by the perceiver’s native language; effects of language arise only when linguistic forms are recruited to achieve the task, such as when committing facts to memory.

Introduction

How do humans talk about the visual world? In an obvious sense, what we talk about is limited by constraints on how we see the world, including basic biases affecting how we conceptualize objects and events. Most theories of cognition and language assume that core aspects of the human perceptual and cognitive machinery are universal: given innate maturational properties of the human brain and typical experiential input, our perception and conception of objects and events are expected to be largely the same across individuals regardless of the language (or languages) learned during childhood. Under this view, these core systems generate nonlinguistic event and object representations that are shared by members of different linguistic communities and form the starting point for the generation of event and object descriptions in language (Gleitman and Papafragou, 2005, Jackendoff, 1996, Levelt, 1989, Miller and Johnson-Laird, 1976, Pinker, 1989).

Despite the broad appeal of this position, the transition from event conceptualization (the nonlinguistic apprehension of the main aspects of an event) to sentence planning (the mobilization of structural/lexical resources for event encoding) to speech execution has remained a mystery, largely because the particulars of nonlinguistic event representations have been notoriously hard to specify (Bock, Irwin, & Davidson, 2004; cf. Jackendoff, 1996, Lashley, 1951, Paul, 1970, Wundt, 1970).1 It is only recently that researchers have begun to gain an experimental foothold in exploring the relationship between event apprehension and event description, with these findings revealing a surprisingly tight temporal coupling between the two processes. In the first experiment to explore the temporal interface between language production and event comprehension, Griffin and Bock (2000) recorded speakers’ direction of gaze as they visually inspected and described static line drawings of simple actions (e.g., a picture of a girl spraying a boy with a garden hose) that could be described with either an active or a passive sentence. Analysis of the eye movements in relation to active/passive linguistic choices led to the conclusion that there exists an initial rapid event/gist extraction stage (event apprehension) that is temporally dissociable from any linguistic planning stage. However, further eye-tracking studies using picture description tasks have shown that these apprehension and linguistic formulation processes overlap temporally to a considerable extent: initial shifts of attention to event participants predict which participant will be mentioned first in the sentence (Gleitman, January, Nappa, & Trueswell, 2007). Taken together, these and related findings show that preparing to speak has rapid differential effects on how people allocate visual attention to components of a scene: if people need to talk about what they see, they immediately focus on aspects of scenes which are relevant for purposes of sentence planning (for recent reviews, see Altmann and Kamide, 2004, Bock et al., 2004, Griffin, 2001, Meyer and Lethaus, 2004).

An important (but little studied) aspect of the interface between event apprehension and language production lies with the fact that there are considerable cross-linguistic differences in how fundamental event components are segmented and packaged into sentences (e.g., Talmy, 1985, Talmy, 1991). This cross-linguistic variation raises two interrelated questions. First, how do speakers of different languages manage to attend to different aspects of the visual world and integrate them into linguistic structures as they plan their speech? Given the within-language evidence sketched above, it is reasonable to expect that language-specific semantic and syntactic requirements create different pressures on the online allocation of attention as speakers of different languages plan their speech. Accordingly, some commentators have proposed that such language-specific demands on the formulation of messages become automatized (at least in adult speakers) and shape the preparation of utterances even before the activation of specific lexical items (Bock, 1995, Levelt, 1989). This idea, otherwise known as ‘thinking for speaking’ (Slobin, 1996, Slobin, 2003), can be summarized as follows: Languages differ in a number of dimensions, most importantly in the kind of semantic and syntactic distinctions they encode (e.g., morphological markers of number, gender, tense, causality, and so on). Mature (i.e., adult) speakers know by experience whether their language requires such categories or not, and select the appropriate information in building linguistic messages. Thus language-specific requirements on semantic structure become represented in the procedural knowledge base of the mechanism responsible for formulating event representations for verbal communication (see Levelt, 1989). This view predicts that there should be early effects on attention allocation in speakers of different languages as they prepare to describe an event.

A second issue raised by cross-linguistic variation is whether core aspects of event apprehension itself could be substantially shaped by the properties of one’s native language even in contexts where no explicit communication is involved. According to this stronger hypothesis, the rapid mobilization of linguistic resources for language production (‘thinking for speaking’) may affect other, nonlinguistic processes which interface with language, such as perception, attention and memory. For instance, of the many ways of construing a scene, those that are relevant for linguistic encoding may become ‘privileged’ both in terms of online attention allocation and for later memory and categorization (Whorf, 1956; and for recent incarnations, Gumperz and Levinson, 1996, Levinson, 1996). Unlike the universalist view outlined earlier, according to which conceptual/perceptual and linguistic representations are distinct and dissociable, this linguistic relativity position assumes that these levels of representation are essentially isomorphic (Pederson et al., 1998). Language, by virtue of being continuously used throughout one’s life, can thus come to affect an individual’s ‘habitual patterns of thought’ (Whorf, 1956) by channeling the individual’s attention towards certain distinctions and away from others. Ultimately, ‘[t]he need to output language coded in specific semantic parameters can force a deep-seated specialization of mind’ (Levinson, 2003, p. 291) – and linguistically encoded semantic-conceptual distinctions might create striking (and permanent) cognitive asymmetries in speakers of different languages (for further discussion, see Levinson, 1996, Majid et al., 2004; cf. also Boroditsky, 2001, Lucy, 1992, McDonough et al., 2003).

These cross-linguistic issues raise two different ways in which language might guide attention during event perception – the first more modest, the second deeper and more controversial.2 Nevertheless, both these hypotheses have proven hard to evaluate: Very little is known yet about how sentence generation interacts with event apprehension in speakers of different languages, and the workings of ‘thinking for speaking’ have not been demonstrated cross-linguistically. Furthermore, cross-linguistic work on the linguistic relativity hypothesis has focused on off-line cognitive tasks and has not adequately addressed potential effects of language-specific structures on on-line event apprehension (see the reviews in Bowerman and Levinson, 2001, Gentner and Goldin-Meadow, 2003). In both cases, since event apprehension happens very quickly (Griffin and Bock, 2000, Dobel et al., 2007), studying how event perception and language make contact cross-linguistically requires methodologies that track how perceivers extract linguistic and cognitive representations from dynamic events as these events unfold rapidly over time.

Here we pursue a novel approach to cross-linguistic event conceptualization using an eye-tracking paradigm. Specifically, we investigate the language–thought relationship by monitoring eye movements to event components by speakers of different languages. Since eye fixations approximate the allocation of attention under normal viewing conditions, people’s eye movements during event apprehension offer a unique window onto both the nature of the underlying event representations and the link between those representations and language. Our approach is twofold: first, we look at how speakers of different languages visually inspect dynamic events as they prepare to describe them. Under these conditions, cross-linguistic differences should impact the online assembly of event representations if ‘thinking for speaking’ affects sentence generation cross-linguistically. Second, we study shifts in attention to event components during a nonlinguistic (memory) task to test whether nonlinguistic processes of event apprehension differ in speakers of different languages in accordance with their linguistic-encoding biases.

Our empirical focus is on events of motion (e.g., sailing, flying). We chose motion events for two main reasons. First, motion scenes are concrete, readily observable and easily manipulated and tested; second, the expression of motion is characterized by considerable cross-linguistic variability. This variability involves asymmetries not only in what languages choose to encode (e.g., semantic information about the components of a motion event), but mostly in how languages encode this information (e.g., in syntactic configurations vs. the lexicon; in verbs vs. adverbial modifiers) and in how often the information is used (whether it is an obligatory grammatical category, a robust typological tendency, or an infrequent occurrence). Our focus is on a well-documented difference in both how and how often motion paths and manners are expressed across languages (Section 1.1) that has been linked to potential effects on nonlinguistic event processing (Section 1.2).

All languages typically encode the path, or trajectory (e.g., reaching/leaving a point), and the manner of motion (e.g., skating, flying), but differ systematically in the way path and manner are conflated inside sentences. Manner languages (e.g., English, German, Russian, and Mandarin Chinese) typically code manner in the verb (cf. English skip, run, hop, jog), and path in a variety of other devices such as particles (out), adpositions (into the room), or verb prefixes (e.g., German raus- ‘out’; cf. raus-rennen ‘run out’). Path languages (e.g., Modern Greek, Romance, Turkish, Japanese, and Hebrew) typically code path in the verb (cf. Greek vjeno ‘exit’, beno ‘enter’, ftano ‘arrive/reach’, aneveno ‘ascend’, diashizo ‘cross’), and manner in gerunds, adpositions, or adverbials (trehontas ‘running’, me ta podia ‘on foot’, grigora ‘quickly’).3 The Manner/Path distinction is not meant to imply that the relevant languages lack certain kinds of verbs altogether (in fact, there is evidence that all languages possess manner verbs, though not necessarily path verbs; Beavers et al., 2004). But the most characteristic (i.e., colloquial, frequent, and pervasive) way of describing motion in these languages involves manner and path verbs, respectively.

The Manner/Path asymmetry in verb use is made more salient by the following compositionality restriction: while in Manner languages manner verbs seem to compose freely with different kinds of path modifiers, in many Path languages manner verbs, with some exceptions, do not appear with resultative phrases to denote bounded, culminated motion (e.g., Aske, 1989, Cummins, 1998, Slobin and Hoiting, 1994, Stringer, 2003). Thus the compact way of expressing motion in the English example in (1) is not generally available in Greek; the PP in (2) can only have a locative (not a resultative/directional) interpretation:

The roots of this constraint lie in the morphosyntax of telicity: Greek disallows nonverbal resultatives, or ‘secondary predication’, with only a few exceptions (for discussion, see Folli and Ramchand, 2001, Horrocks and Stavrou, 2007, Napoli, 1992, Snyder, 2001, Washio, 1997). In order to convey the bounded event in (1), Greek needs to either switch to a path verb and optionally encode manner in a modifier such as a present participle as in (3a), or break down the event into two separate clauses with a path and manner verb as in (3b). Since both options in (3) are dispreferred/‘marked’, the default way of encoding this event would be (3a) without the manner participle:

In effect, then, the resultative frame constraint intensifies the Manner/Path asymmetry in the use of motion verbs across Manner and Path languages. Notice that this constraint does not affect the use of manner of motion verbs for descriptions of simple, unbounded events (The bird flew/To puli petakse are fine in English and Greek, respectively).

The typological differences outlined in the previous section affect how speakers habitually talk about motion. These differences are already in place as early as 3 years of age in Path vs. Manner languages (Allen et al., 2007, Slobin, 1996, Slobin, 2003; cf. Naigles et al., 1998, Papafragou et al., 2002, Papafragou et al., 2006) and affect conjectures about the meanings of novel verbs in both children and adults (Naigles and Terrazas, 1998, Papafragou and Selimis, 2007). These lexicalization patterns have been hypothesized to be a powerful source of ‘thinking for speaking’ effects, since they represent different ways in which languages allocate their grammatical/lexical resources to a common semantic domain (Slobin, 1996). On this view, the specialization of the event conceptualization phase for language-specific properties can be triggered not only by obligatory grammatical categories (Levelt, 1989) but also by systematic typological patterns in mapping semantics onto lexico-syntactic classes.

There are independent reasons to assume that verb encoding biases, in particular, might exert especially powerful pressures on message preparation: verbs determine the argument structure of a sentence, so they can be particularly central to the process of preparing event descriptions. Thus, even though the same kind of semantic information is conveyed in the English example in (1) and its Greek equivalent that includes a manner modifier in (3a) (i.e., both languages refer to path and manner), the distribution of information across morphosyntactic units should drive differences in the early allocation of attention to manner when speakers of the two languages prepare to describe the relevant motion event (with English speakers attending to manner earlier for purposes of selecting an appropriate verb). Notice that the crucial difference lies not only in the order of mention of path and manner information, but also in the placement of the information inside or outside the sentential verb.

Could such lexicalization preferences ‘percolate’ into conceptual event encoding more broadly? Some commentators have taken this stronger position. For instance, Slobin has proposed that manner constitutes a “salient and differentiated conceptual field” for speakers of Manner languages compared to speakers of Path languages, with potential implications for how manner of motion is perceived/attended to online and remembered (Slobin, 2004; cf. Bowerman & Choi, 2003). In a related context, Gentner and Boroditsky (2001, p. 247) have noted that “verbs… – including those concerned with spatial relations – provide framing structures for the encoding of events and experience; hence a linguistic effect on these categories could reasonably be expected to have cognitive consequences”. These views fit with a more general expectation that linguistic encoding categories are imposed upon incoming perceptual spatial information so that speakers might later describe spatial configurations using the resources available in their native language (Gumperz and Levinson, 1996, Levinson, 1996).

Some earlier studies have cast doubt on these relativistic claims. Papafragou et al. (2002) examined motion event encoding in English- and Greek-speaking adults and young children through elicited production tasks and compared these results to memory for path/manner or use of path/manner in event categorization. Even though speakers of the two languages exhibited an asymmetry in encoding manner and path information in their verbal descriptions, they did not differ from each other in terms of classification or memory for path and manner (cf. also Papafragou, Massey, & Gleitman, 2006). Similar results have been obtained for Spanish vs. English (see Gennari, Sloman, Malt, & Fitch, 2002; cf. Munnich, Landau, & Dosher, 2001). However, one might object that tasks and measures employed in these studies bring together many known subcomponents of cognition (perception, memory encoding, memory recall) and thus may fail to adequately identify possible linguistically-influenced processes. According to the attention-based hypothesis we are considering, linguistic encoding preferences might affect the earliest moments of event perception. Specifically, the online allocation of attention while perceiving an event might be shaped by those aspects of the event that are typically encoded in the observer’s native language, especially in verbs. This preference might emerge irrespective of other (possibly language-independent) biases in processing rapidly unfolding events. It seems crucial therefore to explore the time course of spatial attention allocation during event perception in both linguistic and nonlinguistic tasks.

To test whether cross-linguistic differences affect the way language users direct attention to manner of motion, we recorded eye movements from native Greek and English speakers as they watched unfolding motion events (both bounded and unbounded) while performing either a linguistic or a nonlinguistic task. Specifically, trials consisted of watching a three-second video of an unfolding event, which then froze on the last frame of the video. At that time, participants had to either describe the event or study the image further for a later memory test. It is well known that the allocation of attention during scene perception depends upon which aspects of the scene are deemed important for achieving the task (e.g., Triesch, Ballard, Hayhoe, & Sullivan, 2003). Given this, we carefully constructed our animated visual stimuli so that linguistically relevant manner and path information could be easily defined as distinct regions spatially separated from each other. If cross-linguistic differences affect the allocation of attention during event perception generally, we would expect English speakers to be more likely than Greek speakers to focus on manner information early and consistently in both the linguistic and the nonlinguistic task. But if event perception is independent of language, we should see differences between English and Greek speakers only in the linguistic task (and only in bounded events, where the two languages differ in terms of the information typically encoded in the verb): such differences in attention allocation should be consistent with preparatory processes for sentence generation (‘thinking for speaking’). During tasks that do not require description of the event (as in the nonlinguistic task), Greek and English speakers should behave similarly in terms of attention allocation.
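To make the logic of this region-of-interest design concrete, the sketch below (in Python) shows one way gaze data of this kind could be summarized. It is a minimal, hypothetical example rather than the study’s actual analysis code: it assumes gaze is available as time-stamped screen coordinates and that the manner component (the moving figure) and the path component (the path endpoint) occupy separable rectangular regions, whose names and coordinates here are placeholders.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class Fixation:
    t_ms: float  # time relative to motion onset (milliseconds)
    x: float     # gaze x coordinate (screen pixels)
    y: float     # gaze y coordinate (screen pixels)

# Rectangular regions of interest as (x_min, y_min, x_max, y_max).
# Names and coordinates are illustrative placeholders, not the actual
# layout of the animated stimuli.
ROIS: Dict[str, Tuple[float, float, float, float]] = {
    "manner": (100, 200, 300, 400),  # region containing the moving figure
    "path":   (500, 350, 700, 450),  # region containing the path endpoint
}

def in_roi(sample: Fixation, roi: Tuple[float, float, float, float]) -> bool:
    """True if a gaze sample falls inside the rectangular region."""
    x_min, y_min, x_max, y_max = roi
    return x_min <= sample.x <= x_max and y_min <= sample.y <= y_max

def roi_proportions(samples: List[Fixation],
                    window_ms: Tuple[float, float] = (0.0, 1000.0)) -> Dict[str, float]:
    """Proportion of gaze samples in each region within a time window
    (by default, the first second after motion onset)."""
    start, end = window_ms
    windowed = [s for s in samples if start <= s.t_ms < end]
    if not windowed:
        return {name: 0.0 for name in ROIS}
    return {name: sum(in_roi(s, roi) for s in windowed) / len(windowed)
            for name, roi in ROIS.items()}

if __name__ == "__main__":
    # A fabricated viewer who tracks the moving figure during the first second:
    demo = [Fixation(t, 150 + 0.05 * t, 250.0) for t in range(0, 1000, 20)]
    print(roi_proportions(demo))  # e.g., {'manner': 1.0, 'path': 0.0}
```

Comparing such per-participant proportions across language groups, tasks, and event types within an early time window corresponds to the logic of the predictions above, though the study’s actual stimuli, regions, and statistical analyses are not reproduced here.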

Finally, it should be noted that a third viable outcome exists. During the dynamic unfolding of each event, it is possible that no differences between English and Greek speakers will be observed, even in the linguistic task – i.e., even when participants know they will have to describe the movie when it stops. The characteristic effects of sentence planning on eye movement patterns might only occur just prior to speech production. Such an outcome would be inconsistent with the thinking-for-speaking hypothesis and would be most consistent with a strong universalist view. Prior studies of event description have found very early effects of linguistic preparation on eye movements (Gleitman et al., 2007, Griffin and Bock, 2000), even when participants had to wait a fixed amount of time before describing the event (Griffin & Bock, 2000). However, these past studies used static pictures of events rather than dynamic displays. It is possible that the demands of interpreting an unfolding event in real time preclude the attentional prioritization of language-relevant event components, i.e., thinking for speaking.

Participants

Seventeen native English speakers and 17 native Greek speakers participated in the experiment. The English speakers were Psychology undergraduates at the University of Pennsylvania and received course credit for participation. The Greek speakers were students or junior faculty at various universities in the Philadelphia area and were paid $8 for participation. Data from three additional participants showed severe track loss in the eye-tracking records and were discarded.

Stimuli

Test items consisted of …

General discussion

The experiment reported in this paper introduced and explored a novel tool for investigating the language–thought interface by cross-linguistically analyzing participants’ eye movements as they inspected ongoing dynamic events. Our findings support the conclusion that preparing for language production has rapid differential effects on how people allocate visual attention to components of a scene: if people need to talk about what they see, they are likely to shift their focus of attention …

Acknowledgments

This work was supported by an International Research Award and a UDRF Award from the University of Delaware to A.P., and a grant from the National Institutes of Health (1-R01-HD37507) to J.T.

References (66)

  • E. Munnich et al.

    Spatial language and spatial representation: A cross-linguistic comparison

    Cognition

    (2001)
  • A. Papafragou et al.

    Shake, rattle, ‘n’ roll: The representation of motion in thought and language

    Cognition

    (2002)
  • A. Papafragou et al.

    When English proposes what Greek presupposes: The linguistic encoding of motion events

    Cognition

    (2006)
  • J. Trueswell et al.

    The kindergarten-path effect

    Cognition

    (1999)
  • G. Altmann et al.

    Now you see it, now you don’t: Mediating the mapping between language and the visual world

  • Aske, J. (1989). Path predicates in English and Spanish: A closer look. In Proceedings of the fifteenth annual meeting...
  • A. Baddeley

    Working memory and language: An overview

    Journal of Communication Disorders

    (2003)
  • Beavers, J., Levin, B., & Wei, T. (2004). A morphosyntactic basis for variation in the encoding of motion events. Paper...
  • J.K. Bock

    Sentence production: From mind to mouth

  • K. Bock et al.

    Putting first things first

  • M. Bowerman et al.

    Space under construction: Language-specific spatial categorization in first language acquisition

  • R. Conrad

    Acoustic confusion in immediate memory

    British Journal of Psychology

    (1964)
  • S. Cummins

    Le mouvement directionnel dans une perspective d’analyse monosémique

    Langues et Linguistique

    (1998)
  • G. Dell et al.

    Lexical access in aphasic and nonaphasic speakers

    Psychological Review

    (1997)
  • R. Folli et al.

    Getting results: Motion constructions in Italian and Scottish Gaelic

  • M. Garrett

    Processes in language production

  • D. Gentner et al.

    Individuation, relativity and early word learning

  • L. Gleitman et al.

    Language and thought

  • Z. Griffin et al.

    What the eyes say about speaking

    Psychological Science

    (2000)
  • Gumperz, J., & Levinson, S. (Eds.). (1996). Rethinking linguistic relativity. New York: Cambridge University...
  • R. Jackendoff

    The architecture of the linguistic–spatial interface
