Active and passive scene recognition across views
Introduction
Real-world object and scene recognition faces a fundamental problem: the retinal projection of the environment changes whenever the observer or objects in the environment move. Changes in the relative positions of the observer and objects alter the size and orientation of the retinal projection. Yet our visual system somehow finds stability in these changing images. Two distinct approaches to achieving stability across view changes have been proposed in the literature. The system may selectively encode features of the scene that are invariant to perspective changes and use those features in object and scene recognition. For example, we may represent the object-centered spatial relationships among the parts of an object. Alternatively, the system may employ transformation rules to compensate for changes in the retinal projection, thereby providing a common basis for comparing two different views. For example, we may mentally rotate an object until it is aligned with a previous representation, or we may interpolate between two or more views to recognize objects from new perspectives.
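To make the transformation account concrete, consider a toy implementation. The sketch below is our own illustration, not a model from this literature; the point-set encoding of a "view", the brute-force search over angles, and the tolerance are all assumptions. It treats a view as a set of 2D feature coordinates and recognizes a test view by searching for the rotation that best aligns it with a stored view, a crude analog of mental rotation:

```python
import numpy as np

def rotate(points, theta):
    """Rotate an (N, 2) array of feature coordinates by theta radians."""
    c, s = np.cos(theta), np.sin(theta)
    return points @ np.array([[c, -s], [s, c]]).T

def align_and_match(stored_view, test_view, tolerance=1e-3):
    """Crude analog of mental rotation: search for the planar rotation
    that best aligns the test view with the stored view, and report a
    match if the best residual falls below tolerance."""
    best_theta, best_error = 0.0, np.inf
    for theta in np.deg2rad(np.arange(360)):
        error = np.sum((rotate(test_view, theta) - stored_view) ** 2)
        if error < best_error:
            best_theta, best_error = theta, error
    return best_error < tolerance, best_theta

# Three feature points studied in one view, then tested 40 degrees away.
stored = np.array([[0.0, 0.0], [1.0, 0.0], [0.5, 1.0]])
test = rotate(stored, np.deg2rad(40))
matched, theta = align_and_match(stored, test)
print(matched, round(np.rad2deg(theta)))  # True 320 (i.e. -40 degrees)
```

On this account, the cost of recognition grows with the size of the transformation that must be undone, which is one way to read the monotonic latency effects discussed below.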
Research on object recognition across views has provided some support for each of these possibilities. For example, Biederman and colleagues (Ellis et al., 1989; Biederman and Cooper, 1991, 1992; Cooper et al., 1992; see also Bartram, 1976) used a priming paradigm and measured the response latency to name line drawings of familiar objects. In their studies, the amount of priming was unaffected by changes in the retinal size of the object from study to test (scaling invariance). Furthermore, naming latency was impervious to changes in the object's position in the visual field and in its orientation in depth. Biederman and Gerhardstein (1993) showed similar orientation invariance when observers were asked to match individual shapes (geons), name familiar objects, and classify unfamiliar objects.
In contrast, many other studies suggest that object recognition performance is view-dependent; recognition accuracy and latency suffer as the test views deviate from the studied view (e.g. Shepard and Metzler, 1971; Shepard and Cooper, 1982; Rock et al., 1989). With wire-frame or blob-like objects in same-different judgment tasks (Bülthoff and Edelman, 1992; Tarr, 1995; Tarr et al., 1997), subjects typically show fast, accurate recognition for test views within a small distance of the studied view and impaired performance for novel views. Furthermore, the impairment is systematically related to the magnitude of the difference between studied and tested views, particularly for changes in the object's orientation in depth. The greater the rotation in depth away from the studied view, the longer the response latency (see also Tarr and Pinker, 1989). Such findings suggest that object representations are viewer-centered.
Another critical piece of evidence in support of viewer-centered representations is that when two or more views of the same object are provided at study, subjects subsequently generalize to intermediate views but not to other novel views (Bülthoff and Edelman, 1992; Kourtzi and Shiffrar, 1997). A number of models of object recognition have attempted to account for this finding by positing mechanisms that operate on viewer-centered representations. For example, linear combinations of 2D views (Ullman and Basri, 1991) and view approximation (Poggio and Edelman, 1990; Vetter et al., 1995) are both consistent with these data. However, in order to interpolate between two or more views, the initial views must first be linked to the same object. That is, subjects must recognize that the same object is being viewed in the first and second studied views even though those views differ. It is unclear from these models how this initial matching is accomplished, particularly if the views are relatively far apart and the object is not symmetrical (see Vetter and Poggio, 1994). Although these models may not fully account for the nature of object recognition for novel views, the empirical data seem to support the claim that representations of individual objects are view-dependent.
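The linear-combination idea itself is compact enough to state in a few lines. Under orthographic projection, the image coordinates of a rigid object's features in any view are linear combinations of the object's 3D coordinate vectors, so a novel view (generically) lies in the span of the coordinate vectors of two stored views; recognition then reduces to a least-squares fit and a residual check. The sketch below is our own illustration of that geometry, with an arbitrary random "object" and an arbitrary acceptance threshold, not Ullman and Basri's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def view(points3d, yaw, pitch=0.0):
    """Orthographic image of (N, 3) points after a rotation in depth."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    r_yaw = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    r_pitch = np.array([[1, 0, 0], [0, cp, -sp], [0, sp, cp]])
    return (points3d @ (r_pitch @ r_yaw).T)[:, :2]  # drop depth

object3d = rng.normal(size=(8, 3))  # a rigid "wire-frame" object

# Two stored views of the object, 40 degrees apart in depth.
v1, v2 = view(object3d, 0.0), view(object3d, np.deg2rad(40))
basis = np.column_stack([v1[:, 0], v1[:, 1], v2[:, 0], v2[:, 1]])

def matches(test_view, threshold=1e-6):
    """Accept a view if its coordinate vectors lie in the span of the
    stored views' coordinate vectors (i.e. the residual is ~zero)."""
    coeffs, *_ = np.linalg.lstsq(basis, test_view, rcond=None)
    return np.linalg.norm(basis @ coeffs - test_view) < threshold

# A genuinely novel view of the same object is accepted...
print(matches(view(object3d, np.deg2rad(20), np.deg2rad(10))))  # True
# ...while an image of a different object is rejected (almost surely).
print(matches(rng.normal(size=(8, 2))))                          # False
```

Note that the sketch presupposes exactly what the text above questions: the two stored views arrive already labeled as views of the same object.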
Both view-independent and view-dependent models of object recognition seem to capture some aspects of how the visual system accommodates view changes. For example, when the learning period is relatively long and the object is relatively complicated and difficult to name, recognition may rely on viewer-centered representations. On the other hand, when objects are made of distinct parts whose spatial relationships can be coded easily, and when the task concerns more abstract knowledge such as naming or classification, recognition may rely on view-independent representations. Nevertheless, studies comparing these models typically test recognition of isolated objects, ignoring extra-retinal information that is available in real-world object recognition. Thus, neither model is likely to explain all aspects of object representation.
Recognition of object arrays
Recently, several laboratories have begun to consider the recognition of more complex, naturalistic displays (e.g. spatial layouts of objects) across views. Spatial layout representations are important for a number of reasons. First, most real-world object recognition occurs in the context of other objects rather than in isolation. Thus, it seems reasonable to study spatial layout representations to gain a clearer picture of the sorts of representations that carry over from one view to the next.
A hint from spatial reasoning studies
Although studies of spatial layout recognition are closer to real-world recognition, most have neglected an important source of information that may be central to real-world object and scene recognition. In real environments, observers have many sources of information available in addition to the retinal projection of the scene. For example, they have visual, vestibular, and proprioceptive information about their own movements. Such extra-retinal information may specify the magnitude of a change in the observer's viewing position.
Scene recognition in the real world
Despite evidence that imagined observer and display rotations lead to differences in performance, only recently has work in object and scene recognition considered this difference. Studies of object recognition have relied exclusively on display rotations to study view changes. This neglect of observer movement can be traced to the assumption that equivalent retinal projection changes should produce equivalent mental transformations of the visual representation. Because the retinal projection changes identically whether the observer moves around the display or the display itself rotates, the two manipulations were assumed to be interchangeable.
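The geometric premise, at least, is easy to verify: for the retinal image, moving the observer around a display by some angle is equivalent to rotating the display by the same angle in the opposite direction. A minimal check (the layout coordinates, viewing distance, and frame conventions below are arbitrary choices of ours):

```python
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def viewer_coords(layout, observer_angle, distance=2.0):
    """Table-top points expressed in the frame of an observer standing
    on a circle around the table and facing its center."""
    obs = distance * np.array([np.cos(observer_angle), np.sin(observer_angle)])
    forward = -obs / distance                    # unit vector toward the table
    right = np.array([forward[1], -forward[0]])  # 90 degrees clockwise
    rel = layout - obs
    return np.column_stack([rel @ right, rel @ forward])

layout = np.array([[0.3, 0.1], [-0.2, 0.4], [0.0, -0.3]])  # objects on the table
theta = np.deg2rad(40)

# (a) The observer walks 40 degrees around the table; the layout is unchanged.
moved_observer = viewer_coords(layout, observer_angle=theta)
# (b) The observer stays put; the table rotates 40 degrees the other way.
rotated_table = viewer_coords(layout @ rot(-theta).T, observer_angle=0.0)

print(np.allclose(moved_observer, rotated_table))  # True: identical projections
```

The two manipulations differ only in the extra-retinal signals that accompany them, which is precisely the variable the experiments below exploit.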
Mechanisms of updating
Studies of navigation have shown that extra-retinal information can be used in updating one's own position. Spatial representations of position and orientation rely on vestibular signals (e.g. Israel et al., 1996), proprioceptive and kinesthetic cues (e.g. Loomis et al., 1993; Berthoz et al., 1995), optical flow (Ronacher and Wehner, 1995; Srinivasan et al., 1996), magnetic fields (Frier et al., 1996), and energy expenditure (Kirchner and Braun, 1994). By using one or more of these sources of information, navigators can update their position and orientation as they move, even without visual landmarks.
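The computational core of such updating is path integration (dead reckoning): self-motion signals are accumulated over time to maintain an estimate of position and heading with no reference to landmarks. A minimal sketch, assuming a simplified turn-then-step encoding of the movement signals:

```python
import numpy as np

def path_integrate(movements, start=(0.0, 0.0), heading=0.0):
    """Dead reckoning: update position and heading from self-motion
    signals alone (e.g. vestibular turn estimates plus proprioceptive
    or energy-based distance estimates), with no visual landmarks."""
    x, y = start
    for turn, distance in movements:  # each step: rotate, then advance
        heading += turn
        x += distance * np.cos(heading)
        y += distance * np.sin(heading)
    return (x, y), heading

# Walk a square: four 1-m legs with 90-degree left turns in between.
square = [(0.0, 1.0)] + [(np.pi / 2, 1.0)] * 3
position, final_heading = path_integrate(square)
print(np.round(position, 6))  # ~[0, 0]: the integrated path returns home
```

In real navigators each signal is noisy and the error accumulates, but the same machinery could in principle update a viewer-centered scene representation during the observer movements studied below.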
Experiment 1
This experiment served as a replication of earlier work comparing orientation and viewpoint changes (Simons and Wang, 1998), and tested the possibility that the availability of additional visual information would allow updating during orientation changes. Observers viewed layouts of real objects on a rotating table and were asked to detect changes to the position of one of the objects. We examined performance on this task across both shifts in the observer's viewing position and rotations of the table itself.
Experiment 2: Method
The apparatus was the same as in Experiment 1. Eleven undergraduates participated in the study in exchange for $7 compensation. Unlike Experiment 1, observers remained at the same viewing position for all 40 trials of this experiment. On each trial, they viewed the array for 3 s (Study period) and then lowered the curtain. During the 7 s delay interval, the table rotated by 40 degrees. For half of the trials, the experimenter rotated the table (as in Experiment 1), and for the other half, the observers rotated the table themselves.
Experiment 3
In this experiment, observers sat in a wheeled chair and were rolled by an experimenter from the Study position to the Test position. If updating of the viewer-centered representation requires active control over the viewpoint change, observers should be less accurate when they are passively moved. By comparing performance in this experiment to the corresponding active-movement condition in Experiment 1, we can assess the effect of active movement on the updating process.
General discussion
When observers remain in the same position throughout a trial, they are better able to detect changes when they receive the same view at study and test. In striking contrast, when observers move to a novel viewing position during a trial, they detect changes more effectively when they receive the corresponding novel view than the studied view. That is, they are better able to detect changes when the orientation of the table is constant throughout a trial, even if that means they will experience a different view of the layout at test.
Acknowledgements
The authors contributed equally to this research and authorship order was determined arbitrarily. Thanks to Daniel Tristan and Chris Russell for help collecting the data and to M.J. Wraga for comments on an earlier version of the paper. Some of this research was presented at ARVO 1998.
References
- et al. Viewer- and object-centered mental explorations of an imagined environment are not equivalent. Cognitive Brain Research (1997).
- Biederman, I., and Cooper, E. E. Object recognition and laterality: null effects. Neuropsychologia (1991).
- et al. Effect of context and efference copy on visual straight ahead. Vision Research (1989).
- et al. Ocular proprioception and efference copy in registering visual direction. Vision Research (1991).
- Multiple sources of outflow in processing spatial information. Acta Psychologica (1986).
- et al. Acquisition and integration of route knowledge in an unfamiliar neighborhood. Journal of Environmental Psychology (1990).
- Huttenlocher, J., and Presson, C. C. Mental rotation and the perspective problem. Cognitive Psychology (1973).
- Huttenlocher, J., and Presson, C. C. The coding and transformation of spatial information. Cognitive Psychology (1979).
- Kirchner, W. H., and Braun, U. Dancing honey bees indicate the location of food sources using path integration rather than cognitive maps. Animal Behaviour (1994).
- et al. Spatial representations of young children: the role of self- versus adult-directed movement and viewing. Journal of Experimental Child Psychology (1983).