Cognition

Volume 138, May 2015, Pages 132-147

Visual, haptic and bimodal scene perception: Evidence for a unitary representation

https://doi.org/10.1016/j.cognition.2015.01.010

Highlights

  • Visual, haptic and bimodal scene perception all led to boundary extension.

  • Cross-modal tests suggest a common multimodal scene representation.

  • Sensory and top-down inputs are organized within an amodal spatial framework.

Abstract

Participants studied seven meaningful scene-regions bordered by removable boundaries (30 s each). In Experiment 1 (N = 80), participants used visual or haptic exploration and then, minutes later, reconstructed boundary position using the same or the alternate modality. Participants in all groups shifted boundary placement outward (boundary extension), but visual study yielded the greater error. Critically, this modality-specific difference in boundary extension transferred without cost in the cross-modal conditions, suggesting a functionally unitary scene representation. In Experiment 2 (N = 20), bimodal study led to boundary extension that did not differ from haptic exploration alone, suggesting that bimodal spatial memory was constrained by the more “conservative” haptic modality. In Experiment 3 (N = 20), as in picture studies, boundary memory was tested 30 s after viewing each scene-region; as with pictures, boundary extension still occurred. Results suggest that scene representation is organized around an amodal spatial core that integrates bottom-up information from multiple modalities with top-down expectations about the surrounding world.

Introduction

Multiple sensory modalities provide the perceiver with rich information about the surrounding world. In spite of this, as in other areas of perception, scene perception has typically been studied through a modality-specific lens (usually vision; Intraub, 2012; O’Regan, 1992). Yet even when perception is limited to the visual modality alone, participants frequently remember seeing the continuation of the scene just beyond the boundaries of the view, in the absence of any corresponding sensory input (boundary extension; Intraub & Richardson, 1989). This can occur very rapidly, across intervals as brief as a saccadic eye movement (Dickinson & Intraub, 2008; Intraub & Dickinson, 2008). Boundary extension may be an adaptive error that facilitates integration of successive views of the world (Hubbard et al., 2010; Intraub, 2010, 2012). Indeed, research has shown that boundary extension can prime visual perception of upcoming layout when that layout is subsequently presented (e.g., Gottesman, 2011).

What leads to this spatial error? Intraub (2010, 2012) and Intraub and Dickinson (2008) suggested that the representation of a visual scene is not purely visual but is actually a multisource representation, in that it incorporates information from the sensory source (vision) as well as from top-down sources that place the studied view within a likely surrounding spatial context. Potential top-down sources include amodal continuation of the surface beyond the boundaries (Fantoni, Hilger, Gerbino, & Kellman, 2008), general scene knowledge based upon scene classification (Greene & Oliva, 2009), and object-to-context associations (Bar, 2004). The purpose of our research was to determine whether boundary extension following visual or haptic perception of the same scene-region is supported by a single multimodal scene representation or by two functionally independent modality-specific scene representations.

Boundary extension is a spatial error in which a swath of anticipated space just beyond the boundaries of the view is remembered as having been perceived. Neuroimaging and neuropsychological research have shown that boundary extension is associated with neural activation of brain regions thought to play important roles in spatial cognition: the hippocampus, parahippocampal cortex, and retrosplenial complex (Chadwick et al., 2013; Mullally et al., 2012; Park et al., 2007). The hippocampus has long been associated with spatial representation and navigation (Burgess, 2002; Maguire & Mullally, 2013; O’Keefe & Nadel, 1978). The parahippocampal cortex and retrosplenial complex have been associated with perception of spatial layout, and with the integration of local spaces within larger spatial contexts, respectively (Epstein, 2008). Recent research has shown that the parahippocampal cortex responds similarly to visual and haptic perception of layout (Wolbers, Klatzky, Loomis, Wutte, & Giudice, 2011; see also Epstein, 2011), underscoring the spatial rather than modality-centric role of this brain area.

It has been suggested that scene representation is fundamentally an act of spatial cognition (Dickinson & Intraub, 2008; Gagnier et al., 2013; Gagnier & Intraub, 2012; Intraub, 2010, 2012; Intraub & Dickinson, 2008). In their multisource model, Intraub and Dickinson (2008) and Intraub (2010, 2012) proposed that an amodal spatial structure organizes multiple sources of knowledge (bottom-up and top-down) into a coherent scene representation (see Maguire & Mullally, 2013, for a similar view from the perspective of hippocampal function). The idea is that the observer brings to any view of a scene a sense of surrounding space: the space “in front of”, “to the left and right”, “above”, “below” and “behind” the observer (see Bryant et al., 1992; Franklin & Tversky, 1990; Tversky, 2009). This provides the scaffolding that supports not only the bottom-up visual information but also the anticipated continuation of the scene beyond the boundaries of the view. This underlying spatial structure is similar to the “spatial image” proposed by Loomis, Klatzky, and Giudice (2013), in that it is a surrounding spatial representation (not limited to the frontal plane) and it is amodal. The only difference is that, unlike the “spatial image”, the spatial structure in the multisource model is conceptualized as a “standing framework” rather than one that develops in response to a stimulus in working memory.

Although most research on boundary extension has focused on picture memory, there is evidence that the same anticipatory spatial error occurs following visual perception of real scenes in near space (Hubbard et al., 2010; Intraub, 2002, 2010), and following haptic perception of the same scene regions (Intraub, 2004; Mullally et al., 2012). The multisource model provided the same explanation for visual and haptic boundary extension, but included no commitment as to whether they draw on a single scene representation or on distinct modality-specific representations. The evidence for boundary extension in 3D space was based on experiments in which meaningfully related objects were arranged on natural backgrounds (e.g., “kitchen scene”), bounded by a “window frame” to limit visual or haptic exploration.

In haptic studies, blindfolded participants explored the bounded regions right up to the edges of the display, and minutes later, after the boundaries were removed, they reconstructed boundary placement. They set the boundaries outward, including a greater expanse of space than had originally been included in the stimulus. This occurred in spite of the fact that there was always an object 2–3 cm from the boundary, forcing participants to squeeze their hands into a tightly constrained space. As in the case of vision (Gagnier et al., 2013), a seemingly clear marker of boundary placement did not prevent boundary extension. A comparison of boundary extension following visual or haptic exploration of the same regions showed that vision yielded the more expansive error (Intraub, 2004). This was the case whether visual boundary extension was compared to haptic boundary extension in sighted participants who were blindfolded for the experiment, or in a woman who had been deaf and blind since early life (a “haptic expert”).

Why might vision have yielded a greater anticipatory spatial error? Intraub (2004) speculated that such a difference, if reliable, might be related to the different characteristics and spatial scope of the two modalities. Vision is a distal modality with a small high-acuity foveal region (about 1° of visual angle) and a large low-acuity periphery; together these encompass a relatively large spatial area. In contrast, the haptic modality encompasses multiple high-acuity regions (the fingertips) and a relatively small periphery. In the case of vision, a greater amount of the visually imagined continuation of the view might be confusable with visual memory for the stimulus than in the case of haptic exploration. This explanation conforms to the notion of boundary extension as a source monitoring error (Intraub, 2010, 2012; Intraub et al., 2008; Intraub & Dickinson, 2008; Seamon et al., 2002). According to the source monitoring framework (Johnson, Hashtroudi, & Lindsay, 1993), as the similarity between representations drawn from two different sources (e.g., perception and imagination) increases, so too does the likelihood of source misattributions (as, for example, when a dream that is unusually high in detail is misattributed to perception). Other research on boundary extension has shown that factors that would be expected to affect the similarity between memory for the perceived region and memory for the imagined continuation of the view do indeed influence the size of the boundary error (Gagnier & Intraub, 2012; Intraub et al., 2008). An alternative explanation of the difference between the visual and haptic conditions, however, is that it is caused by different biases at test. For example, blindfolded participants may feel more constrained about how far they are comfortable reaching out their hands to designate boundary placement.

In sum, both visual and haptic exploration can result in boundary extension. However, the studies demonstrating this cannot provide insight into whether the scene representation supporting this error is a single multisource scene representation that includes multimodal input, or two separate, modality-specific multisource representations, with multisource referring to the combination of bottom-up and top-down sources of information. Before describing the rationale for our research, we will discuss other spatial tasks in which visual and haptic study have been compared, because they have direct bearing on the current research.

A critical aspect of spatial cognition is that an arrangement of objects in the world can be represented within a variety of reference frames (see Allen, 2004). A well-established observation is that after viewing a display of objects, participants tend to organize memory within an egocentric frame of reference (the objects with respect to the viewer), rather than an allocentric framework (the objects with respect to one another; Diwadkar & McNamara, 1997; Shelton & McNamara, 1997; Simons & Wang, 1998; Wang & Simons, 1999). Critical observations supporting this are that costs are incurred when participants either view or imagine the display from an alternate viewpoint (e.g., a viewpoint that is shifted 60° from the original position). In these cases, when the opportunity for spatial updating (e.g., as would occur if participants simply walked to the new location) is eliminated, the change in viewpoint reduced participants’ ability to remember the object arrangement correctly. Haptic exploration, unlike vision (with its much larger periphery), does not allow the observer to perceive all the objects at once. Instead, the observer must serially explore each object’s relation to the other objects, raising the possibility that an allocentric representation might be engaged (Newell, Woods, Mernagh, & Bülthoff, 2005). However, analogous research using the haptic modality yielded similar results (Newell et al., 2005; Yamamoto & Shelton, 2005): spatial memory under these conditions was organized within an egocentric frame of reference when the scene was perceived using haptic input.

In subsequent frame-of-reference research in the visual modality, Mou and McNamara (2002) demonstrated that if the display of objects is arranged so that it has an intrinsic structure (e.g., symmetry), participants adopt an allocentric frame of reference centered on that intrinsic structure rather than an egocentric reference frame. Yamamoto and Philbeck (2013) pointed out that the smaller scope of haptic exploration (compared with vision) might prevent participants from organizing memory around this intrinsic structure (which in haptics would require an object-by-object search). However, they reported that although the layout of the objects could not be perceived all at once, memory following haptic exploration mirrored that observed in vision: participants organized their memory around the intrinsic structure of the displays. There is also evidence that environmental information provided by haptics can affect the choice of reference frame in a visual task, supporting the idea of a common reference frame across modalities (Kelly, Avraamides, & Giudice, 2011). Thus, vision and haptics appear to share common biases in the choice of reference frame, and both may support computation of the anticipated continuation of the explored region, yielding boundary extension in memory. Other spatial tasks involving visual or haptic exploration of maps or scenes have likewise demonstrated common representational biases (e.g., Giudice et al., 2011; Pasqualotto et al., 2005). Intraub’s (2004) research adds to these commonalities by demonstrating that both modalities yield the same overinclusive anticipatory spatial bias in memory – boundary extension.

Bearing these similarities in spatial biases in mind, we will now return to the question of whether visual information and haptic information are stored in a functionally unitary mental structure or in two functionally distinct, modality-specific representations. This has been a core issue in the field of multisensory perception. Different tasks and different types of stimuli have led to a variety of conclusions about how the sensory systems interact. Some evidence supports a single (multisensory) representation, whereas other evidence, such as visual capture (Hay, Pick, & Ikeda, 1965) and auditory capture (Morein-Zamir, Soto-Faraco, & Kingstone, 2003), suggests modality-specific representations that in some cases yield conflicting information about stimuli. Models have been proposed to describe the different combination strategies and mechanisms used to integrate multisensory information into a coherent representation (Ernst & Bülthoff, 2004).

Multisensory studies have focused both on the temporal integration of independent objects (e.g., a visual stimulus and a sound) and on spatial integration (multiple modalities exploring the same object or display). The current research focuses on the spatial representation of meaningful scenes that are perceived visually, haptically (without vision), or bimodally. To achieve this, stimuli were meaningful, multi-object displays (e.g., a place setting; tools in a workman’s area) in near space directly in front of the participant (peripersonal space; Previc, 1998) that could be readily explored using either the visual or the haptic modality.

Newell et al. (2005) addressed the question of a unitary representation vs. modality-specific representations in memory for object arrays in peripersonal space by contrasting recognition memory (detecting that the positions of two of seven small wooden objects had been swapped) within vision and within haptics vs. across modalities. They provided examples of cases in which different spatial biases were observed in vision and haptics (e.g., the horizontal–vertical illusion; Avery & Day, 1969; Day & Avery, 1970) and argued that, given these differences, modality-specific representations might be formed in their experiment. They reasoned that this would result in a cost when recognition memory is tested across modalities, because the information in one modality-specific representation would need to be “translated” into the other. On the other hand, if memory for spatially arrayed objects is maintained in a functionally unitary spatial representation, then no cost would be expected: participants should be able to note differences in the positions of the objects irrespective of whether the modalities at study and test were the same or different. They tested participants’ ability to recognize changes in object position within modality or across modalities, and did so whether the table was in the same position or was shifted 60° while the participant’s view was blocked.

Newell et al. (2005) found no costs associated with cross-modal transfer as a function of viewpoint (same view vs. shifted view), suggesting that the same egocentric representation supported both visual and haptic memory (see also Kelly & Avraamides, 2011), but they did find a cost in recognition memory for the objects’ positions as a function of modality. Participants made fewer errors in the within-modality conditions than in the cross-modal conditions. Newell et al. argued that recoding spatial position from one representation to the other had incurred the cost. They argued that it was the spatial placement of the objects with respect to one another, rather than the specific details of how the objects themselves were remembered, that was driving the difference, although this distinction was not specifically tested. In conclusion, Newell et al. suggested that whereas the frame of reference is a unitary representation, specific spatial characteristics within that reference frame may be stored in different modality-specific representations.

We report three experiments in which spatial memory for the expanse of scene-regions composed of meaningfully related objects was examined following visual inspection, haptic exploration, or both simultaneously (bimodal exploration). In Experiment 1, similar to Newell et al. (2005), tests involved either the same modality or the alternate modality (testing cross-modal transfer). In Experiment 2, simultaneous bimodal exploration was used. In Experiment 3, we explored the possibility that boundary extension in peripersonal space requires navigation prior to testing. What is different about our research on spatial memory is that the focus is on false memory beyond the scope of the sensory input. Key questions explored across these three experiments were: (a) Is scene representation a unitary representation that incorporates information from multiple sensory sources (vision and haptics) along with related top-down information, or is information stored in separate modality-specific representations (one for vision and one for haptics), each including associated top-down information? (b) If a unitary representation is supported, is there a “blending” of inputs into a code that is devoid of sensory-specific characteristics (an amodal code), or does the mental representation retain qualities specifically tied to the individual modalities? And (c) can we observe boundary extension in memory for 3D scene-regions under conditions similar to those in which it has been observed in memory for 2D scenes (photographs)?

Section snippets

Experiment 1

Participants were assigned to one of four independent groups in which study and test were conducted within-modality (vision–vision, or haptic–haptic) or across-modalities (vision–haptic, or haptic–vision). If memory is stored in a functionally unitary representation, then any modality-specific differences in boundary extension (e.g., greater boundary extension for visual than haptic exploration; Intraub, 2004) should transfer (without cost) in the cross-modal test conditions. We should see an

Experiment 2

In our interactions with the world we obtain information from more than a single modality at a time. In Experiment 2, we sought to determine the effect of bimodal exploration on memory for boundary placement. One possibility is that exploring the regions using simultaneous visual and haptic exploration might eliminate boundary extension. Participants would not only see and feel the objects and background of each scene-region, but would watch their own hands squeezing through the small spaces

Experiment 3

Participants in Experiments 1 and 2, and participants in Intraub (2004), all studied a small set of scene-regions (6–7 regions) and then received the boundary memory test minutes later. They knew that memory would be tested, but not the exact nature of the test. In addition, they navigated from one scene to the next and moved through doorways (which they touched as part of the safety protocol of the experiment). In Experiment 3 we sought to determine if boundary extension in real space is similarly

General discussion

Boundary extension occurred in memory for bounded regions of peripersonal space following visual, haptic and bimodal study. The first question we addressed in this series of experiments was whether visual boundary extension and haptic boundary extension draw on a functionally unitary scene representation or on two separate modality-specific scene representations. Cross-modal tests in Experiment 1 suggested a common scene representation. Vision resulted in greater boundary extension than did

Visual, haptic and bimodal boundary extension

In natural scene perception, the perceiver is typically embedded within the scene he or she perceives. For example, in contrast to viewing a picture of a kitchen, one stands in a kitchen with appliances, cabinetry and kitchen furniture surrounding the perceiver. Scenes don’t exist solely in the frontal plane, as in a picture, but surround the perceiver. According to the multisource model, the mental representation of a scene is similarly structured in terms of surrounding space. The surrounding

Conclusion

After studying small regions of scenes in peripersonal space, participants remembered having explored beyond the boundaries of the view whether perception had been visual, haptic or both simultaneously. This suggests that irrespective of modality the representation of scene-regions is a multisource representation that reflects both bottom-up (sensory) and top-down sources of input. Our cross-modal and bimodal conditions suggest a functionally unitary representation in which specific

Acknowledgements

This research was supported in part by a grant from the National Institutes of Health (MH54688) to HI. The authors thank research assistants Terry Penn, Kyrianna Ruddy, and Caroline Murray for their valuable contributions in helping to run these experiments.

References (57)

  • S. Park et al. Beyond the edges of a view: Boundary extension in human scene-selective visual cortex. Neuron (2007).
  • T. Wolbers et al. Modality-independent coding of spatial layout in the human brain. Current Biology (2011).
  • N. Yamamoto et al. Intrinsic frames of reference in haptic spatial learning. Cognition (2013).
  • G.C. Avery et al. Basis of the horizontal–vertical illusion. Journal of Experimental Psychology (1969).
  • M. Bar. Visual objects in context. Nature Reviews: Neuroscience (2004).
  • N. Burgess. The hippocampus, space, and viewpoints in episodic memory. The Quarterly Journal of Experimental Psychology (2002).
  • R.H. Day et al. Absence of the horizontal–vertical illusion in haptic space. Journal of Experimental Psychology (1970).
  • C.A. Dickinson et al. Transsaccadic representation of layout: What is the time course of boundary extension? Journal of Experimental Psychology: Human Perception and Performance (2008).
  • V.A. Diwadkar et al. Viewpoint dependence in scene recognition. Psychological Science (1997).
  • R.A. Epstein. Cognitive neuroscience: Scene layout from vision and touch. Current Biology (2011).
  • C. Fantoni et al. Surface interpolation and 3D relatability. Journal of Vision (2008).
  • N. Franklin et al. Searching imagined environments. Journal of Experimental Psychology: General (1990).
  • K.M. Gagnier (2010). Rethinking boundary extension: The role of source monitoring in scene memory. Retrieved from...
  • K.M. Gagnier et al. Fixating picture boundaries does not eliminate boundary extension: Implications for scene representation. The Quarterly Journal of Experimental Psychology (2013).
  • K.M. Gagnier et al. When less is more: Line-drawings lead to greater boundary extension than color photographs. Visual Cognition (2012).
  • N.A. Giudice et al. Functional equivalence of spatial images from touch and vision: Evidence from spatial updating in blind and sighted individuals. Journal of Experimental Psychology: Learning, Memory & Cognition (2011).
  • N.A. Giudice et al. Evidence for amodal representations after bimodal learning: Integration of haptic–visual layouts into a common spatial image. Spatial Cognition & Computation (2009).
