Integrating conceptual knowledge within and across representational modalities
Introduction
Semantic memory contains a great deal of knowledge regarding lexical concepts such as dog and banana, and as such is important for language processing, perception, reasoning, and action. Concepts referring to living and non-living things include information such as how something looks, tastes, feels, and sounds, and how it is used. The manner in which this knowledge is represented and organized greatly impacts behavior. It is intuitive to think of this conceptual knowledge in terms of features. For example, how a typical dog looks or sounds can be described by features such as 〈has legs〉, 〈has a tail〉, 〈has a nose〉, 〈has ears〉, 〈barks〉, and so on. Although some models of semantic memory are not based on feature representations – for example, Latent Semantic Analysis (Landauer & Dumais, 1997) – feature-based models, which describe concepts as collections of features at some level of abstraction, dominate the literature.
The manner in which types of featural knowledge are neurally organized and integrated differentiates semantic memory models. Many of the remarkable capabilities of the human conceptual system are attributable to a large and highly interconnected network of processing units. As is elaborated below, although activity in one processing unit or region may eventually influence that of many others, it often does so indirectly. The fact that connectivity patterns determine the speed and/or strength of signal propagation between units has a number of important behavioral consequences. Therefore, the organization of conceptual representations directly influences cognitive processing because it determines the manner in which subsystems influence one another, and the temporal dynamics of such influences.
The goal of the present research is to provide the first direct test of two broad, central assumptions that have been made concerning how the brain organizes and uses types of knowledge. We test their behavioral consequences in tasks that are sensitive to the temporal dynamics of semantic processing in a distributed multimodal representational system. These assumptions concern whether modality-specific concepts are integrated using (1) direct connections or a single convergence zone (semantic hub), versus (2) a deeper hierarchy of convergence zones. A convergence zone is a neural region that binds information (Damasio, 1989a, Damasio, 1989b). Convergence zones can, for example, bind features from a single modality into combinations (as in co-occurring clusters of visual parts) or bind features from multiple modalities (as in clusters of visual parts and correlated functions). Convergence zones can also bind information from lower-level convergence zones (and thus form a hierarchy).
Experiments 1 and 2 use feature relatedness judgments to tap participants’ knowledge of relations for within-modal (〈has two wheels〉 〈has handle bars〉) and cross-modal (〈used by riding〉 〈has handle bars〉) feature pairs. Experiments 3 and 4 use a dual-feature verification task with either within-modal (〈has pockets〉 〈has sleeves〉 coat) or cross-modal feature pairs (〈worn for warmth〉 〈has sleeves〉 coat). We found that within-modal feature relatedness latencies are shorter than cross-modal ones, and verification latencies are shorter given features from two modalities rather than one. These results favour models in which distributed modality-specific conceptual representations are bound together using a deep hierarchy of convergence zones.
To gain insight into the knowledge underlying people’s concepts, researchers use tasks in which participants list features such as 〈has four legs〉, 〈has fur〉, 〈has a tail〉, and 〈barks〉 for concepts like dog. Such features have been useful in accounting for a range of behaviors, from similarity judgments (Tversky, 1977) to theory generation (Ahn et al., 2002, McNorgan et al., 2007). Although features like 〈is man’s best friend〉 for dog reflect encyclopaedic-like knowledge, perhaps acquired linguistically, many features are learned by directly experiencing concepts’ referents through the senses. For example, one sees that a dog has four legs, hears that it barks, and feels that it is covered in fur. Thus, many features are strongly associated with particular senses. Feature production norms, therefore, can be used to provide insight into the salience and amount of knowledge that people possess for each sensory modality with respect to individual concepts.
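To make the feature-based view concrete, the following Python sketch represents two concepts as sets of verbalized features and scores their similarity in the spirit of Tversky’s (1977) contrast model. The feature sets and parameter values are illustrative assumptions, not items drawn from any norming study.

# Feature-based concept representations and a Tversky-style contrast
# similarity: s(a, b) = theta*|common| - alpha*|a only| - beta*|b only|.
# Feature sets and weights are hypothetical.
dog = {"has_four_legs", "has_fur", "has_a_tail", "barks"}
wolf = {"has_four_legs", "has_fur", "has_a_tail", "howls", "lives_in_woods"}

def contrast_similarity(a, b, theta=1.0, alpha=0.5, beta=0.5):
    return theta * len(a & b) - alpha * len(a - b) - beta * len(b - a)

print(contrast_similarity(dog, wolf))  # 3 common, 1 vs. 2 distinctive -> 1.5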
It has long been known that some brain regions are specialized for perception in specific sensory modalities. The question of representational modality concerns the extent to which conceptual organization is tied to perceptual organization. That is, given that perception across the senses is distributed, at least in part, across specialized brain regions, it is possible that people’s conceptual representations are organized similarly. One way to contrast models is to partition them into amodal versus multimodal theories. Although various amodal theories make different assumptions with respect to what information is stored in semantic memory, all assume that objects or their features are represented in a single homogeneous store. For amodal models, the sensory modality through which knowledge is gained is irrelevant to the representation of that knowledge because this information is lost when it is transduced into mental symbol systems. In contrast, multimodal theories posit that concepts are distributed across a wide network of brain areas, and that a concept’s features are tied to sensory modalities.
The issue of whether the human conceptual system is multimodal or amodal remains under debate. However, the bulk of recent evidence from a number of lines of research favours the multimodal account. The literature regarding patients with category-specific semantic deficits has long been used to argue for multimodal representations. Warrington and McCarthy’s (1987) sensory/functional theory accounts for patterns of category-specific impairments of knowledge in patients who have suffered focal or diffuse brain damage, under the assumption that living things and artifacts differentially depend on visual and functional information – an assumption that has been supported and extended by analyses of feature norms (Cree & McRae, 2003; Garrard et al., 2001), and by a number of functional neuroimaging (see Martin, 2007; Martin & Chao, 2001, for reviews) and ERP experiments (Sitnikova, West, Kuperberg, & Holcomb, 2006).
The imaging literature also provides a wealth of evidence extending the sensory/functional theory that supports a distributed multimodal representational system. Goldberg, Perfetti, and Schneider (2006a) used fMRI to tie together previously reported neuroimaging evidence supporting modally bound tactile, colour (Martin et al., 1995; Mummery et al., 1998), auditory (Kellenbach, Brett, & Patterson, 2001), and gustatory representations (Goldberg, Perfetti, & Schneider, 2006b). Goldberg et al. (2006a) found that sensory brain areas for each modality were recruited during a feature verification task that used linguistic stimuli (e.g., banana–yellow). These results indicate that the semantic representations activated by linguistic stimuli are modally distributed across brain regions. In summary, a number of complementary techniques provide converging evidence supporting a distributed multimodal representational system.
Though concepts may be distributed neurally across a wide network, our mental experiences of them are not a jumble of features, disjointed across space and time, but instead they are experienced as coherent unified objects. Any model that uses distributed feature representations must account for what is sometimes called the binding problem: How are representational elements integrated into conceptual wholes? Similarly, how are we able to infer one feature from the presence of another, such as the likelihood that something flies if it has feathers? If one makes the additional assumption that semantic representations are modally distributed, the binding problem becomes further complicated because it raises the question of whether within-modal binding is accomplished differently than cross-modal binding, or differs by modality. Understanding how distributed representations are integrated into conceptual wholes is therefore of central importance to evaluating semantic memory models and understanding brain function.
One solution to the binding problem involves temporal synchrony between firing rates of neurons (von der Malsburg, 1981, 1999). Object representations may be derived from the coincidental firing of distributed neural populations, bound together by virtue of firing at a particular rate. The dominant competing solution to the binding problem, described in some detail by Damasio (1989a), relies on the convergence zone, defined as a “record of the combinatorial arrangements [of feature-based sensory or motor activity]” (p. 26). A convergence zone can be thought of as a collection of processing units that receive input from, and encode coincidental activity among, multiple input units. In connectionist terms, a convergence zone may be likened to a hidden layer (Sejnowski, Kienker, & Hinton, 1986). Because they encode time-locked activation patterns, an important property of convergence zones is that they transform, rather than simply repeat, signals, with one consequence being that convergence zones encapsulate information. In this way, successive convergence zones (or iterative feedback through individual convergence zones) may gradually build more complex or abstract representations. Naturally, these binding mechanisms are not mutually exclusive, and it is possible that various systems rely to differing degrees on either or both, as is suggested by evidence supporting both binding mechanisms (Treisman, 1996). We focus on convergence zones because the organization of these regions clearly plays an important role in multimodal semantic processing (Patterson et al., 2007; Simmons & Barsalou, 2003), and because we have strong predictions about how this organization should influence within- and cross-modal semantic processing.
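In connectionist terms, this encapsulation property can be illustrated with a toy network in which a single convergence-zone unit detects the coincidence of a set of feature units and, through reciprocal feedback, reinstates the full feature pattern from a partial one. The following is a minimal hand-wired sketch; the unit names, weights, and threshold are illustrative assumptions, not a published model.

import numpy as np

# Feature units, in order: [has_wheels, has_handlebars, used_by_riding, barks]
features = np.array(["has_wheels", "has_handlebars", "used_by_riding", "barks"])
w_in = np.array([1.0, 1.0, 1.0, 0.0])  # feature units converging on the zone
w_out = w_in.copy()                    # reciprocal feedback projections

def settle(pattern, threshold=1.5):
    """One feedforward-feedback cycle through the convergence zone."""
    cz = float(pattern @ w_in > threshold)          # zone fires on coincident input
    return np.clip(pattern + cz * w_out, 0.0, 1.0)  # feedback reinstates the pattern

partial = np.array([1.0, 1.0, 0.0, 0.0])  # only two 'bicycle' features active
completed = settle(partial)
print(features[(completed > 0) & (partial == 0)])  # ['used_by_riding'] is inferred
print(completed)                                   # [1. 1. 1. 0.]; 'barks' stays off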
If the multimodal conceptual system is built atop the highly interconnected perceptual system, one might reasonably assume that a similar pattern of connectivity has developed in the semantic system, and that the same neural regions that serve as sensory convergence zones also act as representational convergence zones. This need not be the case, however. For example, although the perceptual and conceptual systems may share some of the same pathways, there may be practical reasons for two functionally independent systems of convergence zones to have emerged (this may prevent synaesthetic experiences, or prevent top-down processing from inducing hallucinations). Moreover, although there may be processes common to perceptual and conceptual binding, they do differ in important ways. Segregating objects from the background or accommodating partially occluded objects are problems for object parsing in perceptual binding that do not seem to apply (or perhaps apply only analogously) in conceptual binding. Thus, even if there were a consensus about how the cognitive system solves the perceptual binding problem (and there is not), the manner in which multimodal semantic integration occurs would remain an open question.
A number of multimodal semantic theories have been proposed in the last two decades, and each makes somewhat different assumptions about the modalities that are represented and the relationships among them. These models can be broadly grouped into two classes, deep and shallow, based on the assumed hierarchy of convergence zones. Differences in assumed connectivity lead to untested predictions for how modally distributed information is integrated. Thus, tasks that are sensitive to the time course of integration of featural information either within or across modalities constrain models of semantic representation.
We use hierarchical depth to describe models with respect to the number and configuration of convergence zones, ranging from the shallowest models with no convergence zones to arbitrarily deep models. Although hierarchical depth is a continuous dimension, an interesting distinction can be made between shallow models that assume zero or one convergence zone, versus deeper models with multiple convergence zones.
Hierarchically shallow models are those in which all semantic integration occurs in the same location. Modally segregated representational stores pass information to one another either through direct connections (and thus lack any convergence zones, as in Fig. 1), or through a single convergence zone that integrates information from all representational modalities (Fig. 1). A number of proponents of multimodal semantic representations have put forward shallow-hierarchy models. Farah and McClelland’s (1991) implementation of Warrington and McCarthy’s (1987) sensory/functional theory, depicted in Fig. 1, and the attractor network used in Cree, McNorgan, and McRae’s (2006) investigation of the roles played by distinguishing and shared features use direct interconnections between processing units, and do not include any distance assumptions. Examples of models employing a single convergence zone include the attractor network described in Cree, McRae, and McNorgan’s (1999) simulations of semantic priming effects, Humphreys and Forde’s (2001) Hierarchical Interactive Theory (HIT) model, and Patterson et al.’s (2007) semantic hub model (however, see Section 6 for possible alternative conceptions of Patterson et al.).
Hierarchically deep models are those in which connective distance differs across feature pairs. Initial convergence zones generally integrate information from nearby representational units within a single modality, whereas others with successively larger receptive fields integrate multimodal information from more distant brain areas. In the most clearly hierarchical models, this information is passed forward from earlier convergence zones (Damasio, 1989a, 1989b; Simmons & Barsalou, 2003).
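The behavioural contrast between the two classes of models can be made concrete by counting synapses along the shortest path between feature units. In the toy connectivity graphs below (all unit names are hypothetical, and the graphs implement no particular published model), a shallow single-hub architecture makes within- and cross-modal feature pairs equidistant, whereas a deep architecture routes cross-modal pairs through an additional level of convergence zones.

from collections import deque

def n_synapses(graph, start, goal):
    """Breadth-first search: number of synapses on the shortest path."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))

# Shallow model: a single hub binds every feature, regardless of modality.
shallow = {
    "has_wheels": ["hub"], "has_handlebars": ["hub"], "used_by_riding": ["hub"],
    "hub": ["has_wheels", "has_handlebars", "used_by_riding"],
}

# Deep model: modality-specific zones feed a higher multimodal zone.
deep = {
    "has_wheels": ["visual_cz"], "has_handlebars": ["visual_cz"],
    "used_by_riding": ["functional_cz"],
    "visual_cz": ["has_wheels", "has_handlebars", "multimodal_cz"],
    "functional_cz": ["used_by_riding", "multimodal_cz"],
    "multimodal_cz": ["visual_cz", "functional_cz"],
}

for name, g in [("shallow", shallow), ("deep", deep)]:
    within = n_synapses(g, "has_wheels", "has_handlebars")
    cross = n_synapses(g, "has_wheels", "used_by_riding")
    print(f"{name}: within-modal = {within} synapses, cross-modal = {cross}")
# shallow: within-modal = 2, cross-modal = 2 -> no modality effect predicted
# deep:    within-modal = 2, cross-modal = 4 -> cross-modal integration slower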
Plaut (2002) implemented a computational model that uses a single hidden layer to make within- and cross-modal connections, and is therefore, strictly speaking, a shallow model. There is, however, an influence of connective distance: features within a modality are connected by relatively short proximal connections, whereas those across modalities are connected by relatively long distal connections. Given that neurons predominantly form short connections (Jacobs & Jordan, 1992), Plaut stated that the “literal implementation of [the model] is implausible” (p. 626). Thus, connective distance in Plaut’s model corresponds to differing numbers of interposing processing units, which leads to one of two assumptions. The first is that longer connections comprise chains of signal repeaters; that is, each unit in the chain simply passes along an unmodified signal. However, neurons typically receive multiple connections, which in turn allows them to integrate, perform computations on, and modify signals. Thus, the second (and we believe more plausible) assumption is that increasing physical distance introduces additional integrative units, which, in essence, corresponds to introducing convergence zones. For this reason, we classify Plaut’s model as quasi-hierarchical.
In amodal models, because information is not functionally segregated by sensorimotor modality, convergence zones are not strictly required. One possibility is that associations among an object’s features are coded via direct connections reinforced through statistical learning (Tyler & Moss, 2001). Note that the absence of modality information does not preclude convergence zones. Concept names, for example, could be assumed to encapsulate features of the concepts they represent, thereby integrating featural information. However, regardless of whether a particular amodal model includes convergence zones, the distinction between within-modal and cross-modal feature pairs would not influence the tasks used in the present experiments because factors like correlational strength between features were equated in our experiments.
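As a concrete illustration of direct feature–feature associations reinforced through statistical learning, the sketch below tallies co-occurrence over a handful of invented exemplars and reads off the kind of conditional probability that would support a 〈has feathers〉 → 〈flies〉 inference. The exemplars and features are hypothetical.

import numpy as np

# Columns: [has_feathers, flies, has_fur, barks]; rows: observed exemplars.
exemplars = np.array([
    [1, 1, 0, 0],   # robin
    [1, 1, 0, 0],   # sparrow
    [1, 0, 0, 0],   # penguin
    [0, 0, 1, 1],   # dog
    [0, 0, 1, 0],   # cat
])
cooc = exemplars.T @ exemplars              # raw co-occurrence counts
p_flies_given_feathers = cooc[1, 0] / cooc[0, 0]
print(p_flies_given_feathers)               # 2/3: 'has feathers' predicts 'flies'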
A number of theoretical considerations favour a shallow integration hierarchy. Multimodal semantic models have been criticized as lacking parsimony (Riddoch, Humphreys, Coltheart, & Funnell, 1988); models specifying multiple hierarchical convergence zones would therefore seem even less parsimonious. Furthermore, many semantic phenomena have been simulated using networks lacking convergence zones (Cree et al., 2006; Farah & McClelland, 1991), implying that a deep hierarchy of convergence zones may not be necessary. Patterson et al. (2007) contend that the generalized impairments that accompany semantic dementia are best explained by a single semantic hub that integrates information from all modalities, and Rogers et al. (2004) present a computational implementation of this idea that simulates a number of behavioral phenomena exhibited by semantic dementia patients.
In contrast, anatomical constraints seem to suggest a hierarchically deep organization. First, the volume of the human skull precludes the degree of connectivity required for the shallowest models that lack any convergence zones (Plaut, 2002). Bidirectional communication within a bank of n processing units requires on the order of n² direct connections, but only on the order of n log n connections when units communicate through higher-order integrating units capable of pattern separation. One might reasonably assume that the savings in the number of connections would favour a hierarchical organization. Second, candidate brain regions for a single convergence zone should have reciprocal projections to all modalities, and ablation of such an area should preclude any sort of multimodal conceptualization. Damasio (1989a) argues that the only such region is the hippocampus, and because bilateral ablation of this structure does not lead to a catastrophic loss of the ability to conceptualize, it is unlikely that semantic integration occurs within a single convergence zone. On the other hand, one could argue that the sort of generalized impairments that accompany semantic dementia constitute a progressive breakdown of the conceptual system. Because this disease is accompanied by degeneration of the anterior temporal lobes, Patterson et al. (2007) argue that this region is the locus of a single semantic hub. Third, the arrangement of cells into functionally distinct layers with progressively larger receptive fields, as in visual cortex, may occur elsewhere in the brain, including regions supporting conceptual processing, and would implement the sort of deep hierarchy suggested by Damasio (1989a, 1989b) and Simmons and Barsalou (2003).
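The connectivity argument is straightforward to quantify. The snippet below (unit counts are arbitrary, chosen only for illustration) contrasts the roughly n² directed connections required for full pairwise interconnection with the roughly n log n connections required when traffic is routed through integrating units.

import math

for n in (1_000, 1_000_000):
    direct = n * (n - 1)             # full pairwise bidirectional wiring
    hierarchical = n * math.log2(n)  # routing via higher-order integrators
    print(f"n = {n:>9,}: direct ~ {direct:.1e}, hierarchical ~ {hierarchical:.1e}")
# n =     1,000: direct ~ 1.0e+06, hierarchical ~ 1.0e+04
# n = 1,000,000: direct ~ 1.0e+12, hierarchical ~ 2.0e+07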
The preceding discussion highlights a number of arguments favouring each of the two major assumptions regarding hierarchical organization. Both assumptions have been incorporated in models that have been used to explain a number of behavioral phenomena. Furthermore, the literature that speaks to the brain’s connectivity, which would appear to be the best source of insight into constraints on the brain’s ability to integrate information, does little to resolve the matter. A number of brain regions, including perirhinal cortex (Bussey, Saksida, & Murray, 2002), anterior temporal cortex (Patterson et al., 2007), frontal and prefrontal cortex (Fuster et al., 2000; Green et al., 2006), and left inferotemporal cortex (Damasio, Tranel, Grabowski, Adolphs, & Damasio, 2004) have been put forward as critical structures for learning relationships among features from multiple modalities. However, it is unclear whether these areas represent a network of regions that act as a single convergence zone in a shallow system, or a hierarchy of convergence zones.
In summary, there is a sizeable literature that strongly supports the idea that semantic integration is accomplished across multiple brain regions. Because neural connectivity patterns appear to be consistent with both shallow and deep models, the existing literature is far from conclusive regarding the manner and location(s) of semantic integration.
The physical relationships among modality-specific representational areas and their convergence zones are assumed to influence the time course of information integration. Neurally proximal areas should generally communicate with one another in less time than distal areas. The aspect of proximity that we explore concerns the fact that neurons may (and commonly do) communicate with one another indirectly. Thus, rather than measure the spatial distance between two neurons, one might instead count the number of synapses that connect them. Because the transmission of information at the synapse is not instantaneous, communication between a directly connected pair of neurons may be faster than between an indirectly connected but physically closer pair of neurons. When thinking about distance in terms of number of connections, one might also describe communication time between two processing units in terms of processing steps, with each step reflecting a unit of time required for communication across a synapse. Because shallow and deep integration hierarchies differ in this regard, it is clear that a model’s integration hierarchy should influence the predicted time course of semantic processing. Moreover, as explained below, this influence may differ by task.
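This processing-step logic can be simulated directly. In the toy spreading-activation sketch below, a clamped source unit passes a fraction of its activation across one synapse per time step, so a signal crossing more synapses arrives both later and weaker at its target; the propagation rate and response threshold are assumed parameters, not fitted values.

import numpy as np

def steps_to_activate(n_synapses, rate=0.5, threshold=0.05):
    act = np.zeros(n_synapses + 1)
    act[0] = 1.0                     # source unit clamped on
    for t in range(1, 50):
        act[1:] += rate * act[:-1]   # each time step crosses one synapse
        if act[-1] >= threshold:
            return t                 # time at which the target unit responds
    return None

print(steps_to_activate(2))  # within-modal route (2 synapses): t = 2
print(steps_to_activate(4))  # cross-modal route (4 synapses):  t = 4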
We make two critical assumptions, neither of which favours shallow or deep-hierarchy accounts. The first is that reading a feature name such as 〈has a blade〉 activates the underlying modality-specific representation, as has been demonstrated repeatedly in the imaging literature (Chao & Martin, 1999; Goldberg et al., 2006a; Simmons et al., 2005; Simmons et al., 2007). One might argue instead that reading a feature name activates the corresponding modality-specific representation only after a delay, so that all stimuli in our experiments, because they are presented as words, would initially belong to the same modality – an amodal or ‘lexical’ modality, depending on the position one takes. Importantly, if this were correct, null effects would be predicted in the experiments reported herein.
We further assume that activity spreads outward via neural connections and promotes activation of other representations, from which further activation spreads outward, and so forth. In this way, if a convergence zone forms a path between representational stores, then verbally presented features can effectively prime subsequently presented ones, either individually, as in feature-to-feature inference (inferring that if something has wheels, it is used by riding), or as entire clusters of features, as in feature-to-concept activation (classifying a small animal as a skunk on the basis of its size, shape, colouration, and gait). The manner in which features are integrated should thus influence integration-based decision latencies. To investigate and constrain semantic memory models, the present research uses feature-to-feature (Experiments 1 and 2) and feature-to-concept judgments (Experiments 3 and 4) on correlated feature pairs from a large set of feature production norms (McRae, Cree, Seidenberg, & McNorgan, 2005).
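Feature-to-concept activation can be illustrated as pattern completion in a Hopfield-style attractor network. The sketch below stores two hand-built orthogonal concept patterns and shows a degraded cue settling back to the full stored pattern; the feature vectors are hypothetical, and published attractor models (e.g., Cree et al., 2006) are trained rather than hand-wired in this way.

import numpy as np

# +1 = feature present, -1 = absent; the stored patterns are orthogonal.
skunk = np.array([ 1,  1,  1,  1, -1, -1, -1, -1])  # hypothetical feature vector
bike  = np.array([ 1, -1,  1, -1,  1, -1,  1, -1])  # hypothetical feature vector

W = np.outer(skunk, skunk) + np.outer(bike, bike)   # Hebbian storage
np.fill_diagonal(W, 0)                              # no self-connections

cue = skunk.copy()
cue[6:] = -cue[6:]            # degrade the cue: two features misreported
state = np.sign(W @ cue)      # one settling step of spreading activation

print(np.array_equal(state, skunk))  # True: the full concept is reinstated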
Experiment 1
Experiment 1 used within- and cross-modal feature pairs in a speeded relatedness decision task to test the role that modality plays in feature inference. Relatedness decisions are a fairly transparent measure of people’s knowledge of the relations between object features, and therefore tap the types of processes used during feature inference (McNorgan et al., 2007). Feature inference involves determining the probability that some feature B exists for an object, given that the object is known to possess some other feature A.
Experiment 2
Experiment 2 replicated Experiment 1 using a more rigorously controlled set of items and a modified presentation paradigm. We expected to find the same pattern of results, which would again support hierarchically deep models.
Experiment 3
People possess a rich knowledge of many objects that spans multiple representational modalities. Accordingly, when identifying an object, people generally do so using only a subset of the information they possess about it, and yet they are able to retrieve other knowledge about it. For the sake of brevity, we use the term “concept activation” as a short form for activating concepts given one or more features. One way to think about this process is in terms of pattern completion. In a system
Experiment 4
As explained above, deep-hierarchy models predict modality effects only for tasks in which processing speed is the primary determinant of performance. With additional processing time, overall semantic activation in the two modality conditions should approach the same level, and should furthermore allow participants to employ strategic processing. Experiment 3 used a 1000 ms SOA between the first and second feature, and between the second feature and the target concept to ensure that
General discussion
The present research used complementary behavioral tasks to investigate the neural architecture underlying integration of multimodal semantic representations. In Experiments 1 and 2, using a feature relatedness task, we found a within-modal advantage that is predicted by hierarchically deep models. However, necessary aspects of these experiments’ design left some issues open. The first concern was that functional features, which may take longer to retrieve, appear only in the slower cross-modal
Conclusion
In the present research, we used complementary behavioral tasks to test assumptions regarding the neural architecture of semantic memory. Our studies provide clear evidence for the existence of a deep hierarchy in a multimodal distributed semantic memory system.
Acknowledgements
This work was supported by Natural Sciences and Engineering Research Council Discovery Grant 0155704 and National Institutes of Health Grant HD053136 to KM.
References

Ashcraft, M. H. (1978). Feature dominance and typicality effects in feature statement verification. Journal of Verbal Learning and Verbal Behavior.
Damasio, A. R. (1989). Time-locked multiregional retroactivation: A systems-level proposal for the neural substrates of recall and recognition. Cognition.
Damasio, H., Tranel, D., Grabowski, T. J., Adolphs, R., & Damasio, A. R. (2004). Neural systems behind word and concept retrieval. Cognition.
Edelman, S., & Bülthoff, H. H. (1992). Orientation dependence in the recognition of familiar and novel views of three-dimensional objects. Vision Research.
Green, A. E., Fugelsang, J. A., Kraemer, D. J. M., Shepherd, G. M., & Dunbar, K. N. (2006). Frontopolar cortex mediates abstract integration in analogy. Brain Research.
Hauk, O., Johnsrude, I., & Pulvermüller, F. (2004). Somatotopic representation of action words in human motor and premotor cortex. Neuron.
Kalénine, S., Peyrin, C., Pichat, C., Segebarth, C., Bonthoux, F., & Baciu, M. (2009). The sensory–motor specificity of taxonomic and thematic conceptual relations: A behavioural and fMRI study. NeuroImage.
Martin, A., & Chao, L. L. (2001). Semantic memory and the brain: Structure and processes. Current Opinion in Neurobiology.
Sejnowski, T. J., Kienker, P. K., & Hinton, G. E. (1986). Learning symmetry groups with hidden units: Beyond the perceptron. Physica D.
Simmons, W. K., Ramjee, V., Beauchamp, M. S., McRae, K., Martin, A., & Barsalou, L. W. (2007). A common neural substrate for perceiving and knowing about color. Neuropsychologia.