1 Introduction

There are some complex experiences, such as those that respectively allow us to understand linguistic expressions and pictures, which seem very similar. For they are stratified experiences in which, on top of grasping certain low-level properties, one also grasps some high-level, semantic-like properties. Yet first of all, those similarities notwithstanding, we claim that a phenomenologically-based reflection shows that such experiences are different (§ 2). For a meaning experience has a high-level fold, in which one grasps the relevant expression’s meaning, that is not perceptual, but is only based on a low-level perceptual fold that merely grasps that expression in its visually or acoustically relevant properties (colors and shapes, or sounds, and possibly also its morpho-syntactic organization). A pictorial experience, by contrast, what Wollheim (1980, 1987, 1998, 2003a, b) takes to be a seeing-in experience, has two folds, the configurational and the recognitional fold, in which one respectively grasps the physical basis of a picture, its vehicle, and what the picture presents, its subject. Both folds are perceptual, insofar as they are intimately connected: unlike in a meaning experience, in a seeing-in experience one can perceptually read off the picture’s subject from the picture’s vehicle. Moreover, and very interestingly, as we shall claim, this phenomenological difference is neurologically implemented. First, the cerebral areas that respectively implement such experiences are different, at least as far as access to those experiences’ respective high-level contents is concerned, as is shown by the fact that one can be selectively impaired in the area implementing the meaning experience without losing one’s pictorial competence, and vice versa (§ 3). Second, unlike with meaning experiences, the area implementing the seeing-in experiential folds is perceptual as a whole: not only can a picture’s subject be accessed earlier than an expression’s meaning, but the neural underpinnings of those folds are also located in the perceptual areas of the brain (§§ 3–4), as is shown, inter alia, by the particular case of one’s competence with ambiguous pictures on the one hand and with ambiguous expressions on the other (§ 4).

2 Unlike Seeing-in Experiences, Meaning Experiences Are Not Proper Fusion Experiences, for They Are Not Perceptual but Only Perceptually-Based

On the one hand, seeing-in experiences, the experiences that for Wollheim (1980, 1987, 1998, 2003a, b) determine what it is for a depiction (an intentionally-based picture such as a painting, a sketch or a drawing, as well as a causally-based picture such as a photo, a movie or TV shot, and perhaps also a mirror image or a shadow) to be a pictorial representation, are twofold experiences. Their first fold, the configurational fold, consists in the perception of the pictorial vehicle, i.e., the picture in its organized physical basis. Their second fold, the recognitional fold, which depends for its existence on the first fold, consists in the perception of the pictorial subject, i.e., the scene the picture presents.Footnote 1

On the other hand, meaning experiences, the experiences Strawson (1994) labeled experiences of understanding, are also twofold experiences. They are constituted by a first fold in which one perceives, either visually or auditorily, an expression in its morpho-syntactic structure and, on top of the first, a second fold, the proper meaning fold, in which one experiences the meaning of that expression.Footnote 2

Their similarity notwithstanding, these experiences are of a different kind. For, while there is room to consider seeing-in experiences as, though sui generis, perceptual experiences, meaning experiences can only be perceptually-based experiences. For, unlike the recognitional fold, the second fold of a meaning experience is experiential, but not perceptual in character (Voltolini 2020a).

In order to argue for this result, on the one hand, one may start by noticing that in the case of a seeing-in experience, one can read off what is grasped in the recognitional fold, the pictorial subject, from what is grasped in the configurational fold, the pictorial vehicle. To begin with, in order to understand how this reading-off works, one must remark that, as Wollheim himself (1987: 46) underlines, the two folds are not the same as the corresponding experiences of the vehicle and of the subject of a picture taken in isolation. In particular, the configurational fold is not the same as the perception, taken in isolation, of what stands in front of the picture’s experiencer. One way of accounting for this difference is to claim that such a fold and that perception differ in their object, or better, in their object’s properties (Voltolini 2015), since the fold has a content that is richer than that of that perception. That perception grasps the physical object facing the experiencer qua mere 2D object among other physical objects; let us call it the mere picture’s vehicle. By contrast, the configurational fold grasps what we called the pictorial vehicle, or, as we can now say, the vehicle qua enriched by its grouping properties, i.e., the properties its elements have of being arranged in a certain way. In particular, these are the grouping properties organized in the third dimension;Footnote 3 namely, the properties of the vehicle’s elements of being arranged according to a certain direction along a certain dimension in a 3D space. This arrangement enables one to see in the configurational fold an item, the pictorial vehicle, which, unlike the vehicle taken in isolation, is not a mere 2D item, but a 3D-like item. Now, as we said, such a grasp of the pictorial vehicle is still perceptual. For, although grouping properties are high-level properties, since they merely depend (generically)Footnote 4 on the low-level perceptual properties of the vehicle, i.e., its colors and shapes, their apprehension is perceptual. For not only is that apprehension immediate, just like the apprehension of such low-level properties, but it is also based on a perceptually relevant selective form of attention (Stokes 2018); notably, a holistic form of attention that enables one to perceive the vehicle as appropriately grouped. This may be noticed from the fact that once this form of attention is activated, the scene one perceives radically changes. This is the form of attention that Nanay (2016, 2019) takes to be focused on an object and distributed across its properties. Indeed, as Calzavarini and Voltolini (2022) maintain, immediacy and holistic attention are not only necessary, but possibly jointly sufficient, conditions for the perception of high-level properties. There is no room here to properly deal with the issue of the distinction between perception and cognition (see Stokes 2018 for details), yet such criteria may definitely help in drawing a divide between the two kinds of mental states: perceptual states are states that only grasp either low-level properties or high-level properties singled out by means of those criteria.

Moreover, in the recognitional fold of the seeing-in experience one can read off the pictorial subject from the vehicle so arranged once one perceives that vehicle in that arrangement. For that arrangement enables one to perceptually recognize that subject in that vehicle. Given that enabling, indeed, that recognition has a perceptual status as well. In other words, perceiving the vehicle so arranged makes it the case that one perceives that subject as well. More precisely, the fact that the configurational fold has an enriched content mobilizing 3D grouping properties of the pictorial vehicle enables one to recognize, in the recognitional fold, a different 3D item in that vehicle – notably, a 3D scene (or, in a very similar proposal, a spatiotemporal region: Nanay 2022) – by virtue of the fact that the content of the latter fold matches the content of the former fold; in particular, elements to which a certain 3D location is ascribed in the former fold correspond to elements to which that location is ascribed in the latter fold (Voltolini 2015). This is proved by the fact that, as Wollheim himself intuited by distinguishing seeing-in experiences from the experiences of figures in the Rorschach tests (1980: 138-9), there is no voluntary or otherwise arbitrary element in the subject’s apprehension, as would be the case if that apprehension had an imaginative rather than a perceptual nature (see also Nanay 2022 for a non-imagistic but perceptual account of seeing-in experiences).

One can vividly realize that the two aforementioned folds work as stated in the seeing-in experience by appealing to a paradigmatic case, the case of experiencing ‘aspect dawning’ pictures. In this case, instead of perceiving a picture at once as one normally does, one can split an earlier perception of the mere picture’s vehicle from a later seeing-in experience of the picture. The earlier perception is just a perception of the vehicle taken in isolation, a mere 2D item characterized by its low-level properties (its colors and shapes). By contrast, the seeing-in experience is constituted not only by a perception of the pictorial vehicle as the configurational fold of that experience, hence by something that has an enriched content due to the 3D grouping properties it mobilizes, but also by the recognitional fold of that experience in which the pictorial subject is also perceived, i.e., a 3D scene matching the 3D-like silhouettes that are perceived in the configurational fold. Consider the famous picture of a Dalmatian. At time t, one merely perceives an array of black and white spots. Yet at time t’, by holistically attending to that array, one manages to group it according to a figure-ground 3D segmentation in which a 3D dalmatianwise item protrudes from a background. So, by now facing a pictorial vehicle, at t’ one grasps a content that is richer than the content one grasped at t, when facing a mere picture’s vehicle. By virtue of that very segmentation, finally, one is able to perceive in the vehicle enriched by that segmentation the subject that one recognizes; namely, the 3D scene of a Dalmatian in front of a background.

On the other hand, in a meaning experience one certainly perceives the visual or acoustic properties of the relevant expression, including its morphosyntactic features. Yet one cannot read off the meaning of an expression from so perceiving that expression, even in its morphosyntactical complexity. For suppose even that perceiving that expression in its morphosyntactical complexity amounts to a high-level perception in which one perceives that expression by again holistically attending to its morpho-syntactic structure, which depends (generically as well)Footnote 5 on the low-level properties of that expression.Footnote 6 Nevertheless, pace Wittgenstein (1991: § 869; 2009: I, § 568), one cannot recognize in that expression so articulated its meaning in a perceptually relevant sense; by reading it off, so to say. For there definitely is no matching between the content of the perceptual fold of the meaning experience, in which one sees or hears the expression in its morphosyntactic properties, and the content of the other fold of that experience, the proper experience of the meaning of that expression. Indeed, as Schier (1986) originally noted, unlike pictures, linguistic expressions do not possess natural generativity (understanding one expression does not make one understand any other expression whatsoever, unless one knows its meaning); rather, meaning is conventionally added to an expression, however one accounts for the nature of such conventionality. Hence, even if as regards a meaning experience one ends up having a twofold experience in whose first fold one perceives that expression in its morphosyntactical complexity, the second fold of that experience, the proper meaning fold, is merely juxtaposed to the first one, in the sense that, unlike in a seeing-in experience, no real fusion experience arises from the simultaneous grasping of the two folds of the meaning experience. For, for the above reasons, unlike the folds of a seeing-in experience, those two folds are not compenetrated.

Granted, in a meaning experience the second fold is experiential in character, and that character is irreducible to the admittedly perceptual character of the first fold. One’s overall experience of the expression in question indeed changes once one understands its meaning (Siewert 1998; Horgan and Tienson 2002; Pitt 2004; Strawson 1994; Chudnoff 2015). Yet pace Brogaard (2018), for the aforementioned reason of recognition failure it is too quick to say that such a character is perceptual as well.Footnote 7 Thus, the overall meaning experience is not perceptual either; it only involves a cognitive form of phenomenology (Horgan and Tienson 2002; Pitt 2004; Strawson 1994; Chudnoff 2015). Simply, it is merely perceptually-based, since its first fold is admittedly perceptual.

In order to vividly grasp this point, consider first of all the phenomenon of satiation. Everyone has certainly experienced situations in which, by obsessively repeating a word (say, “fly”), one ends up uttering another word (say, “life”), or no word at all, but just a mere noise. Prima facie, one may think that the phenomenal change at stake in such situations perceptually involves the semantic change experienced: first one perceives the word as having a certain meaning, then as having another meaning, or no meaning at all. Yet this thought is wrong. For there definitely is a perceptual change in such situations, yet this change involves no semantic level, but only the morphosyntactic level qualifying the relevant word. This is proved by the fact that a similar perceptual change may also occur when meaningless words are involved: try e.g. with the meaningless “bly”, which will eventually turn into the meaningless “libe” (or into a mere noise).

In the same vein, moreover, compare the difference between a structurally ambiguous yet lexically meaningless expression and a perceptually ambiguous picture. On the one hand, take the following well-known meaningless sentence from Lewis Carroll’s Jabberwocky:

(1) The slithy toves gyred the Jabberwock in the wabe.

From a morphosyntactical point of view, one can see or hear (1) in two different readings, depending on how one parses it, viz. how one groups the syntagms constituting it by holistically attending to it in different ways:

(1a) (The slithy toves) ((gyred (the Jabberwock)) (in the wabe)).

(1b) (The slithy toves) (gyred (the Jabberwock in the wabe)).

Yet, since no lexical meaning has been assigned to the nouns “tove” and “wabe”, the adjective “slithy”, and the verb “to gyre”, neither (1a) nor (1b) has a lexically determined meaning. So, a fortiori, no meaning can be read off from either (1a) or (1b). For perceiving those readings enables one to recognize no meaning in them. Granted, if meanings were conventionally assigned to the above words, one could experience different meanings in (1a) and (1b) respectively, just as one does with the sentence inaugurating Groucho Marx’s famous joke:

(2) Yesterday I saw an elephant in my pajamas.

One would then have two different twofold meaning experiences. Yet the meaning folds of such experiences would only be juxtaposed to the admittedly perceptual folds of such experiences in which one respectively grasps the different morpho-syntactic readings of (1), without any recognitional factor being involved. Hence, the meaning folds would not be perceptual. Thus, the resulting meaning experiences would not be perceptual, but merely perceptually-based.
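To make the structural ambiguity at stake concrete, here is a minimal sketch of how a symbolic parser derives the two groupings of (1). The toy grammar is our own illustrative assumption, not one drawn from the cited literature; any grammar licensing both prepositional-phrase attachments would serve.

```python
# A toy context-free grammar licensing both groupings of sentence (1):
# the PP "in the wabe" can attach to the verb phrase (reading 1a) or to
# the object noun phrase (reading 1b). Illustrative only.
import nltk

grammar = nltk.CFG.fromstring("""
    S   -> NP VP
    NP  -> Det Nom | NP PP
    Nom -> Adj N | N
    VP  -> V NP | VP PP
    PP  -> P NP
    Det -> 'the'
    Adj -> 'slithy'
    N   -> 'toves' | 'jabberwock' | 'wabe'
    V   -> 'gyred'
    P   -> 'in'
""")

sentence = "the slithy toves gyred the jabberwock in the wabe".split()

# The chart parser returns one tree per admissible grouping: although every
# content word is meaningless, each parse is perfectly determinate.
for tree in nltk.ChartParser(grammar).parse(sentence):
    tree.pretty_print()
```

Running this yields exactly two trees, mirroring the fact that one can perceive two morpho-syntactic readings of (1) by grouping its constituents differently, without thereby grasping any meaning.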

The very same point can be made by appealing to lexically ambiguous sentences. Whoever claims that in:

(3) Dionysus is Greek.

one perceives the sentence’s name as meaning Dionysus the Elder, the tyrant of Syracuse, will be troubled to discover that, while perceiving exactly the same expressions (and possibly even having the very same mental images in mind), one can also experience that name in that sentence as meaning Dionysus the Younger, the former’s son. Clearly, the two meaning experiences related to understanding this meaning difference are different as well (Siewert 1998; Horgan and Tienson 2002; O’Callaghan 2011). Yet no perceptual recognitional work, however mediated by attention, could allow one to experience this difference. One must instead know by other means that the expression is ambiguous in order to experience its different meanings (cf. Martina and Voltolini 2017).

Fig. 1 The Rubin vase

On the other hand, take a perceptually ambiguous picture such as the Rubin vase (fig. 1). Depending on the different 3D figure-ground facewise and vasewise segmentations of the very same mere picture’s vehicle, provided by differently attending to that vehicle holistically, one can have different seeing-in experiences of that ambiguous picture, such that one can read off the different subjects grasped in the respective recognitional folds of such experiences – namely, two white faces in profile on a black background vs. a black vase on a white background – from the respective configurational folds in which one respectively perceives those segmentations. Indeed, one can perceptually recognize such subjects respectively by virtue of such segmentations, so that those seeing-in experiences turn out to be perceptual as well.Footnote 8

At this point, on behalf of meaning perceptualism one might remark that knowing the meaning of an expression induces a different perception of it. This remark would be correct, but only to the extent that the new perception includes morphosyntactic features of that expression that were not included in one’s original perception of it. If one knows, or even believes, that something has a certain meaning, one grasps that what one hears is not a mere noise, but (say) a morphosyntactically articulated sentence of a certain language. The import of that knowledge, or belief, would only be a form of weak cognitive penetration, in the sense defined by Macpherson (2012). Indeed, that knowledge, or belief, would basically induce a difference in the phenomenal character of the perceptual experience involved, so that such an experience would definitely come to have a new, yet utterly non-conceptual, content. One sees or hears an expression in certain morphosyntactic, non-conceptualized features; namely, as an item properly morphosyntactically grouped. In this respect, one may notice that the configurational fold of a seeing-in experience is also weakly cognitively penetrated in the very same sense. If one knows that what one is facing is a picture of a Dalmatian, one can undergo the non-conceptually relevant perceptual change in phenomenal character that transforms the perception of a mere 2D vehicle into the non-conceptual perception of a pictorial vehicle, featuring a 3D-like dalmatianwise item endowed with certain 3D grouping properties. However, this form of weak cognitive penetration does not yet amount to having a seeing-in experience as a whole. For it does not mobilize the recognitional fold of that experience. Likewise, in letting one grasp only the morphosyntactic features of an expression, the form of weak cognitive penetration affecting the perception of that expression does not mobilize a meaning experience as a whole, but only its first, genuinely perceptual, fold. So, rightly observing that such a perception is weakly cognitively penetrated says nothing in favor of the perceptual character of the meaning experience as a whole.

Following McDowell (1998), however, someone might still reply that a meaning experience can be a form of non-sensory extended perception. Now, the notion of a non-sensory form of perception is definitely highly problematic. It is quite disputable, for example, whether intellectual intuition is perceptual in more than a metaphorical sense. The way we have just characterized the perception of high-level properties does not allow intellectual intuition to be ranked as perceptual (see also Chudnoff 2015). For although intellectual intuition may be immediate, it is not imbued with holistic attention towards its object’s properties. Granted, there is a form of perception that is admittedly non-sensory; namely, amodal perception. Yet it is unclear how a meaning experience could be a form of amodal perception. The paradigmatic cases of amodal perception are those in which parts of objects that are otherwise sensorily grasped are occluded by other such parts, so that they are grasped by no sensory modality; the dark side of the Moon, for example. Yet no such phenomenon occurs in the case of a meaning experience. The meaning of an expression is not something that the sensorily given features of that expression occlude, in any plausible sense of the term.

All in all, we can stress that seeing-in experiences are proper fusion experiences, in which the overall experience is different from the sum of its parts (Stumpf 1890). For, as Wollheim intuited, their folds are compenetrated, being no longer identical with the respective experiences of the picture’s vehicle and of the picture’s subject taken in isolation.Footnote 9 By contrast, meaning experiences are not proper fusion experiences. For their second, experiential fold is simply juxtaposed to their first, admittedly perceptual, fold, since it cannot be read off from that fold by virtue of a content matching.

3 There is a Common Semantic System for Seeing-In Experiences and Meaning Experiences, but Only in Seeing-In Experiences is Semantic Access Perceptual

In the previous Section, we have advanced, on a purely phenomenological basis, a series of philosophical considerations in support of the idea that meaning experiences and seeing-in experiences are typologically different, that is, are not experiences of the same kind. Unlike meaning experiences, seeing-in experiences are recognitional experiences of a sort that makes them perceptual experiences. In our opinion, we can arrive at the same conclusion if we consider empirical data in addition to phenomenological intuitions.Footnote 10 In this respect, relevant questions are: Do seeing-in experiences and meaning experiences differ in timing and patterns of activation in the human brain? How do these differences (if any) relate to the nature of these experiences? As we will see, our philosophical considerations are strongly consistent with behavioral and neuroscience data.

In order to argue for this result, a first step is to show that the distinction between the two experiences is cognitively real, that is, that the two experiences are underpinned by distinct dimensions of the cognitive/neural architecture.

Against this hypothesis, however, a defender of the typological commonality might immediately reply that, as regards their high-level aspects (that is, the recognitional fold and the proper meaning fold, respectively), there is a close relationship between those experiences in the human brain. In cognitive neuroscience, it is standardly believed that meaning experiences and seeing-in experiences ultimately converge within a shared central semantic store, a depository of conceptual representations that is equally accessible to linguistic expressions and pictures (fig. 2). Evidence for a shared semantic system comes from observations that lesions in some cortical areas produce remarkably similar high-level deficits in both seeing-in and meaning experiences, as in the case of patients affected by semantic dementia, who consistently show significant atrophy of the anterior temporal lobes of both hemispheres (Lambon Ralph et al. 2017a, b). Support for this hypothesis also comes from neuroimaging studies that have contrasted neural activity during semantic tasks performed either with linguistic expressions or with pictures (e.g., Vandenberghe et al. 1999; Moore and Price 1999; Bright et al. 2004; see also Binder et al. 2009). Using conjunctive analyses, these studies found robust semantic activation for both seeing-in experiences and meaning experiences in an extensive network of associative (i.e., modality-independent) areas in the left hemisphere, covering large sections of frontal and temporal regions. Moreover, according to advocates of the Simulation Framework (e.g., Barsalou 1999, 2016), also called “neo-empiricism” (Prinz 2002), the common semantic system also extends to sensorimotor cortices. Within this framework, access to the high-level proper meaning of concrete, highly imageable words and sentences is supposed to re-activate regions of the brain that are involved in direct perception, such as the visual cortex (Kemmerer 2010).

However, although a common semantic system might be activated similarly during seeing-in experiences and meaning experiences, there is also evidence that these two semantic access routes are significantly different.

Fig. 2 A graphic representation of the semantic system (modified from Hillis & Caramazza 1991)

On the one hand, decades of empirical research have shown that the first fold of meaning experiences is supported by highly specialized neural structures in the visual and the auditory cortex, as well as in the associative cortex, with different underpinnings for orthographical (e.g., Dehaene and Cohen 2011), phonological (e.g., Liebenthal et al. 2005), and morphosyntactical (e.g., Matchin and Hickok 2020) processing. These neural structures appear to have no role in pictorial perception.

On the other hand, picture perception is known to be underpinned by a hierarchically organized perceptual stream that encodes progressively more complex information about the depicted objects and scenes, as well as information about the surface properties of the picture’s vehicle (Nanay 2011; Ferretti 2018; Vishwanath 2014). It is unlikely that all this perceptual information is encoded in linguistic expressions and mobilized during meaning experiences.Footnote 11 At present, contra the Simulation Framework, it is not even established that the re-activation of a detailed perceptual representation of words’ referents is necessary for language comprehension, at least not to the same degree as that activated during actual object recognition or seeing-in experiences (Calzavarini 2017; Mahon and Caramazza 2008, 2009).Footnote 12 In addition, even if a portion of the visual cortex that underpins picture perception is re-activated during meaning experiences, this visual activation mainly involves top-down rather than bottom-up cognitive mechanisms, unlike pictorial perception.Footnote 13

The existence of brain-damaged patients with profound deficits in seeing-in experiences but relatively intact meaning experiences, and vice versa, also argues against a complete overlap between the neural architectures underlying these two experiences. For example, patients with auditory verbal agnosia (Buchman et al. 1986) and deep dyslexia (Coltheart et al. 1980) are impaired on (auditory or visual) verbal understanding tasks but perform normally on pictorial perception tasks. Conversely, in several cases (Farah 2004), patients with visuoperceptual impairments showed severe pictorial impairments but achieved a normal level of verbal understanding on both spoken and written verbal tasks. Critically, several cases of semantic dementia patients have been observed whose temporal lobe atrophy was significantly more marked either on the left hemisphere or on the right hemisphere, and whose performance was disproportionately impaired in either seeing-in experiences or meaning experiences (for a review, Gainotti 2012). In general, patients with left hemisphere atrophy tend to perform significantly worse on semantic tasks involving linguistic expressions as compared to pictures, while patients with right hemisphere atrophy tend to show the opposite pattern (e.g., Lambon Ralph and Howard 2000; Butler et al. 2009; Mion et al. 2010).

Considering these functional dissociations in accessing meaning from linguistic expressions as compared to pictures, several scholars have hypothesized that multiple semantic stores exist and that the pictorial and verbal access to the semantic system might be neuroanatomically segregated (e.g., Paivio 1986; Gainotti 2012; Hurley et al. 2018). In our opinion, neuroimaging studies are also consistent with this typological difference hypothesis. While several studies have been interpreted as supporting the common semantic system hypothesis, as we have noted above (e.g., Moore and Price 1999), all of these studies have reported some specific effects for seeing-in experiences and meaning experiences in addition to the regions of common activation, with a clear asymmetry between left and right hemispheres. On the one hand, meaning experiences have been associated with selective activation of the left superior and middle temporal lobes. On the other hand, seeing-in experiences increase activation in some ventral temporal regions of the right hemisphere, particularly the posterior and middle sections of the fusiform gyrus (e.g., Vandenberghe et al. 1996).Footnote 14

Given the above dissimilarities, meaning experiences and seeing-in experiences are better construed as not being the same kind of experience at the cognitive level. Specifically, empirical data are consistent with a functional and anatomical differentiation in the ways pictures and linguistic expressions access their respective high-level experiential folds, too hastily hypothesized to be common. But there is more than that. In our opinion, and this is our fundamental point here, there is evidence that, unlike verbal expressions, pictures access their high-level experiential folds via perceptual and recognitional cognitive resources.

In order to grasp this point, first consider that a long tradition of behavioral studies (e.g., Paivio 1986) and studies using the electrophysiological (ERP) technique (e.g., Leonardelli et al. 2019; Shaul and Rom 2019) has experimentally demonstrated that pictorial stimuli contact the semantic system more readily than linguistic expressions. This “picture superiority” effect is generally believed to be an established finding in the literature on semantic memory activation. For instance, ERP studies that have directly contrasted meaning experiences and seeing-in experiences have reported that conceptual access for linguistic expressions is delayed by about 90 msec with respect to pictures (Leonardelli et al. 2019). As noted by Shaul and Rom, «the main processing of pictures happens during the first 300 msec, while the subject perceives the visual features of the figure. This processing may be enough to reach the meaning in pictures, but words need additional processing which happens later (between 400 and 500 msec) in order to reach the semantic presentation of the word» (2019: 249). This timing profile suggests that, although both seeing-in experiences and meaning experiences appear to be immediate at the phenomenological level, significant differences exist at the cognitive level: reading off the pictorial subject from a picture’s vehicle is relatively faster, in cognitive terms, than grasping the meaning of a linguistic expression in the proper meaning experiential fold. Accordingly, several scholars have argued that pictures have a faster and more direct (“privileged”) access to their high-level semantic fold, while words and sentences require additional translation at the representational level before accessing the semantic system (e.g., Hillis and Caramazza 1990).Footnote 15

Admittedly, a defender of the typological commonality might insist that these findings by themselves do not conclusively establish that seeing-in experiences are perceptual in nature. To be sure, immediacy is merely a necessary but not sufficient condition for an experience to be perceptual (Martina and Voltolini 2017; Nes 2016). Yet, critically, unlike in meaning experiences, in seeing-in experiences semantic access is mediated by neural structures that have been independently associated with perceptual recognition.

Let us analyze this point in more detail. As outlined above, there is a clear left-right hemisphere asymmetry in the neural underpinnings of the two kinds of experiences. As is known, a dominant view that emerges from decades of experimental research in neuropsychology and neuroscience is that the left hemisphere is specialized for amodal and language processes, whilst the right hemisphere is specialized for visual object recognition (e.g., Gazzaniga 2000). This general trend reinforces the conjecture that, unlike meaning experiences, the overall seeing-in experience is perceptual in character.

Fig. 3 Standard anatomical parcellation of the posterior section of the human brain. The fusiform gyrus is in light grey (from Ahveninen et al. 2012)

More specific evidence comes from neuroimaging studies. On the one hand, as we have seen, semantic processing of pictorial stimuli selectively activates the right fusiform gyrus (e.g., Bright et al. 2004), a region in the secondary visual cortex which is known to be involved in the processing of high-level visual information (Palejwala et al. 2020). Since a focal lesion in this area appears to be sufficient for generating visual recognition disorders (Konen et al. 2011), it has been suggested that the right fusiform gyrus is the main cortical substrate of the structural description system (Zannino et al. 2011) (fig. 3). According to most models of visual processing (Marr 1982; Humphreys and Forde 2001), the «structural description system represents the highest level in the visual processing stream, where incoming percepts match structural representations before accessing the semantic system» (Zannino et al. 2011: 2878). On the other hand, the semantic processing of linguistic expressions selectively engages some traditional language areas of the left temporal lobes, such as the posterior middle temporal lobe (e.g., Vandenberghe et al. 1996). According to an influential neurocognitive model of language comprehension (Hickok and Poeppel 2007), this cerebral region serves as an associative, non-perceptual interface that «maps between phonological-level representations of words or morphological roots and distributed conceptual representations» (Hickok and Small 2015: 304).Footnote 16

To sum up: empirical knowledge from cognitive neuroscience appears to vindicate the phenomenologically-based philosophical considerations we have provided in Sect. 2. From the cognitive point of view, a seeing-in experience is typologically different from a meaning experience because of its specific perceptual way of being a recognitional experience.

4 Unlike in Meaning Experiences, in Seeing-in Experiences it is Possible to Read off the High-Level Content Because of Their Perceptual (Neural) Basis

In light of their putative perceptual nature, one might expect that the link between the first and the second fold in seeing-in experiences is more robust and less susceptible to brain damage than the corresponding link in meaning experiences. Only seeing-in experiences, we have argued, are proper fusion experiences. Interestingly enough, neuropsychological evidence appears to provide some support for this conjecture. On the one hand, in meaning experiences, the access to the proper meaning fold can sometimes be impaired after brain damage without this affecting the perception of either phonological or morpho-syntactic properties (a morpho-syntactically enriched expression fold without the proper meaning fold). A notable example of this condition is provided by patients affected by transcortical sensory aphasia, a neuropsychological syndrome that is supposed to «result from a one-way disruption between left hemisphere phonology and lexical–semantic processing» (Boatman et al. 2000: 1634). On the other hand, it is very rare that a brain-damaged patient knows that a certain object is a picture (in its 3D organization) without being able to access what the picture represents, i.e., the picture’s subject (a configurational fold without the recognitional fold).Footnote 17 This observation supports the philosophical intuition that, unlike in meaning experiences, in seeing-in experiences the high-level aspect is not juxtaposed to the low-level aspect, but is intimately connected to it.

In our opinion, the typological difference between seeing-in experiences and meaning experiences is further supported by an analysis of the neural underpinnings of perceptual ambiguity and lexical ambiguity. As we will see, such an analysis clearly suggests that the two processes are differentiated in the human brain.

Against this hypothesis, a defender of the typological commonality might immediately reply that the perception of ambiguous figures such as the Necker cube or the Rubin vase, on the one hand, and the perception of lexically ambiguous words such as “bank” or “pole”, on the other, tend to activate a similar network of high-order neural structures in the frontal, temporal, and parietal lobes (for reviews, Brascamp et al. 2018 and Vitello and Rodd 2015, respectively). The inferior frontal gyrus, a neural structure that has been implicated in attentive and executive functions in many studies, has consistently shown increased activation for both ambiguous pictures (e.g., Knapen et al. 2011) and ambiguous words (e.g., Rodd et al. 2005). This region has been indicated as one of the most likely candidates for playing a critical role in both perceptual transitions and lexical ambiguity resolution, suggesting a close relationship between the top-down cognitive resources necessary for shifting between different semantic readings of words and pictures. This commonality is also suggested, one might argue, by the advantages of bilingual children in understanding ambiguous figures (e.g., Bialystok and Shapero 2005; Chung-Fat-Yim et al. 2017). This advantage might indicate the existence of common selection/inhibition attentional processes involved in both picture perception and language understanding.

Yet, these similarities notwithstanding, the neural underpinnings of ambiguous pictures and ambiguous words are clearly dissociated, with a significant hemispheric asymmetry characterizing the fronto-temporo-parietal network involved in the two processes. On the one hand, seeing-in experiences involving ambiguous pictures tend to activate right hemisphere regions (Brascamp et al. 2018). On the other hand, meaning experiences with ambiguous words are clearly left-lateralized (e.g., Hoffman and Tamm 2020).

More importantly, there is evidence that pictorial ambiguity belongs to the broader class of perceptual phenomena, while lexical ambiguity is better considered a full-fledged high-order cognitive process. This evidence reinforces the intuition that, unlike meaning experiences, seeing-in experiences have a perceptual nature.

To illustrate this claim, we may first rely on evidence about the time course of the cognitive shifting between different readings of ambiguous pictures and ambiguous words. On the one hand, studies using the ERP technique have revealed an early neural signal correlated with endogenous reversals of ambiguous pictures, called “reversal positivity”, which appears 130 msec after stimulus onset at occipital positions, where the early visual cortex is located (review in Kornmeier and Bach 2012). The existence of this early neural signal, which has been observed for a range of ambiguous pictures such as the Necker cube, the Necker lattice, and the Old/Young Woman, strongly suggests that perceptual reversals can be initiated during the first visual processing step, although high-order cognitive processes can modulate them at later stages (Abdallah and Brooks 2020).

This timing profile is certainly compatible with the involvement of perceptual, bottom-up mechanisms in seeing-in experiences with ambiguous pictures. As is well known, the existence of passive, sensory-like cognitive processes in ambiguous picture perception is confirmed by a number of traditional findings, such as the observed patterns of reversals over time (which suggest the automaticity and fatigue-like nature of this process), or the presence of adaptational effects in perceptual ambiguity (for a review, Long and Toppino 2004). On certain accounts, alternations in ambiguous figures result from mutual inhibition/suppression processes between separate pools of neurons located in the visual cortex, each representing the information pertaining to one of the two (or more) perceptual interpretations of those figures (e.g., Toppino and Long 1987).Footnote 18 This might explain why, as noted by Block (2014: 567), alternate experiences in ambiguous figure perception are characterized by exclusivity (they are not given simultaneously), inevitability (one way of seeing the faced object will eventually replace the other), and randomness (the duration of one alternative experience is not a function of previous durations). A minimal simulation of such inhibition/adaptation dynamics is sketched below.
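The toy model below, in Python, illustrates how such inhibition/adaptation dynamics can generate Block’s three characteristics. It is a sketch in the spirit of mutual-inhibition accounts, not a reimplementation of any cited study; all parameter values and the noise level are our own assumptions.

```python
# Minimal mutual-inhibition model of perceptual alternation: two neural
# pools, each standing for one interpretation of an ambiguous figure,
# inhibit each other while slowly adapting. Parameters are illustrative.
import numpy as np

rng = np.random.default_rng(1)
dt, steps = 1.0, 20000                # 1 ms time step, 20 s of simulated viewing
tau_r, tau_a = 20.0, 2000.0           # fast activity vs. slow adaptation
beta, phi, I = 3.0, 3.0, 1.0          # inhibition, adaptation strength, input
r = np.array([0.5, 0.0])              # activity of the two interpretation pools
a = np.zeros(2)                       # adaptation level of each pool

f = lambda x: np.maximum(x, 0.0)      # threshold-linear response function
dominant = np.empty(steps, dtype=int)
for t in range(steps):
    # Equal input I to both pools: the stimulus itself is ambiguous.
    drive = I - beta * r[::-1] - phi * a + rng.normal(0.0, 0.05, 2)
    r += dt / tau_r * (-r + f(drive))  # each pool suppresses the other...
    a += dt / tau_a * (-a + r)         # ...while its own adaptation builds up
    dominant[t] = 0 if r[0] > r[1] else 1

switches = int((dominant[1:] != dominant[:-1]).sum())
print(f"dominance switches in 20 s of simulated viewing: {switches}")
```

With equal input to both pools, the model settles into alternating dominance: exclusivity falls out of the mutual inhibition, inevitability out of the build-up of adaptation, and the stochastic dominance durations out of the noise term.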

On the other hand, empirical data show that the shifting between different meaning folds of the same word is a significantly slower process. The standard conception in the lexical ambiguity literature is that, when one listens to an ambiguous word, the different meanings are simultaneously accessed, and a first semantic selection is made starting from about 200 msec after stimulus onset (Vitello and Rodd 2015). If new information is acquired that is inconsistent with this interpretation, the word must be reinterpreted. Experimental research suggests that semantic reinterpretation is a cognitively demanding process, as demonstrated by various behavioral processing costs (e.g., Rodd et al. 2010). It is commonly believed that these costs are associated with the longer times needed to inhibit the contextually inappropriate meaning of the ambiguous word (for example, the dog-meaning of “bark”) and to (re)activate the contextually appropriate one (e.g., the tree-meaning). ERP studies have demonstrated that this shifting in meaning experiences starts at least 800 msec after the onset of the disambiguating cue (MacGregor et al. 2020). Indeed, this timing profile is not compatible with the involvement of sensory, bottom-up mechanisms in the semantic processing of ambiguous words. Accordingly, dominant models of lexical ambiguity resolution (e.g., Duffy et al. 2001) postulate a combination of higher-order, top-down factors involved in meaning selection and semantic reinterpretation, such as contextual knowledge or knowledge about meaning frequency. This timing profile is also at odds with the idea that meaning experiences are characterized by a specifically perceptual form of immediacy (Nes 2016; Brogaard 2018).

Again, a defender of the typological commonality might insist that these findings by themselves do not conclusively establish that seeing-in experiences with ambiguous figures are a perceptual phenomenon. As we have said, immediacy is merely a necessary but not sufficient condition for an experience to be perceptual.

Yet, numerous neuroimaging studies using standard, univariate fMRI have demonstrated that neural activity in both the primary and the secondary visual cortex correlates with the content of alternative interpretations of ambiguous pictures in seeing-in experiences (review in Sterzer et al. 2009). To take a paradigmatic case, when subjects are presented with the Rubin vase in the fMRI scanner, visual regions in the fusiform gyrus that are known to be selective for faces (e.g., the “fusiform face area”) show increased activation during face interpretations as compared to vase interpretations (Andrews et al. 2002). Similarly, studies using magnetoencephalography (MEG) have demonstrated that behavioral reports of alternative face and vase interpretations of the Rubin vase correlate with activity in the early visual cortex (Parkkonen et al. 2008).

Fig. 4 The experimental paradigm in the study of Wang et al. (2017). See the text for details

In principle, these correlations might be driven by factors other than the picture’s content (for example, the greater visual effort required by processing faces as compared to objects). This is because neuroimaging techniques such as univariate fMRI or MEG are only sensitive to quantitative variations in the hemodynamic or electrical activity of the brain, and not to neural information per se.Footnote 19 In a recent study, however, Wang et al. (2017) used multivoxel pattern analysis (MVPA; see Norman et al. 2006) to further explore the hypothesis that visual regions, in their activity patterns, carry information about the fluctuating content during the perception of ambiguous pictures. In the experimental condition (ambiguous condition), the subjects were presented with the Rubin vase and were asked to report, by pressing one of two buttons, any alternation between face and vase interpretations as soon as it was perceived. In the control condition (unambiguous condition), the subjects were presented with unambiguous black and white photographs of faces and vases (fig. 4). The results of this study confirm that activity patterns in the early visual cortex and the face-selective regions in the fusiform gyrus are sufficient to discriminate between facewise and vasewise segmentations of the Rubin vase. In other words, it is possible to use activity patterns in these visual regions to predict which of the two alternative perceptual contents (face or vase) is activated.
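To convey the logic of this MVPA step, here is a schematic sketch of pattern decoding. It does not reproduce Wang et al.’s pipeline: the voxel patterns are simulated, and the trial counts, voxel counts, signal structure, and classifier choice are all illustrative assumptions.

```python
# Schematic MVPA decoding: can a classifier trained on multivoxel activity
# patterns predict which interpretation (face vs. vase) was perceived?
# The data here are simulated; a real analysis would use preprocessed fMRI
# patterns from visual regions of interest.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_trials, n_voxels = 80, 200          # hypothetical trial and voxel counts

# Labels from button presses: 0 = vase interpretation, 1 = face interpretation
labels = rng.integers(0, 2, n_trials)

# Simulated patterns: a weak signal distributed across voxels separates the
# two interpretations on top of trial-by-trial noise.
signal = rng.normal(0.0, 1.0, n_voxels)
patterns = rng.normal(0.0, 1.0, (n_trials, n_voxels)) \
    + 0.15 * np.outer(labels - 0.5, signal)

# Cross-validated decoding: accuracy reliably above chance means the activity
# patterns carry information about the currently perceived content.
scores = cross_val_score(LinearSVC(), patterns, labels, cv=5)
print(f"decoding accuracy: {scores.mean():.2f} (chance = 0.50)")
```

Above-chance cross-validated accuracy, rather than a mere difference in overall activation level, is what licenses content attributions of this kind.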

The considerations above suggest that, in seeing-in experiences, the recognitional fold can emerge from neural activity in the perceptual regions of the brain. As regards meaning experiences, this does not seem to be the case.

Hoffman and Tamm (2020), for instance, have recently used a combination of univariate and multivariate (MVPA) fMRI analyses to investigate the brain regions involved in the processing of balanced ambiguous words (that is, ambiguous words for which neither meaning is strongly dominant over the other). The results of the multivariate analysis showed that different frontal and temporal regions in the left hemisphere could discriminate between presentations of the same word in different semantic contexts (e.g., “bark” following “tree” vs. “bark” following “dog”). Thus, neural activity in these areas could be used to reliably predict which of the ambiguous word’s proper meanings was grasped by the subjects. Critically, however, all of these regions are supposed to be high-level, associative nodes distant from the primary sensory cortices. For instance, one of the regions highlighted in the study was the left anterior temporal lobe. This region is modulated by conceptual processing independently of the input modality (Lambon Ralph et al. 2017a, b). Due to its multimodal neurofunctional profile, it has been suggested that the anterior temporal lobe constitutes a supramodal or amodal “hub” where conceptual information is distilled and represented in non-modal form (Patterson et al. 2007).

Again, empirical knowledge from cognitive neuroscience appears to vindicate the phenomenologically-based philosophical considerations we have provided in Sect. 2. Only in seeing-in experiences, we have seen, is it possible for the high-level fold to emerge from its perceptual (neural) basis. This reinforces the intuition that, unlike the recognitional fold, the second fold of a meaning experience is definitely experiential, but not perceptual, in character.

5 Conclusion

To sum up: both phenomenological considerations and the available data in cognitive neuroscience support the claim that, although they seem very similar, seeing-in experiences and meaning experiences are typologically different. Only when understood in a pictorial way do representations elicit a specific perceptual phenomenology and recruit specific perceptual resources of the brain. Indeed, as Goodman (1968) originally suggested, there is nothing in a representation itself that makes it pictorial or non-pictorial (verbal); everything depends on the representational system it is understood to belong to. In order to illustrate this point, let us consider the following nominal silhouette (cf. Voltolini 2015) (fig. 5):

Fig. 5 Hitchcock’s nominal silhouette (Voltolini 2015)

In this arrangement, the mark “Alfred Hitchcock” can be naturally understood either as a word or as a picture of the famous British director. Thus, in the case of nominal silhouettes, the same representation can elicit both a seeing-in experience and a meaning experience. Given the results of this article, one should expect not only that the cerebral areas that respectively implement such experiences are (at least partly) different, but also that only during the pictorial reading of nominal silhouettes is semantic access supported by perceptual regions of the brain. Interestingly, no experiment has yet directly contrasted neural activity during pictorial vs. verbal readings of nominal silhouettes. Further experimental research in this area might shed light on this issue, providing more support for the typological difference between seeing-in experiences and meaning experiences.Footnote 20