Lip reading is the ability to partially understand speech by looking at the speaker's lips. It improves the intelligibility of speech in noise when audio-visual perception is compared with audio-only perception. A recent set of experiments showed that seeing the speaker's lips also enhances sensitivity to acoustic information, decreasing the auditory detection threshold of speech embedded in noise [J. Acoust. Soc. Am. 109 (2001) 2272; J. Acoust. Soc. Am. 108 (2000) 1197]. However, detection is different from comprehension, and it remains to be seen whether improved sensitivity also results in an intelligibility gain in audio-visual speech perception. In this work, we use an original paradigm to show that seeing the speaker's lips enables the listener to hear better and hence to understand better. The audio-visual stimuli used here could not be differentiated by lip reading per se, since they contained exactly the same lip gesture matched with different compatible speech sounds. Nevertheless, the noise-masked stimuli were more intelligible in the audio-visual condition than in the audio-only condition, owing to the contribution of visual information to the extraction of acoustic cues. Replacing the lip gesture with a non-speech visual input that had exactly the same time course, and thus provided the same temporal cues for extraction, removed the intelligibility benefit. This early contribution to audio-visual speech identification is discussed in relation to recent neurophysiological data on audio-visual perception.
We consider a computational model comparing the possible roles of association and simulation in phonetic decoding, demonstrating that these two routes can contain similar information in some communication situations and highlighting situations where their decoding performance differs. We conclude that optimal decoding should involve some sort of fusion of association and simulation in the human brain.
One of the fundamental questions raised by Ruchkin, Grafman, Cameron, and Berndt's (Ruchkin et al.'s) interpretation that there are no distinct specialized neural networks for short-term storage buffers and long-term memory systems is that of the link between perception and memory processes. In this framework, we take the opportunity in this commentary to discuss a specific working memory task involving percept formation, temporary retention, auditory imagery, and the attention-based maintenance of information, that is, the verbal transformation effect.
We agree with MacNeilage's claim that speech stems from a volitional vocalization pathway between the cingulate and the supplementary motor area (SMA). We add the vocal self-monitoring system as the first recruitment of the Broca-Wernicke circuit. SMA control for “frames” is supported by wrong consonant-vowel recurring utterance aphasia and an imaging study of quasi-reiterant speech. The role of Broca's area is questioned in the emergence of “content,” because a primary motor mapping, embodying peripheral constraints, seems sufficient. Finally, we reject a uniquely peripheral account of speech emergence.
Speech is a perceptuo-motor system. A natural computational modeling framework is provided by cognitive robotics, or more precisely speech robotics, which is also based on embodiment, multimodality, development, and interaction. This paper describes the bases of a virtual baby robot, which consists of an articulatory model that integrates the non-uniform growth of the vocal tract, a set of sensors, and a learning model. The articulatory model delivers sagittal contour, lip shape and acoustic formants from seven input parameters that characterize the configurations of the jaw, the tongue, the lips and the larynx. To simulate the growth of the vocal tract from birth to adulthood, a process modifies the longitudinal dimension of the vocal tract shape as a function of age. The auditory system of the robot comprises a “phasic” system for event detection over time, and a “tonic” system to track formants. The model of visual perception specifies the basic lip characteristics: height, width, area and protrusion. The orosensorial channel, which provides the tactile sensation on the lips, the tongue and the palate, is elaborated as a model for the prediction of tongue-palatal contacts from articulatory commands. Learning involves Bayesian programming, which has two phases: a first phase comprising specification of the variables, decomposition of the joint distribution and identification of the free parameters through exploration of a learning set, and a utilization phase, which relies on questions about the joint distribution. Two studies were performed with this system. Each of them focused on one of the two basic mechanisms that ought to be at work in the initial periods of speech acquisition, namely vocal exploration and vocal imitation. The first study attempted to assess infants’ motor skills before and at the beginning of canonical babbling. It used the model to infer the acoustic regions, the articulatory degrees of freedom and the vocal tract shapes that are most likely explored by actual infants according to their vocalizations. The second study aimed to simulate data reported in the literature on early vocal imitation, in order to test whether and how the robot was able to reproduce them and to gain some insights into the actual cognitive representations that might be involved in this behavior. Speech modeling in a robotics framework should contribute to a computational approach to sensori-motor interactions in speech communication, which seems crucial for future progress in the study of speech and language ontogeny and phylogeny.
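The two Bayesian programming phases mentioned in this abstract can be illustrated with a minimal toy sketch. It is not the authors' model: the discrete motor variable M, the discrete sensory variable S, the decomposition P(M, S) = P(M) P(S | M), the Laplace smoothing and the helper names (`explore`, `learn_joint`, `ask_motor_given_sensory`) are all assumptions made here for illustration.

```python
import random
from collections import Counter


def explore(n, n_motor, noise=0.1, seed=0):
    """Hypothetical exploration: random motor commands with noisy sensory outcomes."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        m = rng.randrange(n_motor)
        # Toy assumption: the percept equals the command, except for random noise.
        s = m if rng.random() > noise else rng.randrange(n_motor)
        samples.append((m, s))
    return samples


def learn_joint(samples, n_motor, n_sensory, alpha=1.0):
    """First phase: identify P(M) and P(S | M) from (m, s) pairs, with Laplace smoothing."""
    motor_counts = Counter(m for m, _ in samples)
    pair_counts = Counter(samples)
    total = len(samples)
    p_m = [(motor_counts[m] + alpha) / (total + alpha * n_motor)
           for m in range(n_motor)]
    p_s_given_m = [
        [(pair_counts[(m, s)] + alpha) / (motor_counts[m] + alpha * n_sensory)
         for s in range(n_sensory)]
        for m in range(n_motor)
    ]
    return p_m, p_s_given_m


def ask_motor_given_sensory(p_m, p_s_given_m, s):
    """Utilization phase: answer the question P(M | S = s) using Bayes' rule."""
    unnormalized = [p_m[m] * p_s_given_m[m][s] for m in range(len(p_m))]
    z = sum(unnormalized)
    return [u / z for u in unnormalized]


if __name__ == "__main__":
    samples = explore(500, n_motor=4)
    p_m, p_s_given_m = learn_joint(samples, n_motor=4, n_sensory=4)
    posterior = ask_motor_given_sensory(p_m, p_s_given_m, s=2)
    print([round(p, 3) for p in posterior])
```

In this toy setting, asking for P(M | S = s) plays the role of an imitation question (which command most plausibly produced the heard sound); the actual system described in the abstract uses articulatory and acoustic variables rather than these small discrete ones.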