Bootstrapping the lexicon: A computational model of infant speech segmentation

doi:10.1016/S0010-0277(02)00002-1

Cognition

Volume 83, Issue 2, March 2002, Pages 167-206

Abstract

Prelinguistic infants must find a way to isolate meaningful chunks from the continuous streams of speech that they hear. BootLex, a new model which uses distributional cues to build a lexicon, demonstrates how much can be accomplished using this single source of information. This conceptually simple probabilistic algorithm achieves significant segmentation results on various kinds of language corpora – English, Japanese, and Spanish; child- and adult-directed speech, and written texts; and several variations in coding structure – and reveals which statistical characteristics of the input have an influence on segmentation performance. BootLex is then compared, quantitatively and qualitatively, with three other groups of computational models of the same infant segmentation process, paying particular attention to functional characteristics of the models and their similarity to human cognition. Commonalities and contrasts among the models are discussed, as well as their implications both for theories of the cognitive problem of segmentation itself, and for the general enterprise of computational cognitive modeling.

Introduction

One of the infant's early tasks is to break up continuous streams of speech into more manageable chunks that can be attached to meaning. The problem can be represented schematically:

A successful segmentation – one which locates “words” – is a logically necessary preparation for the more complex language learning which follows. Since each language has different words, and different regularities for word formation, successful segmentation cannot be due to innate knowledge.¹

That the child succeeds in discovering words early and often is clear. According to Mandel, Jusczyk, and Pisoni (1995), infants as young as 4.5 months can distinguish their own names, said in isolation, from other names which are similar in stress pattern (e.g. Joshua vs. Agatha, Brandon vs. Kevin) and prefer them, as shown by significantly longer looking times. At 6 months English-learning children understand “mommy” and “daddy” to refer to their own parents (Tincoff & Jusczyk, 1999). Although there is wide individual variation,² by 1 year 4 months of age most children have a comprehension vocabulary of at least 50 words (Harris & Chasin, 1999).

This first word comprehension, or “the child's dawning appreciation of some of the conventional meaning units of the adult language” (Vihman, 1996, p. 122), is one result of a successful chunking or segmentation process. Various sources of information that the infant might use for word segmentation have been proposed, and behavioral experiments with infants have tested the availability and effectiveness of prosodic information like pauses, stress, and intonational contours,³ phonetic cues to word boundaries,⁴ phonotactics,⁵ and the distribution of sounds in the speech stream,⁶ as well as tests of two or more of these strategies working in combination.⁷ Research in this area has expanded lately to the point where space does not permit a proper review here; for comprehensive surveys, see Jusczyk (1997, 1999) and Aslin, Jusczyk, and Pisoni (1998).

In this paper, I will focus on just one of these sources of information – the distribution of segmental information,⁸ or the relative frequency of sounds and sound clusters, and their tendencies to co-occur with each other and with utterance boundaries. Distributional information comes from observing the frequency of events in the environment, a skill available to even the tiniest infant, and indeed to most non-human animals; for reviews of research on the cognitive effects of frequency, see Hasher and Zacks (1984), Alloy and Tabachnik (1984), and Kelly and Martin (1994). In experiments specific to language stimuli, 8-month-old infants successfully segmented an artificial speech stream based solely on distributional information – frequency and order (Saffran et al., 1996a, 1996b) – and the same stimuli drew similar responses from tamarin monkeys (Hauser, Newport, & Aslin, 2001). The infant experiment has been replicated with naturally spoken syllables (Johnson & Jusczyk, 2001).
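To make "frequency and order" concrete, the following is a minimal sketch of one common way to quantify such information: the forward transitional probability between adjacent syllables. It is offered only as an illustration of the kind of statistic available in an artificial speech stream, not as the method of BootLex or of any model reviewed here. The three trisyllabic words, the word order, the threshold of 0.9, and the function names are all assumptions made up for the example.

```python
from collections import Counter


def transitional_probabilities(syllables):
    """Estimate forward TP(a -> b) = count(a followed by b) / count(a followed by anything)."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])  # the final syllable has no successor
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}


def segment_at_dips(syllables, tps, threshold):
    """Place a boundary wherever the TP between adjacent syllables is below threshold.

    Assumes a non-empty syllable list.
    """
    words, current = [], [syllables[0]]
    for a, b in zip(syllables, syllables[1:]):
        if tps[(a, b)] < threshold:
            words.append("".join(current))
            current = []
        current.append(b)
    words.append("".join(current))
    return words


# Hypothetical artificial language: three made-up words, concatenated without pauses.
LEXICON = {"A": ["tu", "pi", "ro"], "B": ["go", "la", "bu"], "C": ["bi", "da", "ku"]}
ORDER = "ABCBACABACBC"  # arbitrary word order, chosen so no word repeats back to back
stream = [syl for key in ORDER for syl in LEXICON[key]]

tps = transitional_probabilities(stream)
segmented = segment_at_dips(stream, tps, threshold=0.9)
print(segmented)
```

In this toy stream every syllable belongs to exactly one word, so transitions inside a word always have probability 1.0, while transitions across a word boundary are split among the possible following words and fall well below the threshold. Natural speech is far less tidy, which is part of why corpus-based models are needed.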

Here we will be concerned not with the behavioral data, but rather with computational models of the use of distributional cues to segment words. First, this paper describes BootLex, a model of early word segmentation which uses the distribution of segments and pauses to discover word boundaries in several language corpora from three different languages. Second, several previously reported computer models of the same cognitive process are reviewed and compared to BootLex, not only in terms of the usual quantitative measures of effectiveness, but also by contrasting their more global functional characteristics. I hope to show that comparison of models of this small but critical cognitive process can highlight aspects of the problem – both cognitive and computational – that might otherwise be overlooked.
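As a rough illustration of what a lexicon-building distributional segmenter can look like, the sketch below parses each unsegmented utterance with its current lexicon, then feeds the result back into the lexicon. It is a toy under stated assumptions and should not be read as the BootLex procedure: the smoothed log-frequency score, the rule of proposing new chunks by joining adjacent chunks of a parse, the `max_len` cap, and every name in the code are hypothetical choices made for this example.

```python
import math
from collections import Counter


class ToyLexiconSegmenter:
    """Toy incremental segmenter: parse with the current lexicon, then update it."""

    def __init__(self, max_len=10):
        self.counts = Counter()  # chunk -> how often it has been used or proposed
        self.max_len = max_len   # hypothetical cap on chunk length

    def _score(self, chunk, total):
        # Smoothed log relative frequency (an assumption, not a published formula).
        return math.log((self.counts[chunk] + 1) / (total + 1))

    def parse(self, utterance):
        """Return the best-scoring split into single symbols and known chunks."""
        total = sum(self.counts.values())
        n = len(utterance)
        best = [0.0] + [-math.inf] * n  # best[i]: best score for utterance[:i]
        back = [0] * (n + 1)            # back[i]: start index of the last chunk
        for end in range(1, n + 1):
            for start in range(max(0, end - self.max_len), end):
                chunk = utterance[start:end]
                # Single symbols are always allowed; longer chunks must be known.
                if len(chunk) > 1 and chunk not in self.counts:
                    continue
                score = best[start] + self._score(chunk, total)
                if score > best[end]:
                    best[end], back[end] = score, start
        chunks, end = [], n
        while end > 0:
            chunks.append(utterance[back[end]:end])
            end = back[end]
        return chunks[::-1]

    def learn(self, utterance):
        """Parse one utterance, then feed the result back into the lexicon."""
        chunks = self.parse(utterance)
        self.counts.update(chunks)
        # Propose each pair of adjacent chunks as a new, longer candidate.
        for left, right in zip(chunks, chunks[1:]):
            if len(left) + len(right) <= self.max_len:
                self.counts[left + right] += 1
        return chunks


segmenter = ToyLexiconSegmenter()
for utterance in ["thedog", "thecat", "adog", "thedog", "acat", "thecat"]:
    print(utterance, "->", segmenter.learn(utterance))
```

Because a parse with fewer, more frequent chunks scores higher, sequences that recur across utterances gradually come to be treated as units. Whether the units that emerge correspond to words depends heavily on the input and on the scoring assumptions, which is the kind of question the quantitative comparisons in the paper address.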

Section 2 of the paper describes how speech segmentation is modeled by computers, and how the performance of such models has been evaluated quantitatively, and then previews the qualitative characteristics that we will contrast in the several models. Section 3 presents the BootLex algorithm in detail. Section 4 discusses three groups of other computer models, and compares them with BootLex and with each other. Section 5 compares the cognitive plausibility of these models, and considers some broader implications.

Section snippets

Distributional models of infant speech segmentation

A number of computational models of the use of statistical cues for infant speech segmentation have been presented recently. These computer models, including BootLex, are inductive, or self-organizing, algorithms. With the significant exception of the categories implicit in the coded input, they have no linguistic knowledge to begin with. That is, there is no lexicon of known words or knowledge of applicable rules or regularities, such as phonotactics. They can only try to discover any

The BootLex algorithm

Olivier (1968) was the first to create a working probabilistic segmentation routine. His algorithm was a deceptively simple exercise in self-organization, using only letter co-occurrence frequencies to segment utterances into words, and the BootLex model is a new implementation based on his idea.¹³ Because Olivier's algorithm

Other model strategies

A number of computational models of segmentation using other paradigms have been reported recently, falling into three main groups:

(i) Three connectionist networks
(ii) Two algorithms using the minimum description length principle
(iii) Two algorithms based on a formal statistical model called “Model-based dynamic programming” (MBDP)

All these models interpret the cognitive problem of word segmentation similarly, as discussed above, but there are significant differences among them in goals and methods. Each

From computer model to infant cognition

The previous two sections have presented the BootLex algorithm and compared it in some detail with two other groups of models, both in terms of quantitative performance and more global characteristics of design and function. In this final section, we examine the claims of these computational models to be cognitive models – to go beyond the purely engineering goal of an end product that is comparable with that realized by human infants, and also demonstrate similarities in process.

The relation

Conclusion

A new model, BootLex, was shown to be a conceptually simple and effective segmentation procedure. Based on observation of frequently appearing phoneme clusters and their relationship to utterance boundaries, a lexicon was built incrementally and used to recognize words and parse incoming utterances, with the results fed back to further modify the lexicon. The algorithm was tested on a number of corpora with a variety of characteristics. Then, two other groups of models which have been applied

Acknowledgements

The research reported here was conducted as partial fulfillment of the requirements for the degree of Doctor of Philosophy. I thank my thesis supervisor, Virginia Teller, and the members of my committee, Virginia Valian and Martin Chodorow. Portions of this manuscript were written while I was a Foreign Research Fellow of the Japanese Society for the Promotion of Science, appointed on the recommendation of the National Science Foundation, and hosted by Nobuo Ohta at the University of Tsukuba.

References (88)

M.R. Brent. Speech segmentation and word discovery: a computational perspective. Trends in Cognitive Sciences (1999)
M.R. Brent et al. Distributional regularity and phonotactic constraints are useful for segmentation. Cognition (1996)
P. Cairns et al. Bootstrapping word boundaries: a bottom-up corpus-based approach to speech segmentation. Cognitive Psychology (1997)
C.H. Echols et al. The perception of rhythmic units in speech by infants and adults. Journal of Memory and Language (1997)
J.L. Elman. Finding structure in time. Cognitive Science (1990)
L. Gerken et al. When prosody fails to cue syntactic structure: nine-month-olds’ sensitivity to phonological vs. syntactic phrases. Cognition (1994)
M.D. Hauser et al. Segmentation of the speech stream in a non-human primate: statistical learning in cotton-top tamarins. Cognition (2001)
K. Hirsh-Pasek et al. Clauses are perceptual units for young infants. Cognition (1987)
E.K. Johnson et al. Word segmentation by 8-month-olds: when speech cues count more than statistics. Journal of Memory and Language (2001)
P.W. Jusczyk. Constraining the search for structure in the input. Lingua (1998)
R.N. Aslin et al. Speech and auditory processing during infancy
R.N. Aslin et al. Computation of conditional probability statistics by 8-month-old infants. Psychological Science (1998)
R.N. Aslin et al. Models of word segmentation in fluent maternal speech to infants
Batchelder, E. O. (1997). Computational evidence for the use of frequency information in discovery of the infant's...
N. Bernstein Ratner. From ‘signal to syntax’: but what is the nature of the signal?
M.R. Brent. An efficient, probabilistically sound algorithm for segmentation and word discovery. Machine Learning (1999)
Brent, M. R., & Siskind, J. M. (2000). The role of exposure to isolated words in early vocabulary development. NECI TR...
R. Brown. A first language (1973)
P. Cairns et al. Lexical segmentation: the role of sequential statistics in supervised and un-supervised models
T.A. Cartwright et al. Segmenting speech without a lexicon: evidence for a bootstrapping model of lexical acquisition
E. Charniak. Statistical language learning (1993)
M.H. Christiansen et al. Learning to segment speech using multiple cues: a connectionist model. Language and Cognitive Processes (1998)
A. Christophe et al. Do infants perceive word boundaries? An empirical study of the bootstrapping of lexical acquisition. Journal of the Acoustical Society of America (1994)
A. Cleeremans. Mechanisms of implicit learning: connectionist models of sequence processing (1993)
D. Crystal. The Cambridge encyclopedia of language (1987)
D. Dahan et al. On the discovery of novel wordlike units from utterances: an artificial-language study with implications for native-language acquisition. Journal of Experimental Psychology: General (1999)

