Introduction

Biosemiotics can be defined as a science of signs in living systems (Kull, 1999, p. 386). Here we join the effort of developing such a science. Focusing on the problem of “learning” new signs, we hope to contribute (i) to place choice at the core of semiotic theory of learning (Kull, 2018) and (ii) to make biosemiotics compatible with the information theoretic perspective that is regarded as currently dominant in physics, chemistry, and molecular biology (Deacon, 2015).

Languages use words to convey information. From a semantic perspective, words stand for meanings (Fromkin et al., 2014). Correlates of word meaning have been investigated in other species (e.g. Hobaiter & Byrne, 2014; Genty & Zuberbühler, 2014; Moore, 2014). From a neurobiological perspective, words can be seen as the counterparts of cell assemblies with distinct cortical topographies Pulvermüller (2001, 2013).

From a formal standpoint, the essence of that research is some binding between a sign or a form, e.g., a word or an ape gesture, and a counterpart, e.g. a ’meaning’ or an assembly of cortical cells. Mathematically, that binding can be formalized as a bipartite graph where vertices are forms and their counterparts (Fig. 1). Such abstract setting allows for a powerful exploration of natural systems across levels of life, from the mapping of animal vocal or gestural behaviors (Fig. 2a) into their “meanings” down to the mapping from codons into amino acids (Fig. 2b) while allowing for a comparison against “artificial” coding systems such as the Morse code (Fig. 2c) or those emerging in artificial naming games (Hurford, 1989; Steels, 1996). In that setting, almost connectedness has been hypothesized to be the mathematical condition required for the emergence of a rudimentary form of syntax and symbolic reference (Ferrer-Cancho et al., 2005, 2006). By symbolic reference, we mean here Deacon’s revision of Pierce’s view (Deacon, 1997). The almost connectedness condition is met when it is possible to reach practically any other vertex of the network by starting a walk from any possible vertex (as in Fig. 1a and b but not in Fig. 1c and d).

Fig. 1
figure 1

A bipartite graph linking forms (white circles) with their counterparts (black circles). a a connected graph b an almost connected graph c a one-to-one mapping between forms and counterparts d a mapping where only one form is linked with counterparts

Fig. 2
figure 2

Real bipartite graphs linking forms (white circles) with their counterparts (black circles). a Chimpanzee gestures and their meaning (Hobaiter & Byrne, 2014, Table S3). This table was chosen for its broad coverage of gesture types (see other tables satisfying other constraints, e.g. only gesture-meaning associations employed by a sufficiently large number of individuals). b Codon translation into amino acids, where forms are 64 codons and counterparts are 20 amino acids c The international Morse code, where forms are strings of dots and dashed and the counterparts are letters of the English alphabet (A,B,...,Z) and digits (0,1,...,9)

Since the pioneering research of G. K. Zipf (1949), statistical laws of language have been interpreted as manifestations of the minimization of cognitive costs (Zipf, 1949; Ellis and Hitchcock, 1986; Gustison et al., 2016; Ferrer-i-Cancho & Díaz-Guilera, 2007, 2019). Zipf argued that the law of abbreviation, the tendency of more frequent words to be shorter, resulted from a minimization of a cost function involving, for every word, its frequency, its “mass” and its “distance”, which in turn implies the minimization of the size of words (Zipf, 1949, p.59). Recently, it as been shown mathematically that the minimization of the average of the length of words (the mean code length in the language of information theory) predicts a correlation between frequency and duration that cannot be positive, extending and generalizing previous results from information theory (Ferrer-i-Cancho et al., 2019). The framework addresses the general problem of assigning codes as short as possible to counterparts represented by distinct numbers while warranting certain constraints, e.g., that every number will receive a distinct code (e.g. non-singular coding in the language of information theory). If the counterparts are word types from a vocabulary, it predicts the law of abbreviation as it occurs in the vast majority of languages (Bentz & Ferrer-i-Cancho, 2016). If these counterparts are meanings, it predicts that more frequent meanings should tend to be assigned smaller codes (e.g., shorter words) as found in real experiments (Kanwal et al., 2017; Brochhagen, 2021). Table 1 summarizes these and other predictions of compression.

Table 1 The application of the scientific method in quantitative linguistics (italics) with various concrete examples (roman)

A family of probabilistic models

The bipartite graph of form-counterpart associations is the skeleton (Figs. 1 and 2) on which a family of models of communication has been built (Ferrer-i-Cancho and Díaz-Guilera 2007, 2018). The target of the first of these models (Ferrer-i-Cancho & Sole, 2003) was Zipf’s rank-frequency law, that defines the relationship between the frequency of a word f and its rank i, approximately as

$$ f \approx i^{-\alpha}. $$

These early models were aimed at shedding light on mainly three questions:

  1. 1.

    The origins of this law (Ferrer-i-Cancho & Sole, 2003, 2005b).

  2. 2.

    The range of variation of α in human language (Ferrer & Cancho, 2005a, 2006).

  3. 3.

    The relationship between α and the syntactic and referential complexity of a communication system (Ferrer-i-Cancho et al., 2005, 2006).

The main assumption of these models is that word frequency is an epiphenomenon of the structure of the skeleton or the probability of the meanings. Following the metaphor of the skeleton, the models are bodies whose flesh are probabilities that are calculated from the skeleton. The first models defined p(si|rj), the probability that a speaker produces si given a counterpart rj, as the same for all words connected to rj. In the language of mathematics,

$$ p(s_{i} | r_{j}) = \frac{a_{ij}}{\omega_{j}}, $$
(1)

where aij is a boolean (0 or 1) that indicates if si and rj are connected and ωj is the degree of rj, namely the number of connections of rj with forms, i.e.

$$ \omega_{j} = \sum\limits_{i} a_{ij}. $$

These models are often portrayed as models of the assignment of meanings to forms (Futrell, 2020; Piantadosi, 2014) but this description falls short because

  • They are indeed models of production as they define the probability of producing a form given some counterparts (as in Eq. 1) or simply the marginal probability of a form. The claim that theories of language production or discourse do not explain the law (Piantadosi, 2014) has no basis and raises the questions of which theories of language production are deemed acceptable.

  • They are also models of understanding, as they define symmetric conditional probabilities such as p(rj|si), the probability that a listener interprets rj when receiving si.

  • The models are flexible. In addition to “meaning”, other counterparts were deemed possible from their birth. See for instance the use of the term “stimuli” (Ferrer-i-Cancho & Díaz-Guilera, 2007, e.g.), as a replacement for meaning that was borrowed from neurolinguistics (Pulvermüller, 2001).

  • The models fit in the distributional semantics framework (Lund & Burgess, 1996) for two reasons: their flexibility, as counterparts can be dimensions in some hidden space, and also because of representing a form as a vector of their joint or conditional probabilities with “counterparts” that is inferred from the network structure, as we have already explained (Ferrer-i-Cancho & Vitevitch, 2018).

Contrary to the conclusions of Piantadosi (2014), there are derivations of Zipf’s law that do account for psychological processes of word production, especially the intentionality of choosing words in order to convey a desired meaning.

The family of models assume that the skeleton that determines all the probabilities, the bipartite graph, is shaped by a combination of minimization of the entropy (or surprisal) of words (H) and the maximization of the mutual information between words and meanings (I), two principles that are cognitively motivated and that capture speaker and listener’s requirements (Ferrer-i-Cancho, 2018). When only the entropy of words is minimized, configurations where only one form is linked as in Fig. 1d are predicted. When only the mutual information between forms and counterparts is maximized, one-to-one mappings between forms and counterparts are predicted (when the number of forms and counterparts is the same) as in Figs. 1c or 2d. Real language is argued to be in-between these two extreme configurations (Ferrer-i-Cancho and Díaz-Guilera, 2007). Such a trade-off between simplicity (Zipf’s unification) and effective communication (Zipf’s diversification) is also found in information theoretic models of communication based on the information bottleneck approach (see Zaslavsky et al. (2021) and references there in).

In quantitative linguistics, scientific theory is not possible without taking into consideration language laws (Köhler, 1987; Debowski, 2020). Laws are seen as manifestations of principles (also referred as “requirements” by Köhler 1987), which are key components of explanations of linguistic phenomena. As part of the scientific method cycle, novel predictions are key aim (Altmann, 1993) and key to validation and refinement of theory (Bunge, 2001). Table 1 synthesizes this general view as chains of the form: laws, principles that are inferred from them, and predictions that are made from those principles, giving concrete examples from previous research.

Although one of the initial goals of the family of models was to shed light on the origins of Zipf’s law for word frequencies, a member of the family of models turned out to generate a novel prediction on vocabulary learning in children and the tendency of words to contrast in meaning (Ferrer-i-Cancho, 2017a): when encountering a new word, children tend to infer that it refers to a concept that does not have a word attached to it (Markman and Wachtel, 1988; Merriman & Bowman, 1989; Clark, 1993). The finding is cross-linguistically robust: it has been found in children speaking English (Markman & Wachtel, 1988), Canadian French (Nicoladis & Laurent, 2020), Japanese (Haryu, 1991), Mandarin Chinese (Byers-Heinlein & Werker, 2013; Hung et al., 2015), Korean (Eun-Nam, 2017). These languages correspond to four distinct linguistic families (Indo-European, Japonic, Sino-Tibetan, Koreanic). Furthermore, the finding has also been replicated in adults (Hendrickson & Perfors, 2019; Yurovsky & Yu, 2008) and other species (Kaminski et al., 2004). This phenomenon is a example of biosemiosis, namely a process of choice-making between simultaneously alternative options (Kull, 2018, p. 454).

As an explanation for vocabulary learning, the information theoretic model suffers from some limitations that motivate the present article. The first one is that the vocabulary learning bias weakens in older children (Kalashnikova et al., 2016; Yildiz, 2020) or in polylinguals (Houston-Price et al., 2010; Kalashnikova et al., 2015), while the current version of the model predicts the vocabulary learning bias only provided that mutual information maximization is not neglected (Ferrer-i-Cancho, 2017a).

The second limitation is inherited from the family of models, where the definition of the probabilities over the bipartite graph skeleton leads to a linear relationship between the frequency of a form and its number of counterparts (Ferrer-i-Cancho & Vitevitch, 2018). However, this is inconsistent with Zipf’s prediction, namely that the number of meanings μ a word of frequency f should follow (Zipf, 1945)

$$ \mu \approx f^{\delta}, $$
(2)

with δ = 0.5. Equation 2 is known as Zipf’s meaning-frequency law (Zipf, 1949). To overcome such a limitation, Ferrer-i-Cancho and Vitevitch (2018) proposed different ways of modifying the definition of the probabilities from the skeleton. Here we borrow a proposal of defining the joint probability of a form and its counterpart as

$$ p(s_{i}, r_{j}) \propto a_{ij}(\mu_{i}\omega_{j})^{\phi}, $$
(3)

where ϕ is a parameter of the model and μi and ωj are, respectively, the degree (number of connections) of the form si and the counterpart rj. Previous research on vocabulary learning in children with these models (Ferrer-i-Cancho, 2017a) assumed ϕ = 0, which leads to δ = 1 (Ferrer-i-Cancho, 2016b). When ϕ = 1, the system is channeled to reproduce Zipf’s meaning-frequency law, i.e. Equation 2 with δ = 0.5 (Ferrer-i-Cancho & Vitevitch, 2018).

Overview of the present article

It has been argued that there cannot be meaning without interpretation (Eco, 1986). As Kull (2020) puts it, “Interpretation (which is the same as primitive decision-making) assumes that there exists a choice between two or more options. The options can be described as different codes applicable simultaneously in the same situation.” The main aim to of this article is to shed light on the choice between strategy a, i.e. attaching the new form to a counterpart that is unlinked, and strategy b, i.e. attaching the new form to a counterpart that is already linked (Fig. 3).

Fig. 3
figure 3

Strategies for linking a new word to a meaning. Strategy a consists of linking a word to a free meaning, namely an unlinked meaning. Strategy b consists of linking a word to a meaning that is already linked. We assume that the meaning that is already linked is connected to a single word of degree μk. Two simplifying assumptions are considered. a Counterpart degrees do not exceed one, implying μk ≥ 1. b Vertex degrees do not exceed one, implying μk = 1

The remainder of the article is organized as follows. Section “The Mathematical Model” considers a model of a communication system that has three components:

  1. 1.

    A skeleton that is defined by a binary matrix A that indicates the form-counterpart connections.

  2. 2.

    A flesh that is defined over the skeleton with Eq. 3,

  3. 3.

    A cost function, that defines the cost of communication as

    $$ {\Omega} = -\lambda I + (1-\lambda)H, $$
    (4)

    where λ is a parameter that regulates the weight of mutual information (I) maximization and word entropy (H) minimization such that 0 ≤ λ ≤ 1. I and H are inferred from matrix A and Eq. 3 (further details are given in “The Mathematical Model”).

This section introduces Δ, i.e. the difference in the cost of communication between strategy a and strategy b according to Ω (Fig. 3). Δ < 0 indicates that the cost of communication of strategy a is lower than that of b. Our main hypothesis is that interpretation is driven by the Ω cost function and that a receiver will choose the option that minimizes the resulting Ω. By doing this, we are challenging the longstanding and limiting belief that information theory is dissociated from semiotics and not concerned about meaning (Deacon, 2015, e.g.). This article is a just one counterexample (see also Zaslavsky et al. (2018)). Information theory, as any abstract powerful mathematical tool, can serve applications that do not assume meaning (or meaning-making processes) as in the original setting of telecommunication where it was developed by Shannon, as well as others that do, although they were not his primary concern for historical and sociological reasons.

In general, the formula of Δ is complex and the analysis of the conditions where a is advantageous (namely Δ < 0) requires making some simplifying assumptions. If ϕ = 0, then one obtains that Ferrer-i-Cancho (2017a)

$$ {\Delta} = -\lambda \frac{(\omega_{j} + 1) \log (\omega_{j} + 1) - \omega_{j} \log(\omega_{j})}{M+1}, $$
(5)

where M is the number of edges in the skeleton and ωj is the degree of the already linked counterpart that is selected in strategy b (Fig. 3). Equation 5 indicates that strategy a will be advantageous provided that mutual information maximization matters (i.e. λ > 0) and its advantage will increase as mutual information maximization becomes more important (i.e. for larger λ), the linked counterpart has more connections (i.e. larger ωj) or when the skeleton has less connections (i.e. smaller M). To be able to analyze the case ϕ > 0, we will examine two classes of skeleta that are presented next.

Counterpart degrees do not exceed one

In this class, the degrees of counterparts are restricted to not exceed one, namely a counterpart can only be disconnected or connected to just one form. If meanings are taken as counterparts, this class matches the view that “no two words ever have exactly the same meaning” (Fromkin et al., 2014, p. 256), based on the notion of absolute synonymy (Dangli and Abazaj, 2009).

This class also mirrors the linguistic principle that any two words should contrast in meaning (Clark, 1987). Alternatively, if synonyms are deemed real to some extent, this class may capture early stages of language development in children or early stages in the evolution of languages where synonyms have not been learned or developed. From a theoretical standpoint, this class is required by the maximization of the mutual information between forms and counterparts when the number of forms does not exceed that of counterparts (Ferrer-i-Cancho & Vitevitch, 2018).

We use μk to refer to degree of the word that will be connected to meaning selected in strategy b (Fig. 3). We will show that, in this class, Δ is determined by λ, ϕ, μk and the degree distribution of forms, namely the vector of form degrees \(\vec {\mu }=(\mu _{1},...,\mu _{i},...\mu _{n})\).

Vertex degrees do not exceed one

In this class, the degrees of any vertex are restricted to not exceed one, namely a form (or a meaning) can only be disconnected or connected to just one counterpart (just one form). This class is narrower than the previous one because it imposes that degrees do not exceed one both for forms and counterparts. Words lack homonymy (or polysemy). We believe that this class would correspond to even earlier stages of language development in children (where children have learned at most one meaning of a word) or earlier stages in the evolution of languages (where the communication system has not developed any homonymy). From a theoretical stand point, that class is a requirement of maximizing mutual information between forms and counterparts when n = m (Ferrer-i-Cancho & Vitevitch, 2018). We will show that Δ is determined just by λ, ϕ and M, the number of links of the bipartite skeleton.

Notice that meanings with synonyms have been found in chimpanzee gestures (Hobaiter & Byrne, 2014), which suggests that the two classes above do not capture the current state of the development of form-counterpart mappings in adults of other species. Section “The Mathematical Model” presents the formulae of Δ for each classes. Section “Results” uses this formulae to explore the conditions that determine when strategy a is more advantageous, namely Δ < 0, for each of the two classes of skeleta above, that correspond to different stages of the development of language in children. While the condition ϕ = 0 implies that strategy a is always advantageous when λ > 0, we find regions of the space of parameters where this is not the case when ϕ > 0 and λ > 0. In the more restrictive class, where vertex degrees do not exceed one, we find a region where a is not advantageous when λ is sufficiently small and M is sufficiently large. The size of that region increases as ϕ increases. From a complementary perspective, we find a region where a is not advantageous (Δ ≥ 0) when λ is sufficiency small and ϕ is sufficiently large; the size of the region increases as M increases. As M is expected to be larger in older children or in polylinguals (if the forms of each language are mixed in the same skeleton), the model predicts the weakening of the bias in older children and polylinguals (Liittschwager & Markman, 1994; Yildiz, 2020; Houston-Price et al., 2010; Kalashnikova et al., 2015, 2016, 2019). To ease the exploration of the phase space for the class where the degrees of counterparts do not exceed one, we will assume that word frequencies follow Zipf’s rank-frequency law. Again, regions where a is not advantageous (Δ ≥ 0) also appear but the conditions for the emergence of this regions are more complex. Our preliminary analyses suggest that the bias should weaken in older children even for this class. Section “Discussion” reviews the findings, suggests future research directions and reviews the research program in light of the scientific method.

The Mathematical Model

Below we give more details about the model that we use to investigate the learning of new words and outlines the arguments that take from Eq. 3 to concrete formulae of Δ. Section “Δ in two Classes of Skeleta” just presents the concrete formulae Δ for each of the two classes of skeleta. Full details are given in the Supplementary Information, Section “S1 The mathematical model in detail”. The model has four components that we review next.

Skeleton (A = a ij)

A bipartite graph that defines the associations between n forms and m counterparts that are defined by an adjacency matrix A = {aij}.

Flesh (p(s i,r j))

The flesh consist of a definition of p(si,rj), the joint probability of a form (or word) and a counterpart (or meaning) and a series of probability definitions stemming from it. Probabilities depart from previous work (Ferrer-i-Cancho & Sole, 2003, 2005b) by the addition of the parameter ϕ. Equation 3 defines p(si,rj) as proportional to the product of the degrees of the form and the counterpart to the power of ϕ, which is a parameter of the model. By normalization, namely

$$ \sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m} p(s_{i}, r_{j}) = 1, $$

Equation 3 leads to

$$ p(s_{i}, r_{j}) = \frac{1}{M_{\phi}} a_{ij} (\mu_{i} \omega_{j})^{\phi}, $$
(6)

where

$$ M_{\phi} = \sum\limits_{i=1}^{n} \sum\limits_{j=1}^{m} a_{ij} (\mu_{i} \omega_{j})^{\phi}. $$
(7)

From these expressions, the marginal probabilities of a form p(si) and a counterpart p(rj) are obtained easily thanks to

$$ \begin{array}{@{}rcl@{}} p(s_{i}) = \sum\limits_{j=1}^{m} p(s_{i}, r_{j})\\ p(r_{j}) = \sum\limits_{i=1}^{n} p(s_{i}, r_{j}). \end{array} $$

The cost of communication (Ω).

The cost function is initially defined in Eq. 4 as in previous research (e.g. Ferrer-i-Cancho & Díaz-Guilera, 2007). In more detail,

$$ {\Omega} = -\lambda I(S,R) + (1 - \lambda) H(S), $$
(8)

where I(S,R) is the mutual information between forms from a repertoire S and counterparts from a repertoire R, and H(S) is the entropy (or surprisal) of forms from a repertoire S. Knowing that I(S,R) = H(S) + H(R) − H(S,R) (Cover & Thomas, 2006), the final expression for the cost function in this article is

$$ {\Omega}(\lambda) = (1-2\lambda) H(S) - \lambda H(R) + \lambda H(S,R). $$
(9)

The entropies H(S), H(R) and H(S,R) are easy to calculate applying the definitions of p(si), p(rj) and p(si,rj), respectively.

The difference in the cost of learning a new word (Δ).

There are two possible strategies to determine the counterpart with which a new form (a previously unlinked form) should connect (Fig. 3):

  • a. Connect the new form to a counterpart that is not already connected to any other forms.

  • b. Connect the new form to a counterpart that is connected to at least one other form.

The question we intend to answer is “when does strategy a result in a smaller cost than strategy b?” Or, in the terminology of child language research, “for which strategy is the assumption of mutual exclusivity more advantageous?” To answer these questions, we define Δ, as a the difference between the cost of each strategy. More precisely,

$$ {\Delta}(\lambda) = {\Omega}^{\prime}_{a}(\lambda) - {\Omega}^{\prime}_{b}(\lambda), $$
(10)

where \({\Omega }^{\prime }_{a}(\lambda )\) and \({\Omega }^{\prime }_{b}(\lambda )\) are the new value of Ω when a new link is created using strategy a or b respectively. Then, our research question becomes “When is Δ < 0?”.

Formulae for \({\Omega }^{\prime }_{a}(\lambda )\) and \({\Omega }^{\prime }_{b}(\lambda )\) are derived in two steps. First, analyzing a general problem, i.e. \({\Omega }^{\prime }\), the new value of Ω after producing a single mutation in A (“S1.2 Change in entropies after a single mutation in the adjacency matrix”, Supplementary Information). Second, deriving expressions for the case where that mutation results from linking a new form (an unlinked form) to a counterpart, that can be linked or unlinked (“S1.3 Derivation of Δ”, Supplementary Information).

Δ in two Classes of Skeleta

In previous work, the value of Δ was already calculated for ϕ = 0, obtaining expressions equivalent to Eq. 5 (see “S1.3.1 The case φ = 0”, Supplementary Information for a derivation). The next sections just summarize the more complex formulae that are obtained for each class of skeleta for ϕ ≥ 0 (“S1 The mathematical model in detail”, Supplementary Information contains details on the derivation).

Vertex degrees do not exceed one

Here forms and counterparts both either have a single connection or are disconnected. Mathematically, this can be expressed as

$$ \begin{array}{@{}rcl@{}} \mu_{i} \in \{0,1\} \text{~for each}\ i\ \text{such that~} 1 \leq i \leq n \\ \omega_{j} \in \{0,1\} \text{~for each}\ j \ \text{such that~} 1 \leq j \leq m. \end{array} $$

Figure 3b offers a visual representation of a bipartite graph of this class. In case b, the counterpart we connect the new form to is connected to only one form (ωj = 1) and that form is connected to only one counterpart (μk = 1). Under this class, Δ becomes

$$ {\kern-7.5pt}{\Delta}(\lambda) = (1 - 2\lambda) \left[ \log \left( 1 + \frac{2(2^{\phi} - 1)}{M + 1} \right) + \frac{2^{\phi+1} \log(2) \phi}{M + 2^{\phi+1} - 1} \!\right] - \lambda \frac{2^{\phi+1} \log(2)}{M + 2^{\phi+1} - 1}, $$
(11)

which can be rewritten as linear function of λ, i.e.

$$ {\Delta}(\lambda) = a \lambda + b, $$

with

$$ \begin{array}{@{}rcl@{}} a &=& 2 \log \left( 1 + \frac{2(2^{\phi} - 1)}{M + 1} \right) - (2 \phi + 1) \frac{2^{\phi+1} \log(2)}{M+2^{\phi+1}-1} \\ b &=& -\log \left( 1 + \frac{2(2^{\phi} - 1)}{M + 1} \right) + \phi \frac{2^{\phi+1} \log(2)}{M+2^{\phi+1}-1}. \end{array} $$

Importantly, notice that this expression of Δ is determined only by λ, ϕ and M (the total number of links in the model). Thorough derivations can be found in “S1.3.3 Vertex degrees do not exceed one”, Supplementary Information.

Counterpart degrees do not exceed one

This class of skeleta is a relaxation of the previous class. Counterparts are either connected to a single form or disconnected. Mathematically,

$$ \omega_{j} \in \{0,1\} \text{ for each } j \text{ such that } 1 \leq j \leq m. $$

Figure 3a offers a visual representation of a bipartite graph of this class. The number of forms the counterpart in case b is connected to is still 1 (ωj = 1) but this form may be connected to any number of counterparts; μk has to satisfy 1 ≤ μkm.

Under this class, Δ becomes

$$ \begin{array}{@{}rcl@{}} {\Delta}(\lambda) &=& (1-2\lambda) \Bigg\{ \log \Bigg(\frac{M_{\phi}+1}{M_{\phi}+\left( 2^{\phi}-1\right)\mu_{k}^{\phi}+2^{\phi}} \Bigg) \\ &&\qquad + \frac{1}{M_{\phi}+\left( 2^{\phi}-1\right)\mu_{k}^{\phi}+2^{\phi}} \Bigg[ (\phi + 1) \frac{X(S,R) (2^{\phi} - 1) (\mu_{k}^{\phi} + 1)}{M_{\phi} + 1} \\ &&\qquad - \phi 2^{\phi} \log(2) + \mu_{k}^{\phi} \Big[ \log(\mu_{k}) (\mu_{k} + \phi) \\ &&\qquad - (\mu_{k} - 1 + 2^{\phi}) \log(\mu_{k} - 1 + 2^{\phi}) \Big] \Bigg] \Bigg\} \\ &&\qquad - \frac{1}{M_{\phi}+\left( 2^{\phi}-1\right)\mu_{k}^{\phi}+2^{\phi}} \Bigg[ \lambda \big(\mu_{k}^{\phi} + 1\big) 2^{\phi} \log\big(\mu_{k}^{\phi} + 1\big) \\ &&\qquad - (1 - \lambda) \phi 2^{\phi} \mu_{k}^{\phi} \log(\mu_{k}) \Bigg], \end{array} $$
(12)

where

$$ \begin{array}{@{}rcl@{}} X(S,R) & = & \sum\limits_{i=1}^{n} \mu_{i}^{\phi+1} \log \mu_{i} \end{array} $$
(13)
$$ \begin{array}{@{}rcl@{}} M_{\phi} & = & \sum\limits_{i=1}^{n} \mu_{i}^{\phi+1}. \end{array} $$
(14)

Equation 13 can also be expressed as a linear function of λ as

$$ {\Delta}(\lambda) = a \lambda + b, $$

with

$$ \begin{array}{@{}rcl@{}} a &=& 2 \log\left( \frac{M_{\phi} + (2^{\phi} - 1) \mu_{k}^{\phi} + 2^{\phi}}{M_{\phi} + 1}\right) \\ & &\qquad - \frac{1}{M_{\phi} + (2^{\phi} - 1) \mu_{k}^{\phi} + 2^{\phi}} \Bigg\{ 2^{\phi} \left[ (\mu_{k}^{\phi} + 1) \log(\mu_{k}^{\phi} + 1) + \phi \mu_{k}^{\phi} \log(\mu_{k}) \right] \\ & &\qquad + 2 \Big[- (\phi + 1) \frac{X(S,R) (2^{\phi} - 1) \mu_{k}^{\phi} + 1}{M_{\phi} + 1} \\ & &\qquad + \phi 2^{\phi} \log(2) - \mu_{k}^{\phi} \big[ \log(\mu_{k}) (\mu_{k} + \phi) - (\mu_{k} - 1 + 2^{\phi}) \log(\mu_{k} - 1 + 2^{\phi}) \big] \Big] \Bigg\} \\ b &=& - \log\left( \frac{M_{\phi} + (2^{\phi} - 1) \mu_{k}^{\phi} + 2^{\phi}}{M_{\phi} + 1}\right) \\ & &\qquad + \frac{1}{M_{\phi} + (2^{\phi} - 1) \mu_{k}^{\phi} + 2^{\phi}} \Bigg\{\phi 2^{\phi} \mu_{k}^{\phi} \log(\mu_{k}) - (\phi + 1) \frac{X(S,R) (2^{\phi} - 1) \mu_{k}^{\phi} + 1}{M_{\phi} + 1} \\ & &\qquad + \phi 2^{\phi} \log(2) - \mu_{k}^{\phi} \left[ \log(\mu_{k}) (\mu_{k} + \phi) - (\mu_{k} - 1 + 2^{\phi}) \log(\mu_{k} - 1 + 2^{\phi}) \right] \Bigg\}. \end{array} $$

Being a relaxation of the previous class, the resulting expressions of Δ are more complex than those of the previous class, which are an in turn more complex than those of the case ϕ = 0 Eq. 5. For further details on the derivation of Δ, see “S1.3.2 Counterpart degrees do not exceed one” in the Supplementary Information.

Notice that X(S,R) Eq. 13 and Mϕ Eq. 14 are determined by the degrees of the forms (μi’s). To explore the phase space with a realistic distribution of μi’s, we assume, without any loss of generality, that the μi’s are sorted decreasingly, i.e. μ1μ2 ≥ ...μiμi+ 1 ≥ ...μn. In addition, we assume

  1. 1.

    μn = 0, because we are investigating the problem of linking and unlinked form with counterparts.

  2. 2.

    μn− 1 = 1.

  3. 3.

    Form degrees are continuous.

  4. 4.

    The relationship between μi and its frequency rank is a right-truncated power-law, i.e.

    $$ \mu_{i} = c i^{-\tau} $$
    (15)

    for 1 ≤ in − 1.

Section “S2 Form degrees and number of links” (Supplementary Information) shows that forms then follow Zipf’s rank-frequency law, i.e.

$$ p(s_{i}) = c^{\prime}i^{-\alpha} $$

with

$$ \begin{array}{@{}rcl@{}} \alpha & = & \tau(\phi+1) \\ c^{\prime} & = & \frac{(n - 1)^{\alpha}}{M_{\phi}}. \end{array} $$

The value of Δ is determined by λ, ϕ, μk and the sequence of degrees of the forms, which we have parameterized with α and n. When \(\tau = \frac {\alpha }{\phi + 1} = 0\), namely when α = 0 or when \(\phi \rightarrow \infty \), we recover the class where vertex degrees do not exceed one but with just one form that is unlinked.

A continuous approximation to the number of edges gives (“S2 Form degrees and number of links”, Supplementary Information)

$$ M = (n-1)^{\frac{\alpha}{\phi+1}} \sum\limits_{i=1}^{n-1} i^{-\frac{\alpha}{\phi+1}}. $$
(16)

We aim to shed some light on the possible trajectory that children will describe on Fig. S1 as they become older. One expects that M tends to increase as children become older, due to word learning. It is easy to see that Eq. 16 predicts that, if ϕ and α remain constant, M is expected to increase as n increases (Fig. S1). Besides, when n remains constant, a reduction of α implies a reduction of M when ϕ = 0 but that effect vanishes for ϕ > 0 (Fig. S1). Obviously, n tends to increase as a child becomes older (Saxton, 2010) and thus children’s trajectory will be from left to right in Fig. S1. As for the temporal evolution of α, there are two possibilities. Zipf’s pioneering investigations suggest that α remains close to 1 over time in English children (Zipf, 1949, Chapter IV). In contrast, a wider study reported a tendency of α to decrease over time in sufficiently old children of different languages (Baixeries et al., 2013) but the study did not determine the actual number of children where that trend was statistically significant after controlling for multiple comparisons. Then children, as they become older, are likely to move either from left to right, keeping α constant, or from the left-upper corner (high α, low n) to the bottom-right corner (low α, high n) within each panel of Fig. S1. When ϕ is sufficiently large, the actual evolution of some children (decrease of α jointly with an increase of n) is dominated by the increase of M that the growth of n implies in the long run (Fig. S1).

When exploring the space of parameters, we must warrant that μk does not exceed the maximum degree that n, ϕ and α yield, namely μkμ1, where μ1 is defined according to Eq. 15 with i = 1, i.e.

$$ \begin{array}{@{}rcl@{}} \mu_{k} & \leq & \mu_{1} \\ & = & c \\ & = & (n-1)^{\tau} \\ & = & (n-1)^{\frac{\alpha}{\phi + 1}}. \end{array} $$
(17)

Results

Here we will analyze Δ, that takes a negative value when strategy a (linking a new form to a new counterpart) is more advantageous than strategy b linking a new form to an already connected counterpart), and a positive value otherwise. |Δ| indicates the strength of the bias towards strategy a if Δ < 0; towards strategy b if Δ > 0. Therefore, when Δ < 0, the smaller the value of Δ, the higher the bias for strategy a whereas when Δ > 0, the greater the value of Δ, the higher the bias for strategy b. Each class of skeleta is analyzed separately, beginning by the most restrictive class.

Vertex Degrees do not Exceed One

In this class of skeleta, corresponding to younger children, Δ depends only on ϕ, M and λ. We will explore the phase space with the help of two-dimensional heatmaps of Δ where the x-axis is always λ and the y-axis is M or ϕ.

Figure 4 reveals regions where strategy a is more advantageous (red) and regions where b is more advantageous (blue) according to Δ. The extreme situation is found when ϕ = 0 where a single red region covers practically all space except for λ = 0 (Fig. 4, top-left) as expected from previous work (Ferrer-i-Cancho, 2017a) and Eq. 5.

Fig. 4
figure 4

Δ, the difference between the cost of strategy a and strategy b, as a function of M, the number of links and λ, the parameter that controls the balance between mutual information maximization and entropy minimization, when vertex degrees do not exceed one Eq. 11. Red indicates that strategy a is more advantageous while blue indicates that b is more advantageous. The lighter the red, the stronger the bias for strategy a. The lighter the blue, the stronger the bias for strategy b. aϕ = 0, b ϕ = 0.5, c ϕ = 1, d ϕ = 1.5, e ϕ = 2 and f ϕ = 2.5

Figure 5a summarizes these finding of regions, displaying the curve that defines the boundary between strategies a and b (Δ = 0).

Fig. 5
figure 5

Summary of the boundaries between positive and negative values of Δ when vertex degrees do not exceed one (Fig. 4 for a, Fig. S2 for b). Each curve shows the points where Δ = 0 Eq. 13 as a function of λ for distinct values of ϕ. a has M on the y-axis and each curve belongs to a distinct value of ϕ, while b has ϕ on the y-axis and each curve belongs to a distinct value of M

Figure 5b displays equivalent boundary curves summarizing Fig. S2, where ϕ replaces M on the y-axis of the heatmap. In Fig. 5b, each curve corresponds to a value of M and ϕ is placed on the y-axis.

Figure 5a and b show that strategy b is the optimal only if λ is sufficiently low, namely when the weight of entropy minimization is sufficiently high compared to that of mutual information maximization. Figure 5a shows that the larger the value of λ the larger the number of links (M) that is required for strategy b to be optimal. Figure 5a also indicates that the larger the value of ϕ, the broader the blue region where b is optimal. From a symmetric perspective, Fig. 5b shows that the larger the value of λ the larger the value of ϕ that is required for strategy b to be optimal and also that the larger the number of links (M), the broader the blue region where b is optimal.

Counterpart Degrees do not Exceed One

For this class of skeleta, corresponding to older children, we have assumed that word frequencies follow Zipf’s rank-frequency law, namely the relationship between the probability of a form (the number of counterparts connected to each form) and its frequency rank follows a right-truncated power-law with exponent α (“The Mathematical Model”). Then Δ depends only on α (the exponent of the right-truncated power law), n (the number of forms), μk (the degree of the form linked to the counterpart in strategy b as shown in Fig. 3), ϕ and λ. We will explore the phase space with the help of two-dimensional heatmaps of Δ where the x-axis is always λ and the y-axis is μk, α or n. While in the class where vertex degrees do not exceed one we have found only one blue region (a region where Δ > 0 meaning that b is more advantageous), this class yields up to two distinct blue regions located in opposite corners of the heatmap while keeping always a red region as shown in Fig. 6 for ϕ = 1 from different perspectives. For the sake of brevity, this section only presents one set of heatmaps of Δ where ϕ = 1 and μk varies on the y-axis (see “S3 Complementary figures” in the Supplementary Information for the remainder). A summary of the exploration of the parameter space follows.

Fig. 6
figure 6

Δ, the difference between the cost of strategy a and strategy b, as a function of μk, the degree of the form linked to the counterpart in strategy b as shown in Fig. 3, the number of links and λ, the parameter that controls the balance between mutual information maximization and entropy minimization, when the degrees of counterparts do not exceed one Eq. 11 and ϕ = 1. Red indicates that strategy a is more advantageous while blue indicates that b is more advantageous. The lighter the red, the stronger the bias for strategy a. The lighter the blue, the stronger the bias for strategy b. Each heatmap corresponds to a distinct combination of n and α. The heatmaps are arranged, from left to right, with α = 0.5,1,1.5 and, from top to bottom, with n = 10,100,1000. a α = 0.5 and n = 10, b α = 1 and n = 10, c α = 1.5 and n = 10, d α = 0.5 and n = 100, e α = 1 and n = 100, f α = 1.5 and n = 100, g α = 0.5 and n = 1000, h α = 1 and n = 1000, i α = 1.5 and n = 1000

Heatmaps of Δ as a function of λ and μ k.

The heatmaps of Δ for different combinations of parameters in Figs. 6S3S4S5S6 and S7 are summarized in Fig. 7, showing the frontiers between regions where Δ = 0. Notice how, for ϕ = 0, strategy a is optimal for all values of λ > 0, as one would expect from Eq. 5. The remainder of the figures show how the shape of the two areas changes with each of the parameters. For small n and α, a single blue region indicates that strategy b is more advantageous than a when λ is closer to 0 and μk is higher. For higher n or α an additional blue region appears indicating that strategy b is also optimal for high values of λ and low values of μk.

Fig. 7
figure 7

Summary of the boundaries between positive and negative values of Δ when the degrees of counterparts do not exceed one (Figs. 6S3S4S5S6 and S7). Each curve shows the points where Δ = 0 Eq. 13 as a function of λ and μk for distinct values of ϕ. a α = 0.5 and n = 10, b α = 1 and n = 10, c α = 1.5 and n = 10, d α = 0.5 and n = 100, e α = 1 and n = 100, f α = 1.5 and n = 100, g α = 0.5 and n = 1000, h α = 1 and n = 1000, i α = 1.5 and n = 1000

Heatmaps of Δ as a function of λ and α.

The heatmaps of Δ for different combinations of parameters in Figs. S8S9S10S11 and S12 are summarized in Fig. S13, showing the frontiers between regions. There is a single region where strategy b is optimal for small values of μk and ϕ, but for larger values a second blue region appears.

Heatmaps of Δ as a function of λ and n.

The heatmaps of Δ for different combinations of parameters in Figs. S14S15S16S17 and S18 are summarized in Fig. S19. Again, one or two blue regions appear depending on the combination of parameters.

See “S4 Complementary figures with discrete degrees” in the SupplementaryInformation for the impact of using discrete form degrees on the results presented in this section.

Discussion

Vocabulary Learning

In previous research with ϕ = 0, we predicted that the vocabulary learning bias (strategy a) would be present provided that mutual information minimization is not disabled (λ > 0) (Ferrer-i-Cancho, 2017a) as show in Eq. 5. However, the “decision” on whether assigning a new label to a linked or to an unlinked object is influenced by the age of a child and his/her degree of polylingualism. As for the effect of the latter, polylingual children tend to pick familiar objects more often than monolingual children, violating mutual exclusivity. This has been found for younger children below two years of age (17-22 months old in one study, 17-18 in another) (Houston-Price et al., 2010; Byers-Heinlein & Werker, 2013).

From three years onward, the difference between polylinguals and monolinguals either vanishes, namely both violate mutual exclusivity similarly (Nicoladis & Laurent, 2020; Frank & Poulin-Dubois, 2002), or polylingual children are still more willing to accept lexical overlap (Kalashnikova et al., 2015). One possible explanation for this phenomenon is the lexicon structure hypothesis (Byers-Heinlein & Werker, 2013), which suggests that children that already have many multiple-word-to-single-object mappings may be more willing to suspend mutual exclusivity.

As for the effect of age on monolingual children, the so-called mutual exclusivity bias has been shown to appear at an early age and, as time goes on, it is more easily suspended. Starting at 17 months old, children tend to look at a novel object rather than a familiar one when presented with a new word while 16-month-olds do not show a preference (Halberda, 2003). Interestingly, in the same study, 14-month-olds systematically look at a familiar object instead of a newer one. Reliance on mutual exclusivity is shown to improve between 18 and 30 months (Bion et al., 2013). Starting at least at 24 months of age, children may suspend mutual exclusivity to learn a second label for an object (Liittschwager & Markman, 1994). In a more recent study, it has been shown that three year old children will suspend mutual exclusivity if there are enough social cues present (Yildiz, 2020). Four to five year old children continue to apply mutual exclusivity to learn new words but are able to apply it flexibly, suspending it when given appropriate contextual information (Kalashnikova et al., 2016) in order to associate multiple labels to the same familiar object. As seen before, at 3 years of age both monolingual and polylingual children have similar willingness to suspend mutual exclusivity (Nicoladis & Laurent, 2020; Frank & Poulin-Dubois, 2002), although polylinguals may still have a greater tendency to accept multiple labels for the same object (Kalashnikova et al., 2015).

Here we have made an important contribution with respect to the precursor of the current model (Ferrer-i-Cancho, 2017a): we have shown that the bias is not theoretically inevitable (even when λ > 0) according a more realistic model. In a more complex setting, research on deep neural networks has shed light on the architectures, learning biases and pragmatic strategies that are required for the vocabulary learning bias to emerge (e.g. Gandhi & Lake, 2020; Gulordava et al., 2020). In “Results”, we have discovered regions of the space of parameters where strategy a is not advantageous for two classes of skeleta. In the restrictive class, where one where vertex degrees do no exceed one, as expected in the earliest stages of vocabulary learning in children, we have unveiled the existence of a region of the phase space where strategy a is not advantageous (Figs. 5a and S2). In the broader class of skeleta where the degree of counterparts does not exceed one we have found up to two distinct regions where a is not advantageous (Figs. 7 and S13).

Crucially, our model predicts that the bias should be lost in older children. The argument is as follows. Suppose a child that has not learned a word yet. Then his skeleton belongs to the class where vertex degrees do not exceed one. Then suppose that the child learns a new word. It could be that he/she learns it following strategy a or b. If he applies b then the bias is gone at least for this word. Let us suppose that the child learns words adhering to strategy a for as long as possible. By doing this, he/she will increasing the number of links (M) of the skeleton keeping as invariant a one-to-one mapping between words and meanings (Figs. 1c and 2d), which satisfies that vertex degrees do not exceed one. Then Fig. 5a and b predict that the longer the time strategy a is kept (when ϕ > 0) the larger the region of the phase space where a is not advantageous. Namely, as times goes on, it will become increasingly more difficult to keep a as the best option. Then it is not surprising that the bias weakens either in older children (e.g., Yildiz, 2020; Kalashnikova et al., 2016), as they are expected to have more links (larger M) because of their continued accretion of new words (Saxton, 2010), or in polylinguals (e.g., Nicoladis & Secco, 2000; Greene et al., 2013), where the mapping of words into meanings combining all their languages, is expected to yield more links than in monolinguals. Polylinguals make use of code-mixing to compensate for lexical gaps, as reported for from one-year-olds onward (Nicoladis & Secco, 2000) as well as in older children (five year olds) (Greene et al., 2013). As a result, the bipartite skeleton of a polylingual integrates the words and association in all the languages spoken and thus polylinguals are expected to have a larger value of M. Children who know more translation equivalents (words from different languages but with same meaning), adhere to mutual exclusivity less than other children (Byers-Heinlein & Werker, 2013). Therefore, our theoretical framework provides an explanation for the lexicon structure hypothesis (Byers-Heinlein & Werker, 2013), but shedding light on the possible origin of the mechanism, that is not the fact that there are already synonyms but rather the large number of links (Fig. 5b) as well as the capacity of words of higher degree to attract more meanings, a consequence of Eq. 3 with ϕ > 0 in the vocabulary learning process (Fig. 3). Recall the stark contrast between Fig. 6 for ϕ = 1 and Fig. S3 with ϕ = 0, where such attraction effect is missing. Our models offer a transparent theoretical tool to understand the failure of deep neural networks to reproduce the vocabulary learning bias (Gandhi & Lake, 2020): in its simpler form (vertex degrees do not exceed one), whether it is due to an excessive ϕ (Fig. 5a) or an excessive M (Fig. 5b).

We have focused on the loss of the bias in older children. However, there is evidence that the bias is missing initially in children, by the age of 14 months (Halberda, 2003). We speculate that this could be related to very young children having lower values of λ or larger values of ϕ as suggested by Figs. 5a and S2. This issue should be the subject of future research. Methods to estimate ϕ and λ in real speakers should be investigated.

Now we turn our attention to skeleta where only the degree of the counterparts does not exceed one, that we believe to be more appropriate for older children. Whereas ϕ, λ and M sufficed for the exploration of the phase space when vertex degrees do not exceed one, the exploration of that kind of skeleta involved many parameters: ϕ, λ, n, μk and α. The more general class exhibits behaviors that we have already seen in the more restrictive class. While an increase in M implies a widening of the region where a is not advantageous in the more restrictive class, the more general class experiences an increase of M when n is increased but α and ϕ remain constant (“Counterpart degrees do not exceed one”). Consistently with the more restrictive class, such increase of M leads to a growth of the regions where a is not advantageous as it can be seen in Figs. 6S4S5S6 and S7 when selecting a column (thus fixing α and ϕ) and moving from the top to the bottom increasing n. The challenge is that α may not remain constant in real children as they become older and how to involve the remainder of the parameters in the argument. In fact, some of these parameters are known to be correlated with child’s age:

  • n tends to increase over time in children, as children are learning new words over time (Saxton, 2010). We assume that the loss of words can be neglected in children.

  • M tends to increase over time in children. In this class of skeleta, the growth of M has two sources: the learning of new words as well as the learning of new meanings for existing words. We assume that the loss of connections can be neglected in children.

  • The ambiguity of the words that children learn over time tends to increase over time (Casas et al., 2018). This does not imply that children are learning all the meanings of the word according to some online dictionary but rather than as times go on, children are able to handle words that have more meanings according to adult standards.

  • α remains stable over time or tends to decrease over time in children depending on the individual (Baixeries et al., 2013; Zipf, 1949, Chapter IV).

For other parameters, we can just speculate on their evolution with child’s age. The growth of M and the increase in the learning of ambiguous words over time leads to expect that the maximum value of μk will be larger in older children. It is hard to tell if older children will have a chance to encounter larger values of μk. We do not know the value of λ in real language but the higher diversity of vocabulary in older children and adults (Baixeries et al., 2013) suggests that λ may tend to increase over time, because the lower the value of λ, the higher the pressure to minimize the entropy of words Eq. 4, namely the higher the force towards unification in Zipf’s view (Zipf, 1949). We do not know the real value of ϕ but a reasonable choice for adult language is ϕ = 1 (Ferrer-i-Cancho & Vitevitch, 2018).

Given the complexity of the space of parameters in the more general class of skeleta where only the degrees of counterparts cannot exceed one, we cannot make predictions that are as strong as those stemming from the class where vertex degrees cannot exceed one. However, we wish to make some remarks suggesting that a weakening of the vocabulary learning bias is also expected in older children for this class (provided that ϕ > 0). The combination of increasing n and a value of α that is stable over time suggests a weakening of the strategy a over time from different perspectives

  • Children evolve on a column of panels (constant α) of the matrix of panels in Figs. 6S4S5S6 and S7, moving from top (low n) to the bottom (large n). That trajectory implies an increase of the size of the blue region, where strategy a is not advantageous.

  • We do not know the temporal evolution of μk but once μk is fixed, namely a row of panels is selected in Figs. S8S9S10S11 and S12, children evolve from left (lower n) to right (higher n), which implies an increase of the size of the blue region where strategy a is not advantageous as children become older.

  • Within each panel in Figs. S14S15S16S17 and S18, an increase of n, as a results of vocabulary learning over time, implies a widening of the blue region.

In the preceding analysis we have assumed that α remains stable over time. We wish to speculate on the combination of increasing n and decreasing α as time goes on in certain children. In that case, children would evolve close to the diagonal of the matrix of panels, starting from the right-upper corner (low n, high α, panel (c)) towards the lower-left corner (high n, low α, panel (g)) in Figs. 6S4S5S6 and S7, which implies an increase of the size of the blue region where strategy a is not advantageous. Recall that we have argued that a combined increase of n and decrease of α is likely to lead in the long run to an increase of M (Fig. S1). We suggest that the behavior ”along the diagonal” of the matrix is an extension of the weakening of the bias when M is increased in the more restrictive class (Fig. 5b).

In our exploration of the phase space for the class of the skeleta where the degrees of counterparts do not exceed one, we assumed a right-truncated power-law with two parameters, α and n as a model for Zipf’s rank-frequency law. However, distributions giving a better fit have been considered (Li et al., 2010) and function (distribution) capturing the shape of the law of what Piotrowski called saturated samples (Piotrowski & Spivak, 2007) should be considered in future research. Our exploration of the phase space was limited by a brute force approach neglecting the negative correlation between n and α that is expected in children where α and time are negatively correlated: as children become older, n increases as a result of word learning (Saxton, 2010) but α decreases (Baixeries et al., 2013). A more powerful exploration of the phase space could be performed with a realistic mathematical relationship of the expected correlation between n and α, which invites to empirical research. Finally, there might be deeper and better ways of parameterizing the class of skeleta.

Biosemiotics

Biosemiotics is concerned about building bridges between biology, philosophy, linguistics, and the communication sciences as announced in the front page of this journal. As far as we know, there is little research on the vocabulary learning bias in other species. Its confirmation in a domestic dog suggests that “the perceptual and cognitive mechanisms that may mediate the comprehension of speech were already in place before early humans began to talk” (Kaminski et al., 2004). We hypothesize that the cost function Ω captures the essence of these mechanisms. A promising target for future research are ape gestures, where there has been significant progress recently on their meaning (Hobaiter & Byrne, 2014). As far as we know, there is no research on that bias in other domains that also fall into the scope of biosemiotics, e.g., in unicellular organisms such as bacteria. Our research has established some mathematical foundations for research on the accretion and interpretation of signs across the living world, not only among great apes, a key problem in research program of biosemiotics (Kull, 2018).

The remainder of the discussion section is devoted to examine general challenges that are shared by biosemiotics and quantitative linguistics, a field that, as biosemiotics, aspires to contribute to develop a science beyond human communication.

Science and its Method

It has been argued that a problem of research on the rank-frequency is law is the The absence of novel predictions... which has led to a very peculiar situation in the cognitive sciences, where we have a profusion of theories to explain an empirical phenomenon, yet very little attempt to distinguish those theories using scientific methods. (Piantadosi, 2014). As we have already shown the predictive power of a model whose original target was the rank-frequency laws here and in previous research (Ferrer-i-Cancho, 2017a), we take this criticism as an invitation to reflect on science and its method (Altmann, 1993; Bunge, 2001).

The generality of the patterns for theory construction

While in psycholinguisics and the cognitive sciences a major source of evidence are often experiments involving restricted tasks or sophisticated statistical analyses covering a handful of languages (typically English and a few other Indo-European languages), quantitative linguistics aims to build theory departing from statistical laws holding in a typologically wide range of languages (Köhler, 1987; Debowski, 2020) as reflected in Fig. 1. In addition, here we have investigated a specific vocabulary learning phenomenon that is, however, supported cross-linguistically (recall “Introduction”). A recent review on the efficiency of languages, only pays attention to the law of abbreviation (Gibson et al., 2019) in contrast with the body of work that has been developed in the last decades linking laws with optimization principles (Fig. 1), suggesting that this law is the only general pattern of languages that is shaped by efficiency or that linguistic laws are secondary for deep theorizing on efficiency. In other domains of the cognitive sciences, the importance of scaling laws has been recognized (Chater & Brown, 1999; Kello et al., 2010; Baronchelli et al., 2013).

Novel predictions

In “Vocabulary Learning”, we have checked predictions of our information theoretic framework that matches knowledge on the vocabulary learning bias from past research. Our theoretical framework allows the researcher to play the game of science in another direction: use the relevant parameters to guide the design of new experiments with children or adults where more detailed predictions of the theoretical framework can be tested. For children who have about the same n and α, and ϕ = 1, our model predicts that strategy a will be discarded if (Fig. 6)

  1. (1)

    λ is low and μk (Fig.3) is large enough.

  2. (2)

    λ is high and μk is sufficiently low.

Interestingly, there is a red horizontal band in Fig. 6, and even for other values of ϕ such that ϕ≠ 1 but keeping ϕ > 0 (Figs. S4S5S6S7), indicating the existence of some value of μk or a range of μk where strategy a is always advantageous (notice however, that when ϕ > 1, the band may become too narrow for an integer μk to fit as suggested by Figs. S23, S24, S25 in the Supplementary Information, Section “S4 Complementary figures with discrete degrees”). Therefore the 1st concrete prediction is that, for a given child, there is likely to be some range or value of μk where the bias (strategy a) will be observed. The 2nd concrete prediction that can be made is on the conditions where the bias will not be observed. Although the true value of λ is not known yet, previous theoretical research with ϕ = 0 suggests that λ ≤ 1/2 in real language (Ferrer-i-Cancho & Sole, 2003; Ferrer-i-Cancho, 2005b; 2006; 2005a), which would imply that real speakers should satisfy only (1). Child or adult language researchers may design experiments where μk is varied. If successful, that would confirm the lexicon structure hypothesis (Byers-Heinlein & Werker, 2013) but providing a deeper understanding. These are just examples of experiments that could be carried out.

Towards a mathematical theory of language efficiency

Our past and current research on the efficiency are supported by a cost function and a (analytical or numerical) mathematical procedure that links the minimization of the cost function with the target phenomena, e.g., vocabulary learning, as in research on how pressure for efficiency gives rise to Zipf’s rank-frequency law, the law of abbreviation or Menzerath’s law (Gustison et al., 2016; Ferrer-i-Cancho, 2005b, 2019). In the cognitive sciences, such a cost function and the mathematical linking argument are sometimes missing (e.g., Piantadosi et al., 2011) and neglected when reviewing how languages are shaped by efficiency (Gibson et al., 2019). A truly quantitative approach in the context of language efficiency is two-fold: it has to comprise either a quantitative description of the data and a quantitative theorizing, i.e. it has to employ both statistical methods of analysis and mathematical methods to define the cost and the how cost minimization leads to the expected phenomena. Our framework relies on standard information theory (Cover & Thomas, 2006) and its extensions (Ferrer-i-Cancho et al., 2019; Debowski, 2020). The psychological foundations of the information theoretic principles postulated in that framework and the relationships between them have already been reviewed (Ferrer-i-Cancho, 2018). How the so-called noisy-channel “theory” or noisy-channel hypothesis explains the results in (Piantadosi et al., 2011), others reviewed recently (Gibson et al., 2019) or language laws in a broad sense has not yet shown, to our knowledge, with detailed enough information theory arguments. Furthermore, the major conclusions of the statistical analysis of Piantadosi et al. (2011) have recently been shown to change substantially after improving the methods: effects attributable to plain compression are stronger than previously reported (Meylan and Griffiths, 2021). Theory is crucial to reduce false positives and replication failures (Stewart & Plotkin, 2021). In addition, higher order compression can explain more parsimoniously phenomena that are central in noisy-channel “theorizing” (Ferrer-i-Cancho, 2017b).

The trade-off between parsimony and perfect fit

Our emphasis is on generality and parsimony over perfect fit. Piantadosi (2014) makes emphasis on what models of Zipf’s rank-frequency law apparently do not explain while our emphasis is on what the models do explain and the many predictions they make (Table 1), in spite of their simple design. It is worth reminding a big lesson from machine learning, i.e. a perfect fit can be obtained simply by overfitting the data and another big lesson from the philosophy of science to machine learning and AI: sophisticated models (specially deep learning ones) are in most cases black boxes that imitate complex behavior but neither explain nor yield understanding. In our theoretical framework, the principle of contrast (Clark, 1987) or the mutual exclusivity bias (Markman and Wachtel, 1988; Merriman & Bowman, 1989) are not principles per se (or core principles) but predictions of the principle of mutual information maximization involved in explaining the emergence of Zipf’s rank-frequency law (Ferrer-Cancho 2003, 2005b) and word order patterns (Ferrer-i-Cancho, 2017b). Although there are computational models that are able to account for that vocabulary learning bias and other phenomena (Frank et al., 2009; Gulordava et al., 2020), ours is much simpler, transparent (in opposition to black box modeling) and to the best our knowledge, the first to predict that the bias will weaken over time providing a preliminary understanding of why this could happen.