Caregiving in humans is a universal characteristic with a long evolutionary history (Lang & Fowers, 2019). Indeed, it has been suggested that caregiving is a primate adaptation (de Waal, 1996; Warneken & Tomasello, 2006). Caregiving is critical for children’s survival, particularly in infancy, but also during early childhood (between the ages of about 3 and 7 years) and beyond. Accordingly, human adults seem to have evolved a series of adaptations for providing proper care to vulnerable offspring: (a) a sensitivity to neotenous cues and distress vocalizations; (b) specific tactile behaviors such as skin-to-skin contact; (c) attachment-related behaviors; and (d) the capacity for compassionate affect when faced with infants’ suffering (Goetz et al., 2010). For their part, children seem to have simultaneously evolved a series of adaptations to capture adults’ attention and convey their needs and emotions, what Trivers (1974:249) referred to as “psychological weapons in order to compete with their parents.” Thus, parenting involves a series of complex cost–benefit decisions wherein both adults’ and children’s inclusive fitness and behaviors guide final parental investment (Lancaster et al., 2010). Two well-studied examples of such “psychological weapons” in infancy are the organization of facial features and crying.

Ethologist Konrad Lorenz (1943) was the first to suggest that typical infant facial features, including large, rounded cheeks, a flat nose, a rounded head, a large head relative to body size, and adult-sized eyes (the “infant schema,” or Kindchenschema), might serve as an innate releasing mechanism that promotes positive adult caretaking behaviors. Subsequent research has generally confirmed this view (e.g., Franklin & Volk, 2018; Glocker et al., 2009; Senese et al., 2013), with attraction to cute infant faces present already in adolescence, and even in childhood (Borgi et al., 2014; Fullard & Reiling, 1976; Luo et al., 2020), mediated by both hormonal and neural systems (Kringelbach et al., 2016; Luo et al., 2015).

Crying is another extensively studied infant adaptation. It serves to maintain proximity with potential caregivers and to guide caregivers’ behavior (Bowlby, 1969; Soltis, 2004). Indeed, acoustic properties of crying convey critical information about infants’ needs and emotions (e.g., Wolff, 1969) and physical condition (e.g., Furlow, 1997). Crying particularly draws parents’ attention when it deviates from typical infant crying (Chittora & Patil, 2017), and it has been suggested to be a focal point for regulating parental investment (e.g., De Vries, 1984) and maltreatment (e.g., Frodi, 1985). Fathers seem to be as good as mothers at recognizing their babies’ cries, provided they have spent enough time with them (Gustafsson et al., 2013), and, curiously enough, it has been found that infants adjust their crying melody to their native language. For example, the fundamental frequency (F0), or pitch contour (melody), of crying differs between French and German newborns, mirroring the melodic patterns of their respective mother tongues (Mampe et al., 2009). Similar differences have also been reported among German, Chinese, and Nso (Cameroon) infants’ crying (Wermke et al., 2016, 2017).

Cues Signaling Need for Care during Early Childhood

In contrast to infancy, we know much less about the potential cues signalling the need for care expressed by children during early childhood. This is somewhat surprising given that children are still highly vulnerable following weaning (typically at about 3 years of age, on average, in traditional societies; Dettwyler, 2017), and, in most traditional societies, they require a long period of intensive allomaternal care (e.g., Konner, 2010; Lancy, 2015). We do know that facial cues do not seem to be as important for promoting caregiving during early childhood as they are during infancy. For example, according to adults’ judgments, 4.5-year-old children’s faces do not differ significantly in attractiveness and likeability from those of adults (Luo et al., 2011). However, the remarkable improvement in language skills following infancy increases the relevance of children’s speech as a cue to the need for care. In this vein, children who verbalize certain types of immature explanations of ordinary phenomena (what has been called “supernatural thinking”: e.g., “The sun’s not out today because it’s mad,” “The big peak is for long walks, and the small peak is for short walks”) are perceived more positively, and as more helpless, by both adults and older adolescents (14 to 17 years old) than children verbalizing more mature, adult-like explanations of the same phenomena (e.g., “The sun’s not out today because the clouds are blocking it”) (Bjorklund et al., 2010; Periss et al., 2012). In addition, these cues of immature thinking have been shown to prevail over physical cues (e.g., faces) in both adults and older adolescents when both are available (Hernández Blasi & Bjorklund, 2018; Hernández Blasi et al., 2015, 2017). However, little is known yet, to our knowledge, about the potential role of the voice (vocal cues) as a cue to the need for caregiving during early childhood, and examining that role was the main purpose of the present study.

Vocalizations as Cues of Immaturity

Vocal communication is ubiquitous in mammals, birds, amphibians, and reptiles (Hauser, 1996; Titze, 2017), dating back perhaps 30 million years (Belin et al., 2011). Quite probably, early nonhuman primates and our most direct hominin ancestors used it to convey long-distance information about dangers (alarm calls) and opportunities, to signal and maintain dominance, and to find mates (Cook, 2002; Seyfarth & Cheney, 2003). In humans, vocal communication seems to have preceded the evolution of speech (Cook, 2002; Titze, 2017). Voices are indeed special for the human brain (Belin et al., 2011), with some voice-selective neuron populations already present in 7-month-old infants (Grossman et al., 2010).

Human voices vary on a series of parameters, such as pitch, intensity, and timbre. Pitch, expressed as fundamental frequency (F0; an acoustic property linked to the vibration rate of the vocal folds during phonation, measured in Hertz, Hz), is likely one of the most salient and most empirically studied of these parameters (Cook, 2002; Rosenfield et al., 2020; Titze, 2000). Significant changes in pitch take place during early childhood, puberty, and later adulthood, often driven by hormonal changes (Titze, 2000). Newborns’ crying has a fundamental frequency between 400 and 600 Hz, whereas 3- to 6-year-olds’ fundamental frequency averages about 265 Hz (e.g., Capellari & Cielo, 2008; Michelsson & Michelsson, 1999; Trollinger, 2003). Sex differences in children’s voices can be identified beginning at about 4 years of age, although not always accurately, particularly for girls (e.g., Karlsson & Rothenberg, 1987; Perry et al., 2001; Sergeant et al., 2005). From 7 to 17 years of age, fundamental frequency decreases on average from about 250 Hz to about 200 Hz in females and to about 125 Hz in males, with the greatest decrease occurring in boys at 13–14 years of age (e.g., Balasubramaniam & Nikhita, 2017; Berger et al., 2019; Schneider et al., 2010).
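F0 values such as those above are typically extracted with dedicated tools such as Praat. As an illustration of the underlying principle only, the following toy Python sketch recovers the pitch of a synthetic 265 Hz tone (roughly a young child’s average F0) by autocorrelation; real child speech would require frame-wise analysis, voicing detection, and robust peak-picking, none of which are attempted here.

```python
import math

def estimate_f0(samples, sr, fmin=200.0, fmax=600.0):
    """Crude autocorrelation pitch estimator (illustration only).

    Searches lags corresponding to the fmin-fmax band and returns the
    frequency whose lag maximizes the autocorrelation of the signal.
    """
    lo, hi = int(sr / fmax), int(sr / fmin)   # lag search range in samples
    n = len(samples) - hi                      # fixed comparison window
    best_lag, best_corr = lo, -math.inf
    for lag in range(lo, hi + 1):
        corr = sum(samples[i] * samples[i + lag] for i in range(n))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sr / best_lag

# Synthetic 0.1 s, 265 Hz sine at a 44.1 kHz sampling rate
sr = 44100
tone = [math.sin(2 * math.pi * 265 * t / sr) for t in range(4410)]
f0 = estimate_f0(tone, sr)  # close to 265 Hz
```

The narrow search band (200–600 Hz) sidesteps the octave errors that plague naive autocorrelation pitch trackers; production tools such as Praat handle this far more robustly.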

Adults estimate children’s age relatively well from their voices until children are about 11 years of age, after which they systematically tend to underestimate the age of older girls (Assmann et al., 2013). Adults identify children’s sex better as children get older, although they are more accurate for older boys than for older girls (Assmann et al., 2011), and height is predicted from voice more precisely when the sex of the child is known (for instance, when an older girl is misidentified as a boy, her height is typically underestimated; Assmann et al., 2018). Adults’ ability to estimate the age of young children relatively accurately from their natural voices may have made differences in children’s voices a good target for natural selection to use as a cue to immaturity and the need for care.

The Current Study

The current study assessed the potential role of children’s voices as cues for needing care during early childhood (between the ages of about 3 and 7 years) by exploring how adults’ perception of some children’s traits can be inferred from the maturity of their voices. To this purpose, we presented samples of the voices of both preschool (about 5 years old) and school-age (about 10 years old) children to groups of college students and asked for their impressions of the degree of positive affect, negative affect, intelligence, and helplessness evoked by the voices. We also measured participants’ reaction times in making their decisions.

The samples of voices were recorded in natural settings while children verbalized neutral-content sentences. We hypothesized, first, that children with immature voices would be perceived by adults as having more positive affect and being more helpless than children with mature voices. Based on earlier research on adults’ perception of children’s verbalized thinking and facial features using the same paradigm (e.g., Bjorklund et al., 2010; Hernández Blasi et al., 2015), we also predicted that there would be no significant differences in negative-affect ratings between children with immature and children with mature voices. Finally, we made no predictions about adults’ reactions on items reflecting intelligence. On the one hand, previous literature indicating that adults can accurately estimate children’s age from their voices (e.g., Assmann et al., 2013) suggests that children with mature voices would be identified as older and thus be more apt to be selected on intelligence items than children with immature voices. On the other hand, research has also shown that children with mature faces are not always considered more intelligent than children with immature faces (e.g., Hernández Blasi & Bjorklund, 2018), and the same could hold for voices.

Second, we anticipated longer reaction times for the Negative-Affect items than for the other trait dimensions given that, as shown in previous research (e.g., Hernández Blasi et al., 2017), adults have more difficulty (i.e., take longer to make a decision) when having to assess children on negative as opposed to positive items. We did not make any predictions about which trait dimension (Positive Affect, Intelligence, Helpless), if any, would likely be the easiest for participants in terms of speed of decision-making.

Method

Participants

The sample consisted of 74 adults (61 female; Mage = 21.6 years, SD = 6 years, age range = 17–54 years) attending a public urban university in eastern Spain. All participants were college students, most taking classes in psychology (60; 81%), with the remainder enrolled in other degree programs (e.g., education) or at other levels (e.g., master’s). Their socioeconomic background was mainly middle class, typical of public universities in Spain. Participants were tested individually in the researchers’ laboratory. All participants volunteered for this study and received a small monetary compensation (2 euros). The study was approved by the University Research Ethics Committee.

Design

To obtain samples of the children’s voices, 53 children aged 3 to 12 years (26 boys and 27 girls) were recorded at their school with parental permission. Children were audio-recorded individually in a small, relatively isolated, and noise-free classroom using a TASCAM DR-40 digital recorder. After establishing rapport, we asked children to repeat two practice sentences: (1) “Today we are at the school ‘Grans i Menuts’” and (2) “My name is [child’s given name] and your name is [researcher’s given name].” Each child was then asked to repeat four neutral-content sentences, sequentially read aloud by one of the experimenters: (1) “I like the beach more than the mountains,” (2) “I like the mountains more than the beach,” (3) “I like traveling more by plane than by car,” and (4) “I like traveling more by car than by plane.” (These are English translations of the Spanish sentences that were recorded.) When a problem was detected (e.g., a pronunciation error, too low a voice, or a change in the wording of the sentence), we asked the child to repeat the sentence.

Edits of the four sentences for each child were made using the free, open-source audio editor Audacity (version 2.1.0). Measures of the pitch (fundamental frequency, Hz) and intensity (volume, dB) of the children’s voices were taken with Praat (version 6.0.24), free software for phonetic speech analysis designed by Paul Boersma and David Weenink at the University of Amsterdam. Recordings of four boys and four girls were selected (four 5-year-olds and four 10-year-olds), and four different sets of sentences were generated. Each set contained the eight children and the four neutral sentences, arranged as four pairs, each consisting of a 5-year-old and a 10-year-old of the same sex verbalizing the same sentence. However, in each set each of the four neutral sentences was articulated by a different pair of children, such that, across the four sets, each pair of children verbalized all four neutral sentences, a different one per set (Footnote 1).
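The rotation of sentences across pairs and sets described above amounts to a 4 × 4 Latin square. As an illustration (the pair and sentence labels here are placeholders, not the actual stimuli), the assignment can be generated as follows:

```python
pairs = ["boy_pair_1", "boy_pair_2", "girl_pair_1", "girl_pair_2"]
sentences = ["beach > mountains", "mountains > beach",
             "plane > car", "car > plane"]

# Latin-square rotation: in set s, pair p verbalizes sentence (p + s) mod 4,
# so within each set every sentence is spoken by a different pair, and across
# the four sets each pair covers all four sentences, a different one per set.
sets = [{pairs[p]: sentences[(p + s) % 4] for p in range(4)}
        for s in range(4)]
```

This guarantees that any effect of a particular sentence is balanced across the age/sex pairs when the four sets are pooled.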

Table 1 presents mean pitch (fundamental frequency, Hz) for 5- and 10-year-old boys’ and girls’ voices across the four sets. Intensity (volume, dB) of all the voice samples was equalized to about 72 dB volume, which, according to some voice experts (e.g., Bustos, 2012), would be within the typical range for spoken voice in Spain (65 to 75 dB). More specific criteria used for selecting the vocal stimuli are described in Appendix 2.

Table 1 Mean pitch (fundamental frequency, in Hz) of the boys’ and girls’ voices, across the four neutral sentences verbalized in each set (standard deviations in parentheses)

We used E-Prime (version 2.0) professional software to implement our design as a computerized experimental protocol that allowed us to record participants’ decisions and reaction times. Within each set, the order of presentation of each pair of voices was counterbalanced. The order of presentation of the 10-year-old and the 5-year-old voice within each pair was also counterbalanced across sets.

Procedure

Participants were tested individually at a university laboratory. An experimenter explained the procedure to the participant, who was assigned to one of the four sets generated for each condition. Birthdate, sex, and university degree program were the only personal data collected from the participants. The experiment was presented on an Acer V193 HQV LCD 18.5″ wide monitor, and the auditory stimuli were played through SHURE SRH440 adjustable headphones. Following completion of the study, which lasted between 5 and 10 min (although there was no time limit to complete the experiment), participants were thanked and given 2 euros.

The participants first read instructions about the experimental procedure on the computer screen (“Now you will listen to some short sentences. After listening to them, press any key [of the keyboard] and a series of questions will show up that you will have to answer one by one.”). Participants were told they should (1) press the key with a yellow sticker on it (over the letter Z of a QWERTY keyboard) to select the child whose voice was associated with an icon on the left side of the screen or (2) press the key with a green sticker on it (over the letter M) to select the child whose voice was associated with an icon on the right side of the screen. Participants were then told that they could press one of the two keys with red stickers on them (over the numbers 1 and 0) to listen again to either of the children’s voices. The participant was then presented with two practice trials (one with a pair of boys’ voices, and another with a pair of girls’ voices) to become familiar with the procedure.

Following practice, the participant was presented with four new pairs of children’s voices (two pairs of boys and two pairs of girls), appearing sequentially. Every pair presentation started with a short instruction printed at the top of the screen (Fig. 1): “Please listen to the following sentences and press any key to continue.” Five seconds later an icon consisting of a small black circle with a white speaker inside appeared at the middle-left of the screen with a written label below (e.g., “Boy A” or “Girl A”), and the first neutral sentence (e.g., uttered by a 5-year-old child) was presented. One second later, the same audiovisual sequence was repeated for the second child of the pair (e.g., a 10-year-old child), this time with the icon appearing at the middle-right of the screen (e.g., “Boy B” or “Girl B”), parallel to and at the same height as the previous one. The two children (A and B) verbalized exactly the same neutral sentence. Once the participant had listened to the two sentences and pressed a key on the keyboard, a written question appeared at the bottom of the screen: “Which of the two children do you think is the most [adjective or short statement in red]?” After the first question was answered, 13 more questions with their corresponding adjectives or short statements were delivered sequentially. Then a new pair of children, verbalizing the next neutral sentence, was presented on the screen following the same procedure. The order of presentation of the 14 adjectives or short statements was systematically and automatically counterbalanced by the computer program.

Fig. 1
figure 1

Core audiovisual sequence presented to participants on the computer screen. Screen 1: Five seconds after the instruction at the top of the screen appeared, an icon emerged at the middle-left of the screen with “Boy A” or “Girl A” indicated below, and then a neutral sentence (e.g., uttered by a 5-year-old child) was presented. Screen 2: One second later the second icon, with “Boy B” or “Girl B” indicated below, appeared, and the same neutral sentence uttered by the other child of the pair (in this example, a 10-year-old child) was presented. Screen 3: After the participant pressed a key on the keyboard, the first question appeared at the bottom of the screen; once it was answered, 13 more questions with their corresponding adjectives or short statements were delivered sequentially, in random order, in the same screen position

These 14 adjectives and short statements have been used in previous research (e.g., Bjorklund et al., 2010; Hernández Blasi et al., 2017) and constitute a selection of traits that are potentially meaningful in understanding interactions between adults and young children. Based on principal component analyses performed in these previous studies, we grouped the items into four factors or trait dimensions: Positive Affect (cute, friendly, nice, likeable), Negative Affect (sneaky, likely to lie, feel more irritated with, feel more angry with), Intelligence (smart, intelligent), and Helpless (helpless, feel more protective towards, feel like helping). One item (curious) did not load highly on any factor, and it was not included in subsequent analyses.

Results

When participants selected 5-year-old children for an adjective or short statement, their response was coded as 1, and when they selected 10-year-old children, their response was coded as 0. Therefore, mean scores significantly greater than 0.5 reflect that participants selected the 5-year-old children’s voices more often, whereas mean scores significantly less than 0.5 indicate that participants selected the 10-year-old children’s voices more often. Table 2 presents the proportion of participants who selected the 5-year-old (immature) voices by trait dimension (Positive Affect, Negative Affect, Intelligence, Helpless), as well as the mean reaction time per item.

Table 2 Proportion of participants selecting the 5-year-old child, and mean reaction time (in milliseconds), by trait dimension (Positive Affect, Negative Affect, Intelligence, Helpless) (standard deviations in parentheses)

To analyze mean scores, we first applied a series of two-tailed, single-sample t-tests (p < 0.001, to adjust for multiple contrasts) to determine whether the 5-year-old or the 10-year-old children’s voices were selected significantly more often than expected by chance (0.5). As can be seen in Table 2, 5-year-old children’s voices were selected significantly more often than chance for the Positive-Affect and Helpless items. In contrast, 10-year-old children’s voices were selected significantly more often than chance for the Intelligence items. Participants’ selections did not differ from chance for the Negative-Affect items (Footnote 2).
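For readers wishing to reproduce this kind of chance-level comparison, a minimal sketch of the two-tailed single-sample t statistic against 0.5 follows; the per-participant proportions below are hypothetical, for illustration only (the reported analyses used the actual data summarized in Table 2).

```python
import math

def one_sample_t(scores, mu=0.5):
    """One-sample t statistic of `scores` against a hypothesized mean `mu`.

    t = (mean - mu) / (s / sqrt(n)), with s the sample standard deviation.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / (n - 1)
    return (mean - mu) / math.sqrt(var / n)

# Hypothetical per-participant proportions of trials on which the
# 5-year-old voice was chosen for Helpless items
helpless = [0.9, 0.8, 1.0, 0.85, 0.75, 0.9, 0.8, 0.85]
t = one_sample_t(helpless)  # large positive t: 5-year-olds chosen above chance
```

With df = n − 1, the resulting t would then be compared against the two-tailed critical value at the corrected alpha level.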

To further assess the pattern of performance, we computed two one-way analyses of variance with repeated measures on trait dimension (Positive Affect vs. Negative Affect vs. Intelligence vs. Helpless), one for the proportion of participants who selected the 5-year-old voices and one for the reaction times. Preliminary analyses revealed no significant sex differences and no significant differences among the four sets of voices used in the experiment for either variable. Thus, our analyses collapsed data across sex and the four sets of voices.

The analysis of variance for the proportion of participants who selected the 5-year-old voices produced a significant effect of trait dimension, F(2.52, 183.75) = 101.07, p < 0.001, ηp2 = 0.58; post-hoc Bonferroni t-tests indicated Helpless (M = 0.85) > Positive Affect (M = 0.62) > Negative Affect (M = 0.51) > Intelligence (M = 0.19), all p values < 0.001. Analysis of variance of reaction times similarly yielded a significant main effect of trait dimension, F(3, 219) = 13.30, p < 0.001, ηp2 = 0.15. Post-hoc Bonferroni t-tests (p < 0.001) indicated that participants took more time to process Negative-Affect items (M = 2478.35 ms) than items of any of the other trait dimensions, which did not differ from one another (Positive Affect, M = 2162.49 ms; Intelligence, M = 2069.71 ms; Helpless, M = 2002 ms) (Footnote 3).

Discussion

The main purpose of the present study was to shed light on the potential role that children’s voices during early childhood may play as cues to adults of the need for care, in a way that might guide caregivers’ attention and action. To that end, we presented college students with a series of voice samples of 5-year-old versus 10-year-old children and asked them to rate the children on a series of trait dimensions (positive affect, negative affect, helpless, intelligence). We also measured their reaction times in doing so. We predicted, consistent with previous research examining the effect of children’s immature thinking on adult perceptions (e.g., Bjorklund et al., 2010; Hernández Blasi et al., 2015, 2017), that children with immature voices would be selected more frequently for positive affect and rated as more helpless than children with mature voices, and that the maturity of children’s voices would have no effect on negative-affect ratings. We were less certain about which children would be deemed more intelligent. We also hypothesized that reaction times would be longer for the Negative-Affect items than for the other trait dimensions.

As for the first hypothesis, as predicted, children with immature voices (i.e., the 5-year-olds) were perceived as having greater positive affect and being more helpless than children with mature voices (i.e., the 10-year-olds). In addition, children with mature voices were deemed higher in intelligence, whereas neither children with immature voices nor those with mature voices were selected more often than expected by chance for the Negative-Affect items. With respect to the second hypothesis, decision-making on the Negative-Affect items was, as predicted, significantly the most time-consuming, reflecting the difficulty participants have in making attributions of negative traits to children (cf. Hernández Blasi et al., 2017).

Two important conclusions derive from the results of this study. First, acoustic features of children’s voices can provide meaningful information to adults, regardless of the content of speech, that is potentially important in terms of care and upbringing during early childhood. Second, when compared with results from studies assessing both cognitive and facial cues of preschool children (cf. Hernández Blasi et al., 2015), voices seem to be as informative for adults as cognitive cues, and more informative than faces. This is illustrated by comparing the results of the current study with those from a study examining adults’ judgments based on children’s facial features and cognitive cues (i.e., expressions of immature supernatural cognition). Figure 2 presents (1) the proportion of adults selecting the 5-year-old children’s voices in the current study and, based on data from Hernández Blasi et al. (2015), (2) the proportion of adults selecting children professing immature supernatural thinking, and (3) the proportion of adults selecting the immature face (i.e., about 5 years old versus about 10 years old) for each of the four trait dimensions. As can be seen, children with immature voices were selected for the Positive-Affect and Helpless items at levels similar to those for children in the Hernández Blasi et al. (2015) study who expressed immature cognition, and in both cases more often than expected by chance. Likewise, for the Negative-Affect items adults selected the mature and immature children at chance levels for both voices and expressions of cognition, but not for the Intelligence items, where the mature children were selected more often than expected by chance in both cases. In contrast, children’s faces produced effects comparable to those of vocal and cognitive cues for the Positive-Affect and Negative-Affect items, but differed from them for the Helpless and Intelligence items, for which judgments based on faces were at chance levels.

Fig. 2
figure 2

Proportion of people selecting the child with the immature voice (Voices-Only), the immature supernatural thinking (Vignettes-Only), and the immature face (Faces-Only) by trait dimension (Positive Affect, Negative Affect, Intelligence, Helpless). Voices-Only data are from the current study; Vignettes-Only and Faces-Only data are from Hernández Blasi et al. (2015), included for comparative purposes. Note: according to t-tests, scores between .40 and .60 did not differ significantly from chance in their corresponding studies; scores above .60 indicate that immature children were selected significantly more often than expected by chance; and scores below .40 indicate that mature children were selected significantly more often than expected by chance

Taken together, currently available research on the effects of immature cognitive, facial, and vocal cues on adults depicts a scenario in which it is possible to make a preliminary proposal about the potential role of each of these cues during early development in attracting potential caregivers’ attention and conveying to adults critical information about children’s needs. According to this research, faces seem to be powerful cues for caregivers during infancy (e.g., Glocker et al., 2009; Lorenz, 1943), but they do not appear to be especially influential during early childhood (e.g., Luo et al., 2011, 2020), though they remain effective in arousing positive affect (e.g., Hernández Blasi & Bjorklund, 2018; Hernández Blasi et al., 2015). Conversely, cognitive cues seem to be particularly useful for caregivers during early childhood (e.g., Bjorklund et al., 2010; Periss et al., 2012), but they are of little value during infancy given infants’ productive language constraints. To our knowledge, the potential role of nonverbal cognitive cues in infancy in attracting adults’ attention and signalling critical information to them has not been studied in these terms. Finally, vocal cues seem to be equally worthwhile for caregivers during both infancy (e.g., through crying, as reviewed earlier) and early childhood, as the present study shows. In light of this, vocal cues are arguably among the most reliable cues for caregivers across early development.

A number of limitations of the current study require caution in interpreting the findings. First, whereas there is an extensive literature examining the effects of infants’ and children’s facial cues on adults (see Franklin & Volk, 2018), far fewer studies have examined the effects of infants’ and children’s nonfacial cues, particularly voices, on adults’ perceptions and behavior, and thus the results of the present study require replication and extension (although the near-identical results using simulated as opposed to children’s natural voices, reported in Appendix 1, serve as an initial replication of the present findings). Moreover, several methodological limitations of the present study must be acknowledged. (1) Our sample was mostly composed of young adults (mean age barely 21 years) from a WEIRD population, most likely without any parental experience. (2) Women were disproportionately represented in our sample, making statistical comparisons between men and women problematic. However, men and women in our study (and in the Simulated-Voices study, reported in Appendix 1) responded similarly in all conditions. Moreover, as women are (and historically have been) the primary caregivers in all societies, our results likely have ecological validity. (3) Our experimental scenario involved unknown (vs. one’s own) focal children and measured hypothetical (vs. real-life) reactions in adults. (4) We did not distinguish among the different parameters of the children’s voices, nor did we measure the degree of adults’ attention or actual caregiving behavior toward each voice. Future research is needed to address each of these limitations.

Indeed, more research is needed on several fronts. For example, we need to know more about when in development these effects of children’s voices on adults’ reactions emerge. Previous research found, for example, that the effect of young children’s cognitive cues is present in late adolescence (14–17 years old) but not before (Periss et al., 2012). In contrast, no developmental changes have been found with respect to children’s facial cues, with the pattern observed in adults already present in early adolescence (10–13 years old) (Hernández Blasi & Bjorklund, 2018; Luo et al., 2020). The developmental pattern of the effects of children’s voices on adults’ perceptions, however, has yet to be established.

More research is also needed on the potential effects of the different parameters of children’s voices and on the information those parameters might convey to adults. In this study, we controlled for the maturity of children’s natural voices, but the information in voices is multidimensional, and many different levels of information (lexical, grammatical, prosodic, fluency, fundamental frequency, timbre) can potentially influence listeners’ perception of the maturity of children’s voices and deserve further attention. Results in the Simulated-Voices condition described in Appendix 1 suggest, for instance, that pitch variation, as indexed by fundamental frequency, is a highly informative parameter of children’s voices. Similarly, adults in the current study might have used voice maturity as a cue to children’s age and, on that basis, made inferences about children’s intelligence or degree of helplessness. Although this is possible, children’s voices, much like their faces, may signal not only children’s age but also other features that are potentially important for parental decisions and investment (Franklin & Volk, 2018). For example, cuteness ratings based on children’s faces are highly correlated with health, age (negatively), and happiness ratings (Volk et al., 2007), and it seems likely that features of children’s voices serve similar functions.

Another issue that needs to be addressed is the relative importance of young children’s vocal cues when cognitive or facial cues are also available. For example, if a child with an immature voice and a mature face were pitted against a child with a mature voice and an immature face, which child, if either, would be deemed the more helpless: the one with the immature voice or the one with the immature face? In similar research contrasting young children’s cognitive and facial cues, when the two cues were presented in competition, as in the example above, cognitive cues had more influence on decisions than facial cues for both adults and older adolescents (Hernández Blasi & Bjorklund, 2018; Hernández Blasi et al., 2015). Conversely, the literature typically reports that when both facial and vocal cues are available, facial cues usually take the lead in the recognition of human emotions (e.g., Kappas et al., 1991).

Finally, even more important than disentangling which of young children’s cues (vocal, physical, or cognitive) might be more powerful and/or “honest” (in a Zahavian sense; e.g., Zahavi, 1987) is unravelling to what extent these different cues provide caregivers with redundant or complementary information about children. Multimodal signalling, that is, conveying information about an underlying trait (e.g., genetic quality) through more than one modality (e.g., faces and voices), is common in nonhuman species (Smith et al., 2016). According to the redundant signal hypothesis, “multiple cues considered in combination provide a better estimate than any single cue” (Tognetti et al., 2020:824). In this sense, available research on young children’s signalling suggests that vocal, facial, and cognitive cues are likely redundant with regard to positive affect. When it comes to signalling young children’s helplessness, however, only vocal and cognitive cues, but not faces, seem to be redundant. Interestingly, evidence from cognitive psychology, clinical neuroscience, and neuroimaging suggests that in humans some of these systems (e.g., face and voice processing) develop in parallel, with audiovisual integration of information taking place simultaneously (e.g., Belin et al., 2004, 2011). This parallel development may indicate that these different signalling systems evolved in parallel as well. The integration of the information provided by these systems may be critical because “it allows our brain to exploit redundancies between face and voice and combine non-redundant, complementary cues to maximize information gathered from the two modalities” (Belin et al., 2011:719).
We do not yet know to what extent children’s vocal, facial, and cognitive cues operate during early childhood as “back-up,” or redundant, signals (conveying similar information about young children) or as “multiple messages,” or complementary, signals (each cue conveying different information about young children). This is undoubtedly a major challenge for researchers in the near future, and an undertaking that is worth the effort.

To our knowledge, this is the first study to report evidence of how children’s voices can influence adults’ reactions toward young children in terms of positive affect, negative affect, intelligence, and helplessness appraisals. The study has shown that children with immature voices are perceived more positively and deemed more helpless and less intelligent than children with mature voices. Overall, this study draws attention to the fact that the voice, regardless of the content of speech, is a powerful cue for children’s caregivers, not just during infancy but also during early childhood, and is one of the richest sources of information for parents and others during the first six years of life.