
Cognitive and Social Origins of Human Language: Theory of Mind as a Foundation

Hannah Speechly



The origins of human language remain a central and contested question across cognitive science, linguistics, and anthropology. Although language is widely accepted as a species-specific hallmark of Homo sapiens, scholars continue to debate how it evolved. Increasing evidence suggests that language did not emerge suddenly but instead developed gradually through the interaction of domain-general cognitive capacities and socially driven communication. At the heart of this evolutionary process lies the emergence of theory of mind (the ability to attribute mental states such as perceptions, beliefs, desires, and goals to others), which enabled a uniquely human kind of cooperative and flexible communication. Theory of mind and related higher-order cognitive capacities established a foundation for language as we know it: not just a system of signs organized through syntax, but a powerful tool for sharing intentions, negotiating meaning, and building human culture.


Early theories of language evolution often emphasized a sharp divide between human language and other forms of communication. Most prominently, Noam Chomsky’s discontinuity hypothesis proposed that language emerged suddenly in humans as the result of a specific genetic mutation. This mutation gave rise to a species- and domain-specific universal grammar: a biologically endowed mental faculty that allows children to acquire any natural language effortlessly, regardless of environmental variation (Chomsky, 2010). These views placed language largely outside broader cognitive and communicative systems, treating it as a unique computational module with little evolutionary precedent. However, this model faces substantial criticism on both evolutionary and empirical grounds. From an evolutionary perspective, it is unclear how a system as complex and specialized as language could have emerged fully formed through a single mutation, without clear intermediate stages (Fitch, 2010; Hurford, 2012). Empirically, research shows that language development depends critically on shared attention, cooperation, and rich social interaction, all of which are difficult to reconcile with a purely innate faculty (Tomasello, 2008).


Contemporary cognitive science increasingly supports a continuity model, which sees language as the product of general-purpose cognitive abilities shaped by cultural transmission. For example, Ferrigno et al. (2023) found that non-human primates can process nested, recursive-like sequences, suggesting that the hierarchical structure of syntax may build on pre-linguistic pattern recognition abilities. Similarly, Futrell and Hahn (2023) argue that the architecture of language reflects domain-general constraints such as working memory limitations and processing efficiency, rather than a dedicated language module. These findings resonate with Tomasello’s (2008, 2010b) usage-based account, which frames language as emerging from cognitive capacities like sequencing, categorization, and shared intentionality. Rather than being “hardwired”, linguistic structure arises from the ways in which these abilities interact within a social context.


Further support for the continuity model comes from research on iterated learning, which shows how linguistic structure can emerge through repeated social transmission. Kirby et al. (2014) review a series of experiments in which artificial languages passed across generations of learners became more structured and more learnable over time. Even without explicit instruction, learners impose regularities on their input. This mirrors historical linguistic changes such as the regularization of irregular English verbs (e.g., “clomb” becoming “climbed”) and the simplification of case systems, as seen in the transition from Old English, which marked nouns for gender, number, and case, to the streamlined morphology of Modern English. These findings suggest that language structure is not merely inherited but is shaped by cognitive biases, social interaction, and repeated learning.
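The regularizing dynamic of iterated learning can be illustrated with a deliberately simple simulation. This is a minimal sketch under stated assumptions, not the model used in the experiments Kirby et al. (2014) review: it tracks a single form with two competing variants (for instance a regular past tense like “climbed” and an irregular one like “clomb”), assumes each learner hears only a small sample of its predecessor’s output (the transmission bottleneck), and gives every learner a hypothetical `bias` parameter that exaggerates whichever variant it heard more often. The function names are illustrative.

```python
import random


def regularize(p_hat: float, bias: float) -> float:
    """Hypothetical learner bias: sharpen an observed proportion toward the majority variant.

    bias > 1 exaggerates whichever variant was more frequent in the input;
    bias == 1 reproduces the observed proportion faithfully.
    """
    a = p_hat ** bias
    b = (1.0 - p_hat) ** bias
    return a / (a + b)


def iterated_learning_chain(p_regular: float, generations: int, sample_size: int,
                            bias: float, seed: int = 0) -> list:
    """Pass a two-variant form (e.g. "climbed" vs. "clomb") down a chain of learners.

    p_regular is the probability that the first generation produces the regular variant.
    Each learner hears only `sample_size` tokens from its predecessor (the bottleneck),
    estimates the proportion of regular tokens, and then regularizes that estimate.
    Returns the production probability of every generation, starting with the first.
    """
    rng = random.Random(seed)
    history = [p_regular]
    p = p_regular
    for _ in range(generations):
        # The predecessor produces sample_size tokens; count the regular ones.
        heard_regular = sum(rng.random() < p for _ in range(sample_size))
        # The new learner's own production probability becomes the next "language".
        p = regularize(heard_regular / sample_size, bias)
        history.append(p)
    return history


if __name__ == "__main__":
    chain = iterated_learning_chain(0.6, generations=200, sample_size=20, bias=2.0, seed=1)
    print([round(p, 2) for p in chain[:6]])
    print(chain[-1])  # typically 0.0 or 1.0: one variant has taken over
```

With `bias=2.0` the variant mixture rarely survives: within relatively few generations the chain typically settles on one variant exclusively, although which variant wins depends on the random samples. Setting `bias=1.0` turns each learner into a faithful copier, so the mixture changes only through random drift, and more slowly.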


Taken together, this evidence supports a view of language evolution grounded in cognitive and social capacities that predate Homo sapiens. Rather than emerging fully formed, language likely arose through the mutually reinforcing interaction of biological constraints, cultural evolution, and social cognition.


Theory of Mind as a Precursor to Language

A key cognitive foundation for language is theory of mind (ToM): the capacity to attribute mental states such as beliefs, desires, and intentions to others. Even before explicit mental-state understanding emerges, humans and other primates appear capable of logical inference, which plays a foundational role in predicting behavior and reasoning about others. Disjunctive inference (ruling out one possibility to infer another) supports goal attribution and belief tracking. Call (2004) found that chimpanzees could identify the location of hidden food by excluding empty containers, suggesting reasoning beyond associative learning. Follow-up studies ruled out alternative explanations based on perceptual cues, supporting the view that non-human apes engage in genuine inferential reasoning (Call & Tomasello, 2008). This ability also appears early in human development: children as young as 23 months can perform disjunctive inference with minimal linguistic support (Mody & Carey, 2016). However, more robust and flexible reasoning emerges around age three and is shaped by language exposure and social experience (Leahy & Carey, 2020). These results suggest that while language enriches inferential reasoning, the underlying cognitive mechanisms, such as disjunctive inference, form a key basis for theory of mind by enabling individuals to represent and evaluate others’ beliefs and intentions.
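The exclusion logic behind tasks like Call’s (2004) can be stated as a disjunctive syllogism: the food is in A or in B; it is not in A; therefore it is in B. The short Python sketch that follows spells out that inference pattern only; it is not a model of what apes or children actually compute, and the function name `infer_by_exclusion` and the container labels are hypothetical.

```python
def infer_by_exclusion(hiding_places, shown_empty):
    """Disjunctive inference over hiding places (illustrative only).

    The reward is assumed to be in exactly one of `hiding_places`.
    Rule out every place shown to be empty; conclude only if one place remains.
    Returns the inferred place, or None when the evidence leaves the location open.
    """
    remaining = [place for place in hiding_places if place not in shown_empty]
    return remaining[0] if len(remaining) == 1 else None


# Two cups, one shown empty: "A or B; not A; therefore B."
print(infer_by_exclusion(["left cup", "right cup"], {"left cup"}))  # right cup
# Three cups, one shown empty: exclusion alone does not settle the location.
print(infer_by_exclusion(["left", "middle", "right"], {"left"}))  # None
```

The second call shows that ruling out one option licenses a conclusion only when exactly one alternative remains; with three containers, a single empty cup leaves the location open.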


ToM can be usefully divided into “mind-to-world” and “world-to-mind” orientations (Anscombe, 1957; Perner & Roessler, 2010). Mind-to-world states, such as beliefs, knowledge, and perceptions, aim to match external reality, while world-to-mind states, such as desires, goals, and intentions, motivate action toward a desired outcome. Both orientations are essential for language use, because effective communication depends not only on tracking them but also on domain-general reasoning such as logical inference and intention attribution. These capacities enable us to interpret utterances, anticipate responses, and coordinate meaning across diverse social contexts, such as recognizing that “Can you pass the salt?” expresses a desire rather than a question about ability, or that “I thought class started at 10” reports a belief that has turned out to be false.


To better understand how these mentalizing abilities may have evolved, researchers have also examined both orientations in non-human primates. Evidence from great apes suggests that some ToM components have deep evolutionary roots. Chimpanzees and bonobos demonstrate awareness of others’ goals and attentional states. For instance, chimpanzees can distinguish between an experimenter who is unwilling to give them food and one who is unable to do so because of a barrier (Call et al., 2004). They also adjust their gestures depending on whether their audience is visually attentive, indicating sensitivity to perception and intention (Leavens et al., 2005; Tomasello & Carpenter, 2005). Because these abilities mainly involve attributing intentions and goals, they fall largely within the world-to-mind domain.


However, current evidence suggests that great apes struggle with the mind-to-world domain, particularly when it comes to representing beliefs that differ from reality or from their own perspective. False belief understanding is a hallmark of mature ToM that remains elusive in non-human primates. In the classic “Smarties task”, a child is shown a familiar candy box, guesses that it contains candy, and then discovers that it actually contains pencils. When asked what another person, who has not seen inside, would think is in the box, children under the age of four typically answer “pencils”, revealing difficulty in separating their own knowledge from another person’s belief (Mitchell & Lacohée, 1991). False belief tasks adapted for apes rely on implicit measures such as eye-tracking rather than explicit, verbal responses, and they have produced limited evidence that apes predict behavior on the basis of false beliefs rather than observable behavioral cues. As a result, some scholars propose that apes operate with a minimal, behavior-based ToM, relying on observable cues rather than fully representing others’ mental states (Call & Tomasello, 2008).


Theory of Mind in Developmental Psycholinguistics

Children begin to grasp simple mental states such as desires and intentions around age two, recognizing, for example, that someone else might want something different from what they themselves want (Miller, 2010). By age three, they begin to show some recognition that people can hold false beliefs and that these beliefs can guide behavior, and by around age four they reliably pass explicit false belief tasks (Wellman, Cross, & Watson, 2001). Looking-time studies suggest that even infants as young as 15 months show implicit sensitivity to others’ beliefs in controlled contexts (Onishi & Baillargeon, 2005). This growing ability to attribute beliefs and desires, drawing on both world-to-mind and mind-to-world representations, enables more sophisticated social reasoning and forms the foundation for pragmatic language use such as storytelling, persuasion, and deception.


Crucially, human language allows us to refer explicitly to these mental states. When we talk about what someone thinks, wants, or knows, we embed representations of others’ mental states within our utterances, as in “She thinks that he wants to leave.” This recursive reasoning about others’ intentions appears to be uniquely human and may be what allows us to go beyond merely reacting to others’ behavior. While non-human primates show some precursors to theory of mind, they do not demonstrate the same capacity for nested, symbolic representation of mental states. Human children, in contrast, draw on gaze following, joint attention, and goal attribution, all of which emerge in infancy, to build shared intentionality.


Developmental psycholinguistics further supports this continuity between social cognition and language. Infants segment speech by detecting statistical regularities (Saffran et al., 1996) and use cues such as eye gaze to link words to their referents (Baldwin, 1993). These learning mechanisms emerge early and appear to be widespread across cultures, suggesting that they stem from domain-general cognitive capacities rather than language-specific modules. Importantly, language acquisition unfolds gradually: children move from single words to two-word utterances and telegraphic speech before mastering the complex syntax and pragmatics of adult speech. This developmental arc, from babbling to grammatically rich speech, mirrors proposed evolutionary stages of language emergence, from gestural or vocal protolanguages to grammatically rich symbolic systems (Jackendoff, 1999; Tomasello, 2008). Children’s systematic errors, such as overregularization (“goed,” “mouses”) or overextension (calling every four-legged animal “doggie”), reflect active rule-seeking behavior shaped by limited input. These learning dynamics parallel the cultural evolution of language itself, in which communicative pressures and inferential processes drive structure over generations.
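Transitional probability, the statistic at the center of Saffran et al.’s (1996) segmentation study, is simple to compute: for adjacent syllables X and Y, TP(Y | X) = frequency(XY) / frequency(X). Within a word the next syllable is highly predictable; across a word boundary it is not. The sketch is illustrative only: the trisyllabic “words” are invented for this example, and the rule that posits a boundary wherever TP falls below a fixed threshold is a simplifying assumption, not a claim about how infants actually segment speech.

```python
from collections import Counter


def transitional_probabilities(syllables):
    """TP(X -> Y) = count of the pair XY / count of X as the first member of a pair."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    first_counts = Counter(syllables[:-1])
    return {(x, y): n / first_counts[x] for (x, y), n in pair_counts.items()}


def segment(syllables, threshold=0.75):
    """Hypothetical segmentation rule: posit a word boundary wherever TP < threshold."""
    tps = transitional_probabilities(syllables)
    words, current = [], [syllables[0]]
    for prev, nxt in zip(syllables, syllables[1:]):
        if tps[(prev, nxt)] < threshold:
            words.append("".join(current))  # close the current word at the TP dip
            current = []
        current.append(nxt)
    words.append("".join(current))
    return words


# Invented lexicon and word order; the stream itself has no pauses between words.
LEXICON = {"tupiro": ["tu", "pi", "ro"], "golabu": ["go", "la", "bu"],
           "bidaku": ["bi", "da", "ku"], "padoti": ["pa", "do", "ti"]}
ORDER = ["tupiro", "golabu", "bidaku", "padoti", "golabu", "tupiro", "padoti",
         "bidaku", "tupiro", "bidaku", "golabu", "padoti", "tupiro"]
stream = [syl for word in ORDER for syl in LEXICON[word]]

tps = transitional_probabilities(stream)
print(tps[("tu", "pi")])  # 1.0: inside a word
print(tps[("ro", "go")])  # 0.3333333333333333: across a boundary
print(segment(stream) == ORDER)  # True
```

In this stream every within-word TP is 1.0, while each word-final syllable is followed by three different words, so cross-boundary TPs are 1/3 and the dips line up with the word boundaries. The 0.75 threshold is arbitrary; a fixed cutoff is only one simple way to turn the statistic into boundaries.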


Belief and Desire in ToM-Grounded Language

This raises a key question: is the understanding of beliefs and desires, which current evidence suggests great apes lack in its full form, necessary for human language? While it may not be essential for basic referential communication, it is indispensable for the flexible, pragmatic, and cooperative language use that defines human interaction. As early hominins navigated increasingly complex social environments, the ability to infer others’ mental states became vital for predicting behavior, coordinating in groups, and maintaining social cohesion. This higher-order social cognition provided a framework for cooperative communication and likely conferred strong evolutionary advantages (Call & Tomasello, 2008; Tomasello, 2008).


Human language is inherently cooperative. The philosopher H. P. Grice (1975) formalized this insight in his cooperative principle, which proposes that speakers and listeners tacitly follow shared conversational norms so that communication succeeds. Grice’s maxims of Quantity (say as much as needed, but no more), Quality (say what is true), Relation (be relevant), and Manner (be clear and orderly) are effective only if speakers assume that their interlocutors are intentional agents with shared goals. Interpreting language in light of these maxims requires understanding others’ beliefs, desires, and communicative intentions. For example, a listener who hears “Some of the students passed” infers that not all of them did, because a cooperative speaker who knew that all had passed would have said so. Without this mentalistic inference, communication would collapse into literalism or misunderstanding (Clark, 1996; Grice, 1975).


These foundational social-cognitive abilities, such as shared attention, joint intentions, and goal coordination, likely evolved under the pressures of group living. Hominins who could better interpret and cooperate with others were more likely to survive and reproduce. Over time, these interactions formed the scaffolding for increasingly structured and flexible communication (Tomasello, 2014). Language, in this view, is not a standalone system but an extension of preexisting social skills honed over millions of years (Scaife & Bruner, 1975; Tomasello, 2008).


The communicative function of language also points to its roots in social cognition. Scott-Phillips et al. (2015) argue that human language is fundamentally ostensive: it allows speakers to signal their communicative intentions openly and deliberately. This emphasis on cooperative communication distinguishes humans from even our closest primate relatives. For example, chimpanzees can produce contextually appropriate vocalizations (Girard-Buttoz et al., 2022; Slocombe & Zuberbühler, 2007), but they do not show the same inferential, trust-based communication on which human language relies. The ability to coordinate attention, infer intentions, and establish shared meaning likely evolved gradually as hominin social environments became more complex.


As these capacities matured, they were co-opted into increasingly elaborate communicative systems. The growing ability to attribute mental states and respond accordingly enabled humans to use language not only to exchange information, but to coordinate joint action, maintain social cohesion, and transmit cultural knowledge. These functions would have been especially adaptive in the face of environmental challenges and collective problem-solving demands, which required a high degree of cooperation and group coordination (Herrmann et al., 2007; Tomasello, 2010a). This body of evidence supports a view of language as an emergent property of the human social mind. As such, the continuity framework offers a powerful lens for understanding language acquisition, development, and the uniquely human capacity to construct and share meaning.


Toward an Integrated Theory of Language

Taken together, these findings support the continuity theory: language emerged not from a sudden evolutionary leap but through the gradual refinement of cognitive, social, and communicative capacities. Evidence increasingly suggests that language was scaffolded by pre-existing capacities such as joint attention, goal attribution, and cultural learning, which grew more complex as early humans navigated increasingly cooperative and socially demanding environments. This perspective does not reject the biological foundations of language. On the contrary, it acknowledges that neural and anatomical adaptations, such as modifications to the vocal tract and to brain regions involved in sequencing, memory, and sensorimotor integration, likely co-evolved with cultural pressures for more sophisticated forms of communication. Language is not a singular innovation but the product of dynamic interactions among biology, cognition, and culture over evolutionary time.


The continuity framework is valuable precisely because it synthesizes insights from multiple disciplines, linking comparative studies of primate communication with research on child development and computational modeling. It also reframes nativist observations: for instance, the fact that any typically developing child can acquire any natural language may reflect not only an innate learning capacity but also the essential role of shared intentionality, cooperative motivation, and rich communicative environments in shaping language development. Understanding how language evolved offers more than theoretical insight. It reshapes how we think about development, cognition, and what it means to be human. Language is not simply inherited; it is recreated in every child, every conversation, and every culture. It reflects our capacity not just to speak, but to cooperate, infer, and construct shared meaning across generations. Far from being a sudden evolutionary gift, language is one of the most remarkable outcomes of our cumulative social evolution.


References

Anscombe, G. E. M. (1957). Intention. Harvard University Press.

Baldwin, D. A. (1993). Infants’ ability to consult the speaker for clues to word reference. Journal of Child Language, 20(2), 395–418. https://doi.org/10.1017/S0305000900008345

Call, J. (2004). Inferences about the location of food in the great apes (Pan paniscus, Pan troglodytes, Gorilla gorilla, and Pongo pygmaeus). Journal of Comparative Psychology, 118(2), 232–241. https://doi.org/10.1037/0735-7036.118.2.232

Call, J., & Tomasello, M. (2008). Does the chimpanzee have a theory of mind? 30 years later. Trends in Cognitive Sciences, 12(5), 187–192. https://doi.org/10.1016/j.tics.2008.02.010

Chomsky, N. (2010). Some simple evo-devo theses: How true might they be for language? In R. K. Larson, V. Déprez, & H. Yamakido (Eds.), The Evolution of Human Language: Biolinguistic Perspectives (pp. 45–62). Cambridge University Press.

Clark, H. H. (1996). Using Language. Cambridge University Press.

Ferrigno, S., et al. (2023). Recursion in the wild: Evidence from a non-human primate. Nature Communications, 14, 2370. https://doi.org/10.1038/s41467-023-37816-y

Fitch, W. T. (2010). The Evolution of Language. Cambridge University Press.

Futrell, R., & Hahn, M. (2023). The cognitive constraints shaping language structure. Trends in Cognitive Sciences, 27(8), 688–701. https://doi.org/10.1016/j.tics.2023.05.001

Girard-Buttoz, C., Zaccarella, E., Bortolato, T., Friederici, A. D., Wittig, R. M., & Crockford, C. (2022). Chimpanzees produce diverse vocal sequences with ordered and recombinatorial properties. Communications Biology. https://doi.org/10.1038/s42003-022-03350-8

Grice, H. P. (1975). Logic and Conversation. In P. Cole & J. L. Morgan (Eds.), Syntax and Semantics, Vol. 3: Speech Acts (pp. 41–58). Academic Press.

Herrmann, E., Call, J., Hernández-Lloreda, M. V., Hare, B., & Tomasello, M. (2007). Humans have evolved specialized skills of social cognition: The cultural intelligence hypothesis. Science, 317(5843), 1360–1366. https://doi.org/10.1126/science.1146282

Hurford, J. R. (2012). The Origins of Grammar: Language in the Light of Evolution II. Oxford University Press.

Jackendoff, R. (1999). Possible stages in the evolution of the language capacity. Trends in Cognitive Sciences, 3(7), 272–279. https://doi.org/10.1016/S1364-6613(99)01333-9

Kirby, S., Griffiths, T. L., & Smith, K. (2014). Iterated learning and the evolution of language. Current Opinion in Neurobiology, 28, 108–114. https://doi.org/10.1016/j.conb.2014.07.014

Leahy, B., & Carey, S. (2020). The evolution of reasoning in young children. Child Development Perspectives, 14(2), 109–115.

Leavens, D. A., Hopkins, W. D., & Bard, K. A. (2005). Understanding the point of chimpanzee pointing: Epigenesis and ecological validity. Current Directions in Psychological Science, 14(4), 185–189.

Mitchell, P., & Lacohée, H. (1991). Children’s early understanding of false belief. Cognition, 39(2), 107–127. https://doi.org/10.1016/0010-0277(91)90040-B

Mody, S., & Carey, S. (2016). The emergence of reasoning by exclusion in infancy. Cognition, 152, 150–165.

Onishi, K. H., & Baillargeon, R. (2005). Do 15-month-old infants understand false beliefs? Science, 308(5719), 255–258.

Perner, J., & Roessler, J. (2010). Teleology and causality in children’s theory of mind. In D. D. Hutto & M. Ratcliffe (Eds.), Folk Psychology Re-assessed (pp. 71–90). Springer.

Saffran, J. R., Aslin, R. N., & Newport, E. L. (1996). Statistical learning by 8-month-old infants. Science, 274(5294), 1926–1928. https://doi.org/10.1126/science.274.5294.1926

Scaife, M., & Bruner, J. S. (1975). The capacity for joint visual attention in the infant. Nature, 253(5499), 265–266. https://doi.org/10.1038/253265a0 

Scott-Phillips, T. C., et al. (2015). Pragmatics and the origin of language. Current Biology, 25(23), R1237–R1241.

Slocombe, K. E., & Zuberbühler, K. (2007). Chimpanzees modify recruitment screams as a function of audience composition. Proceedings of the National Academy of Sciences, 104(43), 17228–17233.

Tomasello, M. (2008). Origins of Human Communication. MIT Press.

Tomasello, M. (2010a). Why don’t apes use language? In S. T. Parker & K. L. Gibson (Eds.), The Oxford handbook of comparative evolutionary psychology (pp. 419–436). Oxford University Press.

Tomasello, M. (2010b). Why We Cooperate. MIT Press.

Tomasello, M. (2014). The ultra‐social animal. European Journal of Social Psychology, 44(3), 187–194. https://doi.org/10.1002/ejsp.2015

Wellman, H. M., Cross, D., & Watson, J. (2001). Meta-analysis of theory-of-mind development: The truth about false belief. Child Development, 72(3), 655–684.

 
 
 

Academic Memories