Comparing Machines and Children: Using Developmental Psychology Experiments to Assess the Strengths and Weaknesses of LaMDA Responses

Developmental psychologists have spent decades devising experiments to test the intelligence and knowledge of infants and children, tracing the origin of crucial concepts and capacities. Moreover, experimental techniques in developmental psychology have been carefully designed to discriminate the cognitive capacities that underlie particular behaviors. We propose that using classical experiments from child development is a particularly effective way to probe the computational abilities of AI models in general, and large language models (LLMs) in particular. First, the methodological techniques of developmental psychology, such as the use of novel stimuli to control for past experience or control conditions to determine whether children are using simple associations, can be equally helpful for assessing the capacities of LLMs. Second, testing LLMs in this way can tell us whether the information that is encoded in text is sufficient to enable particular responses, or whether those responses depend on other kinds of information, such as information from exploration of the physical world. In this work, we adapt classical developmental experiments to evaluate the capabilities of LaMDA, a large language model from Google. We propose a novel LLM Response Score (LRS) metric, which can be used to evaluate other language models, such as GPT. We find that LaMDA generates appropriate responses that are similar to those of children in experiments involving social understanding, perhaps providing evidence that knowledge of these domains is discovered through language. On the other hand, LaMDA's responses in early object and action understanding, theory of mind, and especially causal reasoning tasks are very different from those of young children, perhaps showing that these domains require more real-world, self-initiated exploration and cannot simply be learned from patterns in language input.
