Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs

Research on mental state reasoning in language models (LMs) has the potential to inform theories of human social cognition, such as the theory that mental state reasoning emerges in part from language exposure, as well as our understanding of LMs themselves. Yet much published work on LMs relies on a relatively small sample of closed-source LMs, limiting our ability to rigorously test psychological theories and evaluate LM capacities. Here, we replicate and extend published work on the false belief task by assessing LM mental state reasoning behavior across 41 open-weight models (from distinct model families). We find sensitivity to implied knowledge states in 34% of the LMs tested; however, consistent with prior work, none fully "explain away" the effect in humans. Larger LMs show increased sensitivity and also exhibit higher psychometric predictive power. Finally, we use LM behavior to generate and test a novel hypothesis about human cognition: both humans and LMs show a stronger bias towards attributing false beliefs when knowledge states are cued using a non-factive verb ("John thinks...") than when they are cued indirectly ("John looks in the..."). Unlike the primary effect of knowledge states, where human sensitivity exceeds that of LMs, the magnitude of the human knowledge cue effect falls squarely within the distribution of LM effect sizes, suggesting that distributional statistics of language can in principle account for the latter but not the former in humans. These results demonstrate the value of using larger samples of open-weight LMs to test theories of human cognition and evaluate LM capacities.
