Large language models (LLMs) have recently shown strong performance on Theory of Mind (ToM) tests, prompting debate about the nature and extent of the underlying capabilities. At the same time, reasoning-oriented LLMs trained via reinforcement learning with verifiable rewards (RLVR) have achieved notable improvements across a range of benchmarks. This paper examines the behavior of such reasoning models on ToM tasks, using novel adaptations of machine-psychological experiments together with results from established benchmarks. We observe that reasoning models consistently exhibit increased robustness to prompt variations and task perturbations. Our analysis indicates that the observed gains are more plausibly attributed to greater robustness in finding the correct solution than to fundamentally new forms of ToM reasoning. We discuss the implications of this interpretation for evaluating social-cognitive behavior in LLMs.