This paper examines the extent to which large language models (LLMs) have developed higher-order theory of mind (ToM): the human ability to reason about multiple mental and emotional states in a recursive manner (e.g. I think that you believe that she knows). It builds on prior work by introducing a handwritten test suite -- Multi-Order Theory of Mind Q&A -- and using it to compare the performance of five LLMs to a newly gathered adult human benchmark. We find that GPT-4 and Flan-PaLM reach adult-level and near adult-level performance, respectively, on ToM tasks overall, and that GPT-4 exceeds adult performance on 6th-order inferences. Our results suggest that there is an interplay between model size and finetuning for the realisation of ToM abilities, and that the best-performing LLMs have developed a generalised capacity for ToM. Given the role that higher-order ToM plays in a wide range of cooperative and competitive human behaviours, these findings have significant implications for user-facing LLM applications.