To what degree should we ascribe cognitive capacities to Large Language Models (LLMs), such as the ability to reason about intentions and beliefs known as Theory of Mind (ToM)? Here we add to this emerging debate by (i) testing 11 base and instruction-tuned LLMs on capabilities relevant to ToM beyond the dominant false-belief paradigm, including non-literal language usage and recursive intentionality; (ii) using newly rewritten versions of standardized tests to gauge LLMs' robustness; (iii) prompting and scoring for open-ended as well as closed questions; and (iv) benchmarking LLM performance against that of children aged 7 to 10 on the same tasks. We find that instruction-tuned LLMs from the GPT family outperform other models, and often also children. Base LLMs are mostly unable to solve ToM tasks, even with specialized prompting. We suggest that the interlinked evolution and development of language and ToM may help explain what instruction tuning adds: rewarding cooperative communication that takes into account interlocutor and context. We conclude by arguing for a nuanced perspective on ToM in LLMs.