Theory of mind (ToM) evaluations currently focus on testing models with passive narratives that inherently lack interactivity. We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering. Our benchmark draws upon important theoretical requisites from psychology and necessary empirical considerations when evaluating large language models (LLMs). In particular, we formulate multiple types of questions that demand the same underlying reasoning, in order to identify an illusory or false sense of ToM capabilities in LLMs. We show that FANToM is challenging for state-of-the-art LLMs, which perform significantly worse than humans even with chain-of-thought reasoning or fine-tuning.