Theory of mind (ToM) evaluations currently focus on testing models using passive narratives that inherently lack interactivity. We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering. Our benchmark draws upon important theoretical requisites from psychology and necessary empirical considerations when evaluating large language models (LLMs). In particular, we formulate multiple types of questions that demand the same underlying reasoning, in order to identify an illusory or false sense of ToM capabilities in LLMs. We show that FANToM is challenging for state-of-the-art LLMs, which perform significantly worse than humans even with chain-of-thought reasoning or fine-tuning.