Large Language Models (LLMs) have sparked substantial interest and debate over whether they exhibit an emergent Theory of Mind (ToM) ability. Current ToM evaluations focus on testing models with machine-generated data or game settings that are prone to shortcuts and spurious correlations, and they lack evaluation of machine ToM ability in real-world human interaction scenarios. This creates a pressing need for new benchmarks grounded in real-world scenarios. We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation settings, covering multi-dimensional mental states (i.e., desires, beliefs, and intentions). Our benchmark builds on the Belief-Desire-Intention (BDI) agent modeling theory, and we conduct empirical experiments to evaluate large language models on it. Our findings demonstrate that NegotiationToM is challenging for state-of-the-art LLMs, which consistently perform significantly worse than humans, even when employing the chain-of-thought (CoT) method.