Language is fundamental to human cooperation, facilitating not only the exchange of information but also the coordination of actions through shared interpretations of situational contexts. This study explores whether the Generative Agent-Based Model (GABM) Concordia can effectively model Theory of Mind (ToM) within simulated real-world environments. Specifically, we assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context rather than relying on linguistic memorization. Our findings reveal a critical limitation: GPT-4 frequently fails to select actions based on belief attribution, suggesting that the apparent ToM-like abilities observed in previous studies may stem from shallow statistical associations rather than genuine reasoning. Additionally, the model struggles to generate coherent causal effects from agent actions, exposing difficulties in processing complex social interactions. These results challenge current claims about emergent ToM-like capabilities in LLMs and highlight the need for more rigorous, action-based evaluation frameworks.