Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.
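The abstract reports Pass^3 without defining it. As a minimal sketch, assume Pass^k counts a task as passed only when all k independent attempts succeed, and the score is the fraction of tasks passed. That definition, the function name `pass_hat_k`, and the toy outcomes below are assumptions for illustration and are not taken from the benchmark.

```python
from typing import Dict, List


def pass_hat_k(results: Dict[str, List[bool]], k: int = 3) -> float:
    """Fraction of tasks whose first k attempts all succeeded (assumed definition)."""
    if not results:
        raise ValueError("results must contain at least one task")
    passed = 0
    for task_id, attempts in results.items():
        if len(attempts) < k:
            raise ValueError(f"task {task_id!r} has fewer than {k} attempts")
        # A single failed attempt among the first k fails the whole task.
        if all(attempts[:k]):
            passed += 1
    return passed / len(results)


# Hypothetical outcomes for three tasks, three attempts each.
toy_results = {
    "task_a": [True, True, True],
    "task_b": [True, False, True],
    "task_c": [False, False, False],
}
score = pass_hat_k(toy_results, k=3)
print(f"Pass^3 = {score:.1%}")
```

Under this reading, a task that succeeds on some attempts but not all contributes nothing to the score, so a metric of this kind rewards consistency rather than occasional success.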