True general intelligence requires not only a model of the physical world but also a social world model: the capacity to infer how individual mental states interact and crystallize into group-level outcomes. Despite notable progress in individual-level Theory of Mind (ToM) reasoning, existing multimodal large language models fail at this broader task. Collective behavior emerges non-linearly from social tensions, conformity dynamics, and structural constraints, meaning it cannot be recovered by merely summing individual intentions. We present GroupToM-Bench, the first multimodal benchmark for group-level ToM, built around a causal chain spanning micro-level BDI states (belief, desire, intention), meso-level group tension and structural constraints, and macro-level outcome prediction and mechanistic attribution. To probe this full arc, we develop a seven-level cognitive audit framework. Experiments reveal a gap between current models and human baselines, highlighting a failure to process social structures and non-linear collective dynamics.