SVBench: Evaluation of Video Generation Models on Social Reasoning

Recent text-to-video generation models have made remarkable progress in visual realism, motion fidelity, and text-video alignment, yet they remain fundamentally limited in their ability to generate socially coherent behavior. Humans effortlessly infer intentions, beliefs, emotions, and social norms from brief visual cues; current models, by contrast, tend to render literal scenes without capturing the underlying causal or psychological logic. To systematically evaluate this gap, we introduce the first benchmark for social reasoning in video generation. Grounded in findings from developmental and social psychology, our benchmark organizes thirty classic social cognition paradigms into seven core dimensions: mental-state inference, goal-directed action, joint attention, social coordination, prosocial behavior, social norms, and multi-agent strategy. To operationalize these paradigms, we develop a fully training-free, agent-based pipeline that (i) distills the reasoning mechanism of each experiment, (ii) synthesizes diverse video-ready scenarios, (iii) enforces conceptual neutrality and difficulty control through cue-based critique, and (iv) evaluates generated videos using a high-capacity vision-language model (VLM) judge across five interpretable dimensions of social reasoning. Using this framework, we conduct the first large-scale study of seven state-of-the-art video generation systems. Our results reveal substantial performance gaps: while modern models excel at surface-level plausibility, they systematically fail at intention recognition, belief reasoning, joint attention, and prosocial inference.
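As an illustration of how step (iv) could feed a per-model comparison, the following is a minimal sketch and not the authors' implementation. It assumes the VLM judge returns five numeric scores per video, each in [0, 1]. The abstract does not name the five judge criteria or the score scale, so they are left anonymous here. The names `JudgedVideo`, `aggregate`, and `NUM_JUDGE_CRITERIA` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The seven core dimensions listed in the abstract.
SOCIAL_DIMENSIONS = (
    "mental-state inference",
    "goal-directed action",
    "joint attention",
    "social coordination",
    "prosocial behavior",
    "social norms",
    "multi-agent strategy",
)

# Assumption: the judge scores five criteria per video; the abstract
# does not name them, so they are treated as an anonymous tuple.
NUM_JUDGE_CRITERIA = 5


@dataclass(frozen=True)
class JudgedVideo:
    """One generated video with its (hypothetical) judge output."""
    model: str               # name of the video generation system
    dimension: str           # one of SOCIAL_DIMENSIONS
    criterion_scores: tuple  # assumed: five floats in [0, 1]


def aggregate(records):
    """Return {model: {dimension: mean score over that model's videos}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for r in records:
        if r.dimension not in SOCIAL_DIMENSIONS:
            raise ValueError(f"unknown dimension: {r.dimension!r}")
        if len(r.criterion_scores) != NUM_JUDGE_CRITERIA:
            raise ValueError("expected one score per judge criterion")
        # Collapse the five criteria into a single per-video score.
        buckets[r.model][r.dimension].append(mean(r.criterion_scores))
    return {
        model: {dim: mean(scores) for dim, scores in dims.items()}
        for model, dims in buckets.items()
    }
```

Averaging the five criteria with equal weight is a simplifying assumption made for this sketch; a weighted or per-criterion report would fit the same structure.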
