Joint audio-video generation models are rapidly approaching professional production quality, raising a central question: do they understand audio-visual physics, or merely generate plausible sounds and frames that violate real-world consistency? We introduce AV-Phys Bench, a benchmark for evaluating physical commonsense in joint audio-video generation. AV-Phys Bench tests models across three scene categories: Steady State, Event Transition, and Environment Transition. It covers physics-grounded subcategories drawn from real-world scenes, plus Anti-AV-Physics prompts that deliberately request physically inconsistent audio-video behavior. Each generation is evaluated along five dimensions: visual semantic adherence, audio semantic adherence, visual physical commonsense, audio physical commonsense, and cross-modal physical commonsense. Across three proprietary and four open-source models, we find that Seedance 2.0 performs best overall, but all models remain far from robust physical understanding. Performance drops sharply on event-driven and environment-driven transitions, and even strong proprietary systems collapse on Anti-AV-Physics prompts. We further introduce AV-Phys Agent, a ReAct-style evaluator that combines a multimodal language model with deterministic acoustic measurement tools, producing rankings that closely align with human ratings. Our results identify cross-modal physical consistency and transition-driven scene dynamics as key open challenges for joint audio-video generation.