Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity and rely on rigid evaluation pipelines, which prevents systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions (video, audio, shot, and reference) and covers diverse task settings, shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.
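For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of how a Spearman rank correlation between automatic scores and human ratings can be computed. It is not the authors' evaluation code: the function names `average_ranks` and `spearman`, and all score values, are hypothetical and chosen only for illustration. We also assume that a correlation reported as 91.5% corresponds to a coefficient of about 0.915.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of items tied with the item at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-model scores (illustrative values, not taken from the paper).
metric_scores = [0.62, 0.71, 0.55, 0.80, 0.67]  # automatic benchmark scores
human_scores = [3.1, 3.4, 2.9, 4.2, 3.6]        # mean human ratings

rho = spearman(metric_scores, human_scores)
print(f"Spearman rho = {rho:.3f}")
```

In this toy example the two rankings differ only in the order of two adjacent models, which gives a coefficient of 0.9. Because the statistic depends only on ranks, it measures whether the benchmark orders models the way human raters do, regardless of the scale of either score.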