MSAVBench: Towards Comprehensive and Reliable Evaluation of Multi-Shot Audio-Video Generation

Yujie Wei, Yujin Han, Zhekai Chen, Yongming Li, Kaixun Jiang, Zhihang Liu, Quanhao Li, Zhiwu Qing, Xiang Wang, Zhen Xing, Ruihang Chu, Lingyi Hong, Yefei He, Junjie Zhou, Junqiu Yu, Yang Shi, Difan Zou, Kai Zhu, Shiwei Zhang, Yingya Zhang, Yu Liu, Xihui Liu, Hongming Shan

Video generation is rapidly evolving from single-shot synthesis to complex multi-shot audio-video (MSAV) narratives to meet real-world demands. However, evaluating such frontier models remains a fundamental challenge. Existing benchmarks are limited in scope and data diversity and rely on rigid evaluation pipelines, which prevents systematic and reliable assessment of modern MSAV models. To bridge these gaps, we introduce MSAVBench, the first comprehensive benchmark and adaptive hybrid evaluation framework for multi-shot audio-video generation. Our benchmark spans four key dimensions (video, audio, shot, and reference) and covers diverse task settings, shot counts of up to 15, and challenging non-realistic scenarios. Our evaluation framework improves robustness through an adaptive self-correction mechanism for shot segmentation, instance-wise rubrics for subjective metrics, and tool-grounded evidence extraction for complex judgments. Furthermore, MSAVBench achieves high alignment with human judgments, reaching a Spearman rank correlation of 91.5%. Our systematic evaluation of 19 state-of-the-art closed- and open-source models shows that current systems still struggle with director-level control and fine-grained audio-visual synchronization, while modular or agentic generation pipelines offer a promising path toward narrowing the gap between open- and closed-source models. We will release the benchmark data and evaluation code to facilitate future research.
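
As a rough illustration of the human-alignment figure quoted above, the sketch below computes a Spearman rank correlation between automatic scores and human ratings. It is not the authors' evaluation code: the function names `average_ranks` and `spearman` are hypothetical, and the five score pairs are made-up values chosen only to show the arithmetic.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with order[i].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-model scores (illustrative only, not taken from the paper).
benchmark_scores = [0.62, 0.71, 0.55, 0.80, 0.67]
human_ratings = [3.1, 3.4, 2.9, 4.2, 3.6]

rho = spearman(benchmark_scores, human_ratings)
print(f"Spearman rho = {rho:.3f}")
```

Because the statistic depends only on rank order, it measures whether the benchmark orders models the way human raters do, regardless of the scale on which either score is reported.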

