MTAVG-Bench 2.0: Diagnosing Failure Modes of Cinematic Expressiveness in Multi-Talker Audio-Video Generation

Haitian Li, Yanghao Zhou, Heyan Huang, Liangji Chen, YiMing Cheng, Xu Liu, Dian Jin, Jiajun Xu, Jingyun Liao, Tian Lan, Ziqin Zhou, Yueying Liu, Yu Bai, Changsen Yuan, Jinxing Zhou, Xian-Ling Mao, Xuefeng Chen, Yousheng Feng

In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.

