In recent years, Multi-Talker Audio-Video Generation (MTAVG) models have shown promising performance on fundamental metrics such as lip-sync and audio-visual alignment. However, these metrics remain insufficient for assessing cinematic expressiveness in scene-level generation. In multi-character scenes, generation models must go beyond audio-visual realism to convey coherent character performance and other higher-level cinematic qualities. To fill this gap, we introduce MTAVG-Bench 2.0, a benchmark for diagnosing failure modes of cinematic expressiveness in multi-talker audio-video generation. Unlike prior settings that mainly focus on the quality of basic multi-turn dialogue, MTAVG-Bench 2.0 targets short-drama and scene-level generation, and establishes a high-level failure taxonomy spanning acting, narrative, atmosphere, and audio-visual language. Based on this taxonomy, we construct more than 10,000 question-answering evaluation instances, together with subsets for short-drama-level assessment and temporal localization of failure modes, to systematically evaluate the ability of omni large language models to diagnose high-level audio-visual failures. Experimental results show that commercial omni models such as Gemini substantially outperform other evaluators, yet even the strongest models continue to struggle with complex failures in our benchmark. These results demonstrate that MTAVG-Bench 2.0 provides a systematic benchmark for failure diagnosis in cinematic multi-talker audio-video generation.