MTAVG-Bench: A Diagnostic Benchmark for Multi-Talker Dialogue-Centric Audio-Video Generation

Yang-Hao Zhou, Haitian Li, Rexar Lin, Heyan Huang, Jinxing Zhou, Changsen Yuan, Tian Lan, Ziqin Zhou, Yudong Li, Jiajun Xu, Jingyun Liao, Yi-Ming Cheng, Xuefeng Chen, Xian-Ling Mao, Yousheng Feng

Recent advances in text-to-audio-video (T2AV) generation have enabled models to synthesize audio-video content featuring multi-participant dialogues. However, existing evaluation benchmarks are still largely designed for human-recorded videos or single-speaker settings. As a result, structural failures in generated multi-talker dialogue videos, such as identity drift, unnatural turn transitions, and audio-visual misalignment, cannot be effectively diagnosed. To address this issue, we introduce MTAVG-Bench, a failure-driven diagnostic benchmark for multi-talker dialogue-centric audio-video generation. MTAVG-Bench is built via a semi-automatic pipeline, in which 1.8k videos are generated using mainstream T2AV models with carefully designed prompts, yielding 2.4k manually annotated QA pairs for fine-grained failure diagnosis. The benchmark evaluates multi-speaker dialogue generation at four levels: audio-visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression. Built on a hierarchical failure taxonomy and a targeted QA protocol, MTAVG-Bench is primarily designed to evaluate whether proprietary and open-source omni-models can reliably identify failure modes in multi-speaker T2AV outputs. We benchmark 12 such models on MTAVG-Bench: Gemini 3 Pro achieves the strongest overall performance, while leading open-source models remain competitive in signal fidelity and consistency. Overall, MTAVG-Bench enables fine-grained failure analysis for rigorous model comparison and targeted refinement of video generation.
