Recent advances in text-to-audio-video (T2AV) generation have enabled models to synthesize audio-visual videos containing multi-participant dialogue. However, existing evaluation benchmarks are largely designed for human-recorded videos or single-speaker settings. As a result, errors specific to generated multi-talker dialogue videos, such as identity drift, unnatural turn transitions, and audio-visual misalignment, cannot be effectively captured or analyzed. To address this gap, we introduce MTAVG-Bench, a benchmark for evaluating audio-visual multi-speaker dialogue generation. MTAVG-Bench is built via a semi-automatic pipeline: 1.8k videos are generated by multiple popular models with carefully designed prompts, yielding 2.4k manually annotated QA pairs. The benchmark evaluates multi-speaker dialogue generation at four levels: audio-visual signal fidelity, temporal attribute consistency, social interaction, and cinematic expression. We benchmark 12 proprietary and open-source omni-models on MTAVG-Bench; Gemini 3 Pro achieves the strongest overall performance, while leading open-source models remain competitive in signal fidelity and consistency. Overall, MTAVG-Bench enables fine-grained failure analysis, supporting rigorous model comparison and targeted refinement of video generation.