AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models

Rapid advances in audio-video (AV) generation have enabled high-fidelity video synthesis with synchronized sound, particularly in human-related scenarios involving speech and interaction. Yet evaluation of AV generation remains at an early stage: the few existing benchmarks cover human-related scenarios only coarsely and rely on a limited set of preset evaluations performed by generic multimodal LLMs, which leads to inaccurate assessments of model capabilities. To address these issues, we introduce AVBench, a fully automated benchmark tailored to human-centric AV generation. AVBench is built on two key designs for comprehensive and accurate evaluation: (i) Human-centric, fine-grained metrics. AVBench integrates ten evaluation dimensions designed for human-centric real-world scenarios, covering visual quality, audio quality, and multi-level consistency across modalities. These practical metrics capture human-related details that existing benchmarks often overlook. (ii) Specialized evaluators trained via preference learning. To address the lack of specialized training data, we construct large-scale supervision by transforming real-world videos into diverse training pairs with controlled perturbations. After fine-tuning on this high-quality dataset, the evaluators learn to reliably detect subtle cross-modal inconsistencies. Crucially, instead of producing discrete textual judgments, AVBench derives continuous evaluation scores from the model's prediction confidence on binary decisions. This probabilistic scoring mechanism enables more reliable assessment than traditional VQA-style evaluation and aligns closely with human judgment. Taken together, AVBench offers automated evaluation for AV generation, demonstrates strong potential for data filtering, and can serve as a differentiable reward signal for Reinforcement Learning from Human Feedback (RLHF).
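To illustrate the probabilistic scoring idea, the following is a minimal sketch, not the AVBench implementation. It assumes the evaluator exposes one logit for each of the two answers to a binary question (for example "yes" and "no"); the function names and the logit values are hypothetical. The score is the two-way softmax probability of "yes", which varies smoothly with the logits, whereas a discrete judgment collapses different confidence levels into the same answer.

```python
import math


def confidence_score(logit_yes: float, logit_no: float) -> float:
    """Continuous score in [0, 1]: two-way softmax probability of "yes".

    Assumption: the evaluator provides one logit per binary answer.
    A softmax over two logits equals a sigmoid of their difference.
    """
    diff = logit_yes - logit_no
    # Branch on the sign so math.exp never receives a large positive argument.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def hard_judgment(logit_yes: float, logit_no: float) -> str:
    """Discrete VQA-style answer, shown only for contrast."""
    return "yes" if logit_yes >= logit_no else "no"


if __name__ == "__main__":
    # Hypothetical logits for two generated clips on the same question.
    clips = {"clip_a": (2.0, 1.5), "clip_b": (6.0, -3.0)}
    for name, (ly, ln) in clips.items():
        print(name, hard_judgment(ly, ln), round(confidence_score(ly, ln), 4))
```

In this sketch both clips receive the same discrete answer, but the continuous score separates a borderline case from a confident one, which is what makes such a score usable for ranking or filtering.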
