PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie, Wentao Lei, Guanjie Huang, Pengfei Zhang, Kai Jiang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu

Source: arXiv. Highlights: 6 major audio physics dimensions, 50 fine-grained test points, 1,000 groups of variable-controlled test samples.

Text-to-audio-video (T2AV) generation underpins a wide range of applications demanding realistic audio-visual content, including virtual reality, world modeling, gaming, and filmmaking. However, existing T2AV models remain incapable of generating physically plausible sounds, primarily due to their limited understanding of physical principles. To situate current research progress, we present PhyAVBench, a challenging audio physics-sensitivity benchmark designed to systematically evaluate the audio physics grounding capabilities of existing T2AV models. PhyAVBench comprises 1,000 groups of paired text prompts with controlled physical variables that implicitly induce sound variations, enabling a fine-grained assessment of models' sensitivity to changes in underlying acoustic conditions. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST). Unlike prior benchmarks that primarily focus on audio-video synchronization, PhyAVBench explicitly evaluates models' understanding of the physical mechanisms underlying sound generation, covering 6 major audio physics dimensions, 4 daily scenarios (music, sound effects, speech, and their mix), and 50 fine-grained test points, ranging from fundamental aspects such as sound diffraction to more complex phenomena, e.g., Helmholtz resonance. Each test point consists of multiple groups of paired prompts, where each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. Both prompts and videos are iteratively refined through rigorous human-involved error correction and quality control to ensure high quality. We argue that only models with a genuine grasp of audio-related physical principles can generate physically consistent audio-visual content. We hope PhyAVBench will stimulate future progress in this critical yet largely unexplored domain.
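To make the idea of a controlled physical variable concrete, the sketch below uses Helmholtz resonance, one of the phenomena the abstract names. It is an illustration only, not the benchmark's data format or evaluation method: the two prompts, the `pair` dictionary, and the bottle dimensions are hypothetical. The formula is the idealized textbook expression f = (c / 2π) · sqrt(A / (V · L)), which ignores the neck end correction.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at room temperature


def helmholtz_frequency(neck_area_m2, neck_length_m, cavity_volume_m3,
                        c=SPEED_OF_SOUND):
    """Idealized Helmholtz resonance: f = c / (2*pi) * sqrt(A / (V * L)).

    The neck end correction is ignored, so values are approximate.
    """
    return c / (2 * math.pi) * math.sqrt(
        neck_area_m2 / (cavity_volume_m3 * neck_length_m)
    )


# Hypothetical prompt pair: only the air cavity volume differs.
pair = {
    "prompt_a": "A person blows across the mouth of an empty glass bottle.",
    "prompt_b": "A person blows across the mouth of a half-filled glass bottle.",
    "controlled_variable": "air cavity volume",
}

# Assumed bottle geometry (illustrative numbers, not taken from the paper).
neck_area = math.pi * 0.01 ** 2  # neck radius of 1 cm
neck_length = 0.05               # neck length of 5 cm

f_empty = helmholtz_frequency(neck_area, neck_length, 1.0e-3)  # 1 L of air
f_half = helmholtz_frequency(neck_area, neck_length, 0.5e-3)   # 0.5 L of air

print(f"empty bottle: {f_empty:.1f} Hz")
print(f"half-filled bottle: {f_half:.1f} Hz")
```

Under this idealized model, halving the air volume raises the resonance frequency by a factor of sqrt(2), so a physically grounded model would be expected to produce a higher pitch for the second prompt than for the first.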
