PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu

Source: arXiv. Key figures: 6 major physical dimensions, 41 fine-grained test points, 337 groups of variable-controlled test samples, 11,605 newly recorded videos.

Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks focus primarily on audio-video temporal synchronization and largely overlook explicit evaluation of audio-physics grounding, which limits the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark that systematically evaluates the audio-physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset of 11,605 audible videos (25.5 hours in total) collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired-prompt groups with controlled physical variations that drive sound differences, each grounded with an average of 17 videos and spanning 6 audio-physics dimensions and 41 fine-grained test points. Each prompt pair is annotated with the physical factors underlying its acoustic differences. Importantly, PhyAVBench uses these paired text prompts to evaluate audio-physics grounding. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art models. Our results reveal that even leading commercial models struggle with fundamental audio-physical phenomena, exposing a critical gap beyond audio-visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-visual generation. Prompts, ground-truth data, and generated video samples are available at https://github.com/imxtx/PhyAVBench.
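
The dataset figures quoted in the abstract can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not part of the benchmark: it assumes that each paired-prompt group contains exactly two prompts and that every video belongs to exactly one prompt, neither of which the abstract states explicitly. The constant and function names are hypothetical.

```python
# Figures quoted in the abstract.
TOTAL_VIDEOS = 11_605
TOTAL_HOURS = 25.5
PROMPT_GROUPS = 337

# Assumption (not stated in the abstract): one group = one pair of prompts.
PROMPTS_PER_GROUP = 2


def dataset_summary():
    """Derive per-prompt, per-group, and per-clip averages from the totals."""
    prompts = PROMPT_GROUPS * PROMPTS_PER_GROUP
    return {
        "prompts": prompts,
        "videos_per_prompt": TOTAL_VIDEOS / prompts,
        "videos_per_group": TOTAL_VIDEOS / PROMPT_GROUPS,
        "mean_clip_seconds": TOTAL_HOURS * 3600 / TOTAL_VIDEOS,
    }


if __name__ == "__main__":
    for name, value in dataset_summary().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, the "average of 17 videos" matches the per-prompt average (about 17.2) rather than the per-group average (about 34.4), and the mean clip length comes out to roughly 8 seconds.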
