Recent advances in Video-to-Audio (V2A) generation have achieved impressive perceptual quality and temporal synchronization, yet most models remain appearance-driven, capturing visual-acoustic correlations without considering the physical factors that shape real-world sounds. We present Physics-Aware Video-to-Audio Synthesis (PAVAS), a method that incorporates physical reasoning into latent diffusion-based V2A generation through the Physics-Driven Audio Adapter (Phy-Adapter). The adapter receives object-level physical parameters estimated by the Physical Parameter Estimator (PPE), which uses a Vision-Language Model (VLM) to infer the mass of the moving object and a segmentation-based dynamic 3D reconstruction module to recover its motion trajectory, from which velocity is computed. These physical cues enable the model to synthesize sounds that reflect the underlying physical factors. To assess physical realism, we curate VGG-Impact, a benchmark focusing on object-object interactions, and introduce the Audio-Physics Correlation Coefficient (APCC), an evaluation metric that measures consistency between physical and auditory attributes. Comprehensive experiments show that PAVAS produces physically plausible and perceptually coherent audio, outperforming existing V2A models in both quantitative and qualitative evaluations. Demo videos are available at https://physics-aware-video-to-audio-synthesis.github.io.
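To make the two quantitative ideas in the abstract concrete, the following is a minimal sketch, not the PAVAS implementation. It assumes that velocity is obtained by finite differences over a recovered 3D trajectory and that a correlation-style metric can be illustrated with a plain Pearson coefficient. The abstract does not define how APCC is computed, so `pearson` is only a stand-in, and the function names, the kinetic-energy feature, and the per-clip loudness values are all hypothetical.

```python
import math


def speeds_from_trajectory(points, fps):
    """Finite-difference speed between consecutive 3D positions.

    `points` is a list of (x, y, z) tuples, one per frame (assumed format);
    `fps` is the video frame rate. Returns len(points) - 1 speeds.
    """
    dt = 1.0 / fps
    return [math.dist(p, q) / dt for p, q in zip(points, points[1:])]


def kinetic_energy(mass, speed):
    """Classical kinetic energy, used here as one possible physical attribute."""
    return 0.5 * mass * speed ** 2


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences.

    Stand-in for a physics/audio consistency score; the actual APCC
    definition is not given in the abstract.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical falling object sampled at 10 frames per second.
    trajectory = [(0.0, 0.0, 0.0), (0.0, 0.0, -0.1), (0.0, 0.0, -0.3)]
    speeds = speeds_from_trajectory(trajectory, fps=10)
    impact_energy = kinetic_energy(mass=2.0, speed=speeds[-1])
    print(speeds, impact_energy)

    # Hypothetical per-clip impact energies and loudness of generated audio.
    energies = [1.0, 2.0, 3.0]
    loudness = [2.0, 4.0, 6.0]
    print(pearson(energies, loudness))
```

Under these assumptions, a score near 1 would indicate that clips with more energetic impacts are rendered with proportionally louder audio, which is the kind of physical-auditory consistency the abstract attributes to APCC.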