Spatial audio is essential for immersive experiences, yet novel-view acoustic synthesis (NVAS) remains challenging due to complex physical phenomena such as reflection, diffraction, and material absorption. Existing methods based on single-view or panoramic inputs improve spatial fidelity but fail to capture global geometry and semantic cues such as object layout and material properties. To address this, we propose Phys-NVAS, the first physics-aware NVAS framework that integrates spatial geometry modeling with vision-language semantic priors. A global 3D acoustic environment is reconstructed from multi-view images and depth maps to estimate room size and shape, enhancing spatial awareness of sound propagation. Meanwhile, a vision-language model extracts physics-aware priors of objects, layouts, and materials, capturing absorption and reflection properties beyond geometry. An acoustic feature fusion adapter unifies these cues into a physics-aware representation for binaural generation. Experiments on the RWAVS dataset demonstrate that Phys-NVAS yields binaural audio with improved realism and physical consistency.
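To make the fusion step concrete, here is a minimal illustrative sketch (not the paper's implementation) of how an adapter might project geometry-derived and vision-language-derived feature vectors into a shared space and combine them into one representation. All names, dimensions, and the scalar gating scheme are hypothetical assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_features(geo_feat, sem_feat, d_model=128):
    """Toy acoustic-feature fusion adapter (illustrative only):
    project each modality into a shared d_model-dim space,
    then combine them with a scalar sigmoid gate."""
    # Random projection matrices stand in for learned weights.
    W_g = rng.standard_normal((geo_feat.shape[-1], d_model)) / np.sqrt(geo_feat.shape[-1])
    W_s = rng.standard_normal((sem_feat.shape[-1], d_model)) / np.sqrt(sem_feat.shape[-1])
    g = geo_feat @ W_g  # geometry cue (e.g. room size/shape embedding)
    s = sem_feat @ W_s  # semantic cue (e.g. material/layout embedding)
    alpha = 1.0 / (1.0 + np.exp(-(g + s).mean()))  # scalar sigmoid gate
    return alpha * g + (1.0 - alpha) * s           # unified representation

geo = rng.standard_normal(64)   # hypothetical 3D-reconstruction feature
sem = rng.standard_normal(512)  # hypothetical vision-language feature
fused = fuse_features(geo, sem)
print(fused.shape)  # (128,)
```

In a real system the projections would be learned end-to-end and the gate would likely be elementwise rather than scalar; the sketch only conveys the idea of mapping heterogeneous cues into one physics-aware feature vector.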