Human perception of the complex world relies on a comprehensive analysis of multi-modal signals, and the co-occurrence of audio and video signals provides humans with rich cues. This paper focuses on novel audio-visual scene synthesis in the real world. Given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audio along arbitrary novel camera trajectories in that scene. Directly using a NeRF-based model for audio synthesis is insufficient because such a model lacks prior knowledge and acoustic supervision. To address these challenges, we first propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF by associating audio generation with the 3D geometry of the visual environment. In addition, we propose a coordinate transformation module that expresses a viewing direction relative to the sound source. This direction transformation helps the model learn sound source-centric acoustic fields. Moreover, we use a head-related impulse response function to synthesize pseudo binaural audio as data augmentation that strengthens training. We demonstrate the advantage of our model qualitatively and quantitatively on real-world audio-visual scenes. We refer interested readers to our video results for convincing comparisons.
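The idea behind the coordinate transformation module, expressing a viewing direction relative to the sound source, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `source_relative_angle` and `encode_relative_direction`, the restriction to the horizontal plane, and the cosine/sine encoding are all assumptions made for illustration.

```python
import numpy as np


def source_relative_angle(cam_pos, cam_dir, src_pos):
    """Signed angle (radians) from the camera's viewing direction to the
    direction of the sound source, measured in the horizontal plane.

    Hypothetical helper: all inputs are 2D (x, y) coordinates.
    """
    cam_pos = np.asarray(cam_pos, dtype=float)
    cam_dir = np.asarray(cam_dir, dtype=float)
    src_pos = np.asarray(src_pos, dtype=float)

    # Vector pointing from the camera to the sound source.
    to_src = src_pos - cam_pos

    # 2D cross and dot products give the sine and cosine of the angle,
    # each scaled by the same positive factor, so arctan2 recovers the angle.
    cross = cam_dir[0] * to_src[1] - cam_dir[1] * to_src[0]
    dot = cam_dir[0] * to_src[0] + cam_dir[1] * to_src[1]
    return np.arctan2(cross, dot)


def encode_relative_direction(cam_pos, cam_dir, src_pos):
    """Encode the source-relative angle as (cos, sin), an assumed input
    format that avoids the discontinuity of a raw angle at +/- pi."""
    theta = source_relative_angle(cam_pos, cam_dir, src_pos)
    return np.array([np.cos(theta), np.sin(theta)])


# Camera at the origin facing +x, source directly to its left (+y).
angle = source_relative_angle([0.0, 0.0], [1.0, 0.0], [0.0, 1.0])
feature = encode_relative_direction([0.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

Because the angle depends only on the geometry of the camera relative to the source, it is unchanged when camera and source are translated or rotated together, which is one plausible reading of what "sound source-centric" means here.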