Human perception of the complex world relies on a comprehensive analysis of multi-modal signals, and the co-occurrence of audio and video signals provides humans with rich cues. This paper focuses on novel audio-visual scene synthesis in the real world. Given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audio along arbitrary novel camera trajectories in that scene. Directly using a NeRF-based model for audio synthesis is insufficient because it lacks prior knowledge and acoustic supervision. To tackle these challenges, we first propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF by associating audio generation with the 3D geometry of the visual environment. In addition, we propose a coordinate transformation module that expresses the viewing direction relative to the sound source. This direction transformation helps the model learn sound-source-centric acoustic fields. Moreover, we use a head-related impulse response function to synthesize pseudo binaural audio as data augmentation that strengthens training. We demonstrate the advantage of our model qualitatively and quantitatively on real-world audio-visual scenes. We refer interested readers to our video results for convincing comparisons.
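To make the idea of expressing a viewing direction relative to the sound source concrete, the following is a minimal 2D sketch. It is an illustration under simplifying assumptions, not the paper's implementation: the function name `source_relative_direction`, the planar geometry, and the yaw convention (radians, counter-clockwise from the +x axis) are all hypothetical choices made here.

```python
import math


def source_relative_direction(cam_pos, cam_yaw, src_pos):
    """Describe the sound source relative to the camera's viewing direction.

    Hypothetical helper for illustration only (2D, planar scene).
    cam_pos, src_pos: (x, y) positions in world coordinates.
    cam_yaw: camera heading in radians, counter-clockwise from the +x axis.
    Returns (distance, relative_angle), with relative_angle wrapped to
    [-pi, pi]; positive means the source lies to the left of the view axis.
    """
    dx = src_pos[0] - cam_pos[0]
    dy = src_pos[1] - cam_pos[1]
    distance = math.hypot(dx, dy)

    # World-frame bearing from the camera to the source.
    bearing = math.atan2(dy, dx)

    # Subtract the camera heading, then wrap the result to [-pi, pi].
    rel = bearing - cam_yaw
    relative_angle = math.atan2(math.sin(rel), math.cos(rel))
    return distance, relative_angle


# Example: a camera at the origin facing +y, with the source on the +x axis.
dist, angle = source_relative_direction((0.0, 0.0), math.pi / 2, (1.0, 0.0))
```

Because the output depends only on the offset between camera and source, translating both by the same amount leaves it unchanged. This is one way to read the "sound-source-centric" description: absolute world coordinates drop out, and only the pose relative to the source remains.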