Can machines that record an audio-visual scene produce realistic, matching audio-visual experiences at novel positions and novel view directions? We answer this question by studying a new task, real-world audio-visual scene synthesis, and by proposing a first-of-its-kind NeRF-based approach for multimodal learning. Concretely, given a video recording of an audio-visual scene, the task is to synthesize new videos with spatial audio along arbitrary novel camera trajectories in that scene. We propose an acoustic-aware audio generation module that integrates prior knowledge of audio propagation into NeRF, implicitly associating audio generation with the 3D geometry and material properties of the visual environment. Furthermore, we present a coordinate transformation module that expresses a view direction relative to the sound source, enabling the model to learn sound source-centric acoustic fields. To facilitate the study of this new task, we collect a high-quality Real-World Audio-Visual Scene (RWAVS) dataset. We demonstrate the advantages of our method on this real-world dataset and on the simulation-based SoundSpaces dataset.
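To illustrate the idea of expressing a view direction relative to the sound source, the following is a minimal sketch rather than the module described above. It assumes a simplified 2D setting with a known source position; the function name `source_centric_pose` and its arguments are hypothetical and chosen only for illustration.

```python
import math


def source_centric_pose(listener_xy, view_dir_xy, source_xy):
    """Describe a listener pose relative to a sound source (2D illustration).

    Returns (distance, relative_angle), where relative_angle is the signed
    angle in radians from the view direction to the direction of the source,
    wrapped to [-pi, pi]. All names here are illustrative assumptions.
    """
    # Vector from the listener to the sound source.
    dx = source_xy[0] - listener_xy[0]
    dy = source_xy[1] - listener_xy[1]
    distance = math.hypot(dx, dy)

    # Absolute angles of both directions in the world frame.
    angle_to_source = math.atan2(dy, dx)
    angle_of_view = math.atan2(view_dir_xy[1], view_dir_xy[0])

    # Signed difference, wrapped so that a full turn maps back to zero.
    diff = angle_to_source - angle_of_view
    relative_angle = math.atan2(math.sin(diff), math.cos(diff))
    return distance, relative_angle


# The same relative configuration gives the same output anywhere in the scene.
pose_a = source_centric_pose((0.0, 0.0), (1.0, 0.0), (0.0, 2.0))
pose_b = source_centric_pose((5.0, -3.0), (1.0, 0.0), (5.0, -1.0))
```

Because the output depends only on where the source lies relative to the listener's position and heading, two poses that differ by a global translation map to the same representation, which is one plausible reason a source-centric parameterization could be easier to learn from than absolute coordinates.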