We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach, the Visually-Guided Acoustic Synthesis (ViGAS) network, which learns to synthesize the sound at an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the first formulation, dataset, and approach for the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. With this work as a starting point, we believe that the future of novel-view synthesis lies in multi-modal learning from videos.