We present Listen2Scene, an end-to-end binaural audio rendering approach for virtual reality (VR) and augmented reality (AR) applications. We propose a novel neural-network-based binaural sound propagation method that generates acoustic effects for 3D models of real environments. Any clean or dry audio can be convolved with the generated acoustic effects to render audio corresponding to the real environment. We propose a graph neural network that uses both the material and the topology information of the 3D scene to generate a scene latent vector, and we use a conditional generative adversarial network (CGAN) to generate acoustic effects from this latent vector. Our network can handle holes and other artifacts in the reconstructed 3D mesh model. We introduce an efficient cost function for the generator network to incorporate spatial audio effects. Given the source and listener positions, our learning-based binaural sound propagation approach can generate an acoustic effect in 0.1 milliseconds on an NVIDIA GeForce RTX 2080 Ti GPU and can easily handle multiple sources. We evaluate the accuracy of our approach against binaural acoustic effects generated using an interactive geometric sound propagation algorithm and against captured real acoustic effects. We also performed a perceptual evaluation and observed that audio rendered by our approach is more plausible than audio rendered using prior learning-based sound propagation algorithms.
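The rendering step described in the abstract, convolving dry audio with a generated acoustic effect, can be illustrated with a minimal sketch. It assumes that the acoustic effect is represented as a two-channel (left, right) impulse response stored as a NumPy array; the abstract does not specify this representation. The function names `render_binaural` and `mix_sources` are hypothetical and are not part of the authors' implementation. Multiple sources are handled here by summing the individually rendered signals, which is one straightforward option rather than the method the paper necessarily uses.

```python
import numpy as np


def render_binaural(dry, response):
    """Convolve mono dry audio with a two-channel response.

    dry:      1-D array of n samples (mono, dry audio).
    response: array of shape (2, m), assumed layout: row 0 = left ear,
              row 1 = right ear. This layout is an assumption.
    Returns an array of shape (2, n + m - 1).
    """
    dry = np.asarray(dry, dtype=np.float64)
    response = np.asarray(response, dtype=np.float64)
    if dry.ndim != 1 or response.ndim != 2 or response.shape[0] != 2:
        raise ValueError("expected dry of shape (n,) and response of shape (2, m)")
    # Full linear convolution, computed separately for each ear.
    left = np.convolve(dry, response[0])
    right = np.convolve(dry, response[1])
    return np.stack([left, right])


def mix_sources(dry_signals, responses):
    """Render each source with its own response and sum the results.

    Shorter rendered signals are zero-padded at the end so that all
    sources can be added sample by sample.
    """
    rendered = [render_binaural(d, r) for d, r in zip(dry_signals, responses)]
    length = max(r.shape[1] for r in rendered)
    mix = np.zeros((2, length))
    for r in rendered:
        mix[:, : r.shape[1]] += r
    return mix
```

For long signals, an FFT-based convolution would typically be faster than `np.convolve`; the direct form is used here only to keep the sketch short.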