We present Listen2Scene, an end-to-end binaural audio rendering approach for virtual reality (VR) and augmented reality (AR) applications. We propose a novel neural-network-based binaural sound propagation method that generates acoustic effects for 3D models of real environments. Any clean or dry audio can be convolved with the generated acoustic effects to render audio corresponding to the real environment. We propose a graph neural network that uses both the material and the topology information of the 3D scene to generate a scene latent vector. We then use a conditional generative adversarial network (CGAN) to generate acoustic effects from the scene latent vector. Our network can handle holes and other artifacts in the reconstructed 3D mesh model. We present an efficient cost function for the generator network that incorporates spatial audio effects. Given the source and listener positions, our learning-based binaural sound propagation approach can generate an acoustic effect in 0.1 milliseconds on an NVIDIA GeForce RTX 2080 Ti GPU and can easily handle multiple sources. We have evaluated the accuracy of our approach against binaural acoustic effects generated using an interactive geometric sound propagation algorithm and against captured real acoustic effects. We also performed a perceptual evaluation and observed that the audio rendered by our approach is more plausible than audio rendered using prior learning-based sound propagation algorithms.
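The rendering step described in the abstract, convolving dry audio with a generated acoustic effect, can be illustrated with a minimal sketch. It assumes the acoustic effect is available as a two-channel (left/right) impulse response array, which the abstract does not state explicitly. The function names `render_binaural` and `mix_sources` are hypothetical and are not taken from the Listen2Scene code. Multiple sources are handled here by summing the per-source renderings, which is one plausible reading of "can easily handle multiple sources".

```python
import numpy as np


def render_binaural(dry, brir):
    """Convolve mono dry audio with a two-channel impulse response.

    dry:  1-D array of shape (n,), the clean/dry signal.
    brir: 2-D array of shape (2, m), assumed left/right impulse responses.
    Returns an array of shape (2, n + m - 1).
    """
    dry = np.asarray(dry, dtype=np.float64)
    brir = np.asarray(brir, dtype=np.float64)
    if dry.ndim != 1 or brir.ndim != 2 or brir.shape[0] != 2:
        raise ValueError("expected dry of shape (n,) and brir of shape (2, m)")
    # Full linear convolution, computed independently for each ear.
    return np.stack([np.convolve(dry, brir[ch]) for ch in range(2)])


def mix_sources(dry_signals, brirs):
    """Render several sources and sum them into one binaural signal.

    Shorter renderings are zero-padded at the end before summing.
    """
    rendered = [render_binaural(d, h) for d, h in zip(dry_signals, brirs)]
    length = max(r.shape[1] for r in rendered)
    out = np.zeros((2, length))
    for r in rendered:
        out[:, : r.shape[1]] += r
    return out


if __name__ == "__main__":
    # Toy data: a unit impulse in the left ear, a one-sample delay in the right.
    dry = np.array([1.0, 2.0, 3.0])
    brir = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(render_binaural(dry, brir))
```

For impulse responses of realistic length, an FFT-based convolution would typically be faster than the direct `np.convolve` used in this sketch.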