For immersive applications, generating binaural sound that matches its visual counterpart is crucial for delivering meaningful experiences to users in a virtual environment. Recent studies have shown that neural networks can synthesize binaural audio from mono audio when guided by 2D visual information. Extending this approach by guiding the audio with 3D visual information and operating in the waveform domain may allow for a more accurate auralization of a virtual audio scene. We propose Points2Sound, a multi-modal deep learning model that generates a binaural version of mono audio guided by 3D point cloud scenes. Specifically, Points2Sound consists of a vision network and an audio network. The vision network uses 3D sparse convolutions to extract a visual feature from the point cloud scene. This visual feature then conditions the audio network, which operates in the waveform domain, to synthesize the binaural version. Results show that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural synthesis. We also investigate how 3D point cloud attributes, learning objectives, different reverberant conditions, and several types of mono mixture signals affect the binaural audio synthesis performance of Points2Sound for different numbers of sound sources present in the scene.
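The data flow described above (point cloud, then visual feature, then a conditioned waveform-domain audio network) can be illustrated with a toy sketch. This is not the Points2Sound implementation: the function names `voxelize`, `toy_visual_feature`, and `toy_binaural_synthesis`, the random weights, and the gain-based conditioning are all assumptions made for illustration. The real model uses learned 3D sparse convolutions and a learned audio network, neither of which is reproduced here.

```python
import numpy as np


def voxelize(points, voxel_size):
    """Quantize an (N, 3) point cloud to its unique occupied voxel coordinates.

    Only occupied voxels are kept, which is the sparse representation that
    3D sparse convolutions operate on.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    return np.unique(coords, axis=0)


def toy_visual_feature(voxels, proj):
    """Hypothetical stand-in for the vision network.

    Projects each voxel coordinate to F dimensions and max-pools over all
    voxels, yielding a single feature vector of shape (F,).
    """
    return np.tanh(voxels @ proj).max(axis=0)


def toy_binaural_synthesis(mono, visual_feat, w_left, w_right):
    """Hypothetical stand-in for the audio network.

    The visual feature conditions the waveform-domain mapping by setting
    one gain per output channel; returns an array of shape (2, T).
    """
    gain_left = 1.0 / (1.0 + np.exp(-(visual_feat @ w_left)))
    gain_right = 1.0 / (1.0 + np.exp(-(visual_feat @ w_right)))
    return np.stack([gain_left * mono, gain_right * mono])


# Usage with random stand-in data and untrained random weights.
rng = np.random.default_rng(0)
points = rng.uniform(-1.0, 1.0, size=(500, 3))                 # 3D scene points
mono = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)      # 1 s of mono audio

voxels = voxelize(points, voxel_size=0.25)
feat = toy_visual_feature(voxels, proj=rng.normal(size=(3, 8)))
binaural = toy_binaural_synthesis(
    mono, feat, w_left=rng.normal(size=8), w_right=rng.normal(size=8)
)
print(voxels.shape, feat.shape, binaural.shape)
```

A per-channel gain is the simplest possible form of conditioning and cannot model interaural time differences or reverberation; it only shows where the visual feature enters the audio path.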