Cross-modal Generative Model for Visual-Guided Binaural Stereo Generation

Binaural stereo audio is recorded in a way that imitates how human ears receive sound, giving listeners an immersive experience. Existing approaches rely on autoencoders that directly exploit visual spatial information to synthesize binaural stereo, which limits how well visual guidance is represented. We propose, for the first time, a visually guided generative adversarial approach for generating binaural stereo audio from mono audio. Specifically, we develop a Stereo Audio Generation Model (SAGM), which uses shared spatio-temporal visual information to guide the generator and the discriminator separately. The shared visual information is updated alternately during the generative adversarial stage, so that the generator and the discriminator each deliver their own guided knowledge while sharing the same visual information. The proposed method thus learns bidirectional, complementary visual information, which strengthens the expression of visual guidance during generation. In addition, spatial perception is a crucial attribute of binaural stereo audio, so evaluating it is essential. However, previous metrics fail to measure the spatial perception of audio. To this end, we propose, for the first time, a metric that measures the spatial perception of audio. The metric captures both the magnitude and the direction of spatial perception along the temporal dimension. Given this capability, it can to some extent substitute for demanding user studies. The proposed method achieves state-of-the-art performance on two datasets and five evaluation metrics. Qualitative experiments and user studies demonstrate that the method generates spatially realistic stereo audio.
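The abstract does not define the proposed metric, so the following is only an illustrative sketch of what "magnitude and direction of spatial perception in the temporal dimension" could look like in the simplest case. It assumes a stand-in measure, the per-frame left/right energy difference in decibels, which is not claimed to be the metric proposed here. The names `frame_level_difference`, `frame_len`, and `eps` are hypothetical.

```python
import math


def frame_level_difference(left, right, frame_len=4, eps=1e-12):
    """Per-frame left/right energy difference in dB (illustrative stand-in).

    The sign gives a direction (positive: left channel louder,
    negative: right channel louder) and the absolute value gives a
    magnitude. One value is produced per non-overlapping frame, so the
    result is a curve over time.
    """
    if len(left) != len(right):
        raise ValueError("channels must have the same length")
    curve = []
    for start in range(0, len(left) - frame_len + 1, frame_len):
        l_energy = sum(x * x for x in left[start:start + frame_len])
        r_energy = sum(x * x for x in right[start:start + frame_len])
        # eps avoids division by zero and log of zero on silent frames.
        curve.append(10.0 * math.log10((l_energy + eps) / (r_energy + eps)))
    return curve


# Toy stereo signal: louder on the left in frame 1, on the right in frame 2.
left = [1.0, -1.0, 1.0, -1.0, 0.1, -0.1, 0.1, -0.1]
right = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]
curve = frame_level_difference(left, right)
```

On this toy input the curve is roughly +20 dB in the first frame and roughly -20 dB in the second, so the sign flip tracks the source moving from left to right. A comparison between generated and ground-truth stereo could then be made frame by frame on such curves, under the same assumption.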
