How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the large information gaps that often exist between sight and sound. We design a model that schedules the learning procedure of each of its components so that it can associate the audio and visual modalities despite these gaps. The key idea is to enrich the audio features with visual information by learning to align audio with the visual latent space. We translate the input audio into visual features, then use a pre-trained generator to produce an image. To further improve the quality of the generated images, we use sound source localization to select the audio-visual pairs that have strong cross-modal correlations. We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches. We also show that we can control our model's predictions by applying simple manipulations to the input waveform or to the latent space.
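The two training-side ideas in the abstract, selecting strongly correlated audio-visual pairs and aligning audio features with the visual latent space, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the abstract does not name the alignment objective, so an InfoNCE-style contrastive loss is swapped in; the per-pair localization scores are taken as given floats; and the names `select_correlated_pairs`, `info_nce`, `keep_ratio`, and `temperature` are hypothetical. The pre-trained generator is omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def select_correlated_pairs(scores, keep_ratio=0.5):
    """Return sorted indices of the highest-scoring audio-visual pairs.

    `scores` stands in for a per-pair correlation score that a sound
    source localization model might produce (assumption, not the
    paper's exact criterion).
    """
    scores = np.asarray(scores, dtype=float)
    k = max(1, int(np.ceil(len(scores) * keep_ratio)))
    top = np.argsort(scores)[::-1][:k]  # indices of the k largest scores
    return np.sort(top)


def info_nce(audio_feats, visual_feats, temperature=0.1):
    """InfoNCE-style loss: row i of audio should match row i of visual.

    This objective is an assumed stand-in for the alignment loss.
    """
    a = l2_normalize(audio_feats)
    v = l2_normalize(visual_feats)
    logits = a @ v.T / temperature                # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))    # matched pairs lie on the diagonal


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual = rng.standard_normal((8, 16))         # placeholder visual latents
    audio = visual + 0.1 * rng.standard_normal((8, 16))  # nearly aligned audio features
    keep = select_correlated_pairs(rng.random(8), keep_ratio=0.5)
    print("kept pairs:", keep)
    print("alignment loss:", info_nce(audio[keep], visual[keep]))
```

In a full pipeline under these assumptions, the audio encoder would be trained to minimize such a loss on the selected pairs, and its output would then be passed to the frozen image generator.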