How does audio describe the world around us? In this work, we propose a method for generating images of visual scenes from diverse in-the-wild sounds. This cross-modal generation task is challenging due to the significant information gap between auditory and visual signals. We address this challenge by designing a model that aligns the audio-visual modalities, enriching audio features with visual information and translating them into the visual latent space. These features are then fed into a pre-trained image generator to produce images. To enhance image quality, we use sound source localization to select audio-visual pairs with strong cross-modal correlations. Our method achieves substantially better results on the VEGAS and VGGSound datasets than previous work and demonstrates control over the generation process through simple manipulations of the input waveform or the latent space. Furthermore, we analyze the geometric properties of the learned embedding space and demonstrate that our learning approach effectively aligns audio-visual signals for cross-modal generation. Based on this analysis, we show that our method is agnostic to specific design choices, generalizing across various model architectures and different types of audio-visual data.
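The pair-selection step above can be sketched in a minimal form: given paired audio and visual embeddings, keep only the pairs whose cross-modal similarity exceeds a threshold. This is an illustrative assumption, not the authors' implementation; the paper scores pairs with sound source localization, whereas the sketch below stands in a plain cosine similarity, and the embedding dimensions, threshold, and synthetic data are all hypothetical.

```python
import numpy as np

def select_strong_pairs(audio_emb, visual_emb, threshold=0.5):
    """Return indices of audio-visual pairs with high cross-modal similarity.

    Stand-in for the paper's localization-based filtering: here the score
    is simply the cosine similarity between each paired embedding.
    """
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    sims = np.sum(a * v, axis=1)  # per-pair cosine similarity
    return np.nonzero(sims >= threshold)[0], sims

# Synthetic demo: visual embeddings are a noisy copy of the audio ones,
# so every pair is strongly correlated and survives the filter.
rng = np.random.default_rng(0)
audio = rng.normal(size=(6, 8))
visual = audio + 0.1 * rng.normal(size=(6, 8))
keep, sims = select_strong_pairs(audio, visual)
print(len(keep))  # → 6
```

In practice such a filter trades dataset size for pair quality: a stricter threshold keeps fewer training pairs but ones with stronger audio-visual correspondence, which the abstract reports improves generated image quality.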