We propose Im2Wav, an image-guided open-domain audio generation system. Given an input image or a sequence of images, Im2Wav generates a semantically relevant sound. Im2Wav is based on two Transformer language models that operate over a hierarchical discrete audio representation obtained from a VQ-VAE-based model. We first produce a low-level audio representation using a language model. Then, we upsample the audio tokens using an additional language model to generate a high-fidelity audio sample. We use the rich semantics of a pre-trained CLIP (Contrastive Language-Image Pre-training) embedding as a visual representation to condition the language model. In addition, to steer the generation process towards the conditioning image, we apply the classifier-free guidance method. Results suggest that Im2Wav significantly outperforms the evaluated baselines in both fidelity and relevance evaluation metrics. Additionally, we provide an ablation study to better assess the impact of each of the method's components on overall performance. Lastly, to better evaluate image-to-audio models, we propose an out-of-domain image dataset, denoted as ImageHear, which can be used as a benchmark for evaluating future image-to-audio models. Samples and code can be found inside the manuscript.
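To make the classifier-free guidance step concrete, the following is a minimal sketch of how guidance is commonly applied to an autoregressive token model: the next-token logits computed with the image condition are blended with those computed without it, and the next audio token is sampled from the result. This illustrates the general technique under stated assumptions and is not the Im2Wav implementation. The function names (`cfg_logits`, `sample_token`), the guidance scale values, and the toy two-token vocabulary are all hypothetical, and the exact formulation used by Im2Wav is not given in this abstract.

```python
import math
import random


def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Blend conditional and unconditional next-token logits.

    A scale of 0 ignores the image, 1 recovers the purely conditional
    model, and values above 1 push the distribution further towards
    tokens favoured by the conditioning image. (Assumed formulation.)
    """
    return [u + guidance_scale * (c - u)
            for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(cond_logits, uncond_logits, guidance_scale, rng):
    """Sample one token index from the guided distribution."""
    probs = softmax(cfg_logits(cond_logits, uncond_logits, guidance_scale))
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    # Toy logits over a two-token vocabulary (illustrative values only).
    cond = [2.0, 0.0]    # logits given the image embedding
    uncond = [1.0, 1.0]  # logits with the condition dropped
    rng = random.Random(0)
    print(cfg_logits(cond, uncond, 3.0))
    print(sample_token(cond, uncond, 3.0, rng))
```

In a full system, both sets of logits would come from the same language model, evaluated once with the CLIP embedding and once with the condition replaced by a null input; training with the condition randomly dropped is what makes the unconditional pass meaningful.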