V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. The representational and generative abilities that FMs learn from vast amounts of data can be adapted and transferred to a wide range of downstream tasks without training from scratch. However, leveraging FMs for cross-modal generation remains under-researched when the audio modality is involved. At the same time, automatically generating semantically relevant sound from visual input is an important problem in cross-modal generation. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution that leverages foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent spaces of the visual CLIP model and the auditory CLAP model. We then propose a simple yet effective mapper mechanism (V2A-Mapper) that bridges the domain gap by translating the visual input from the CLIP space to the CLAP space. Conditioned on the translated CLAP embedding, the pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually aligned sound. Unlike previous approaches, our method requires only a quick training of the V2A-Mapper. We further analyze the choice of the V2A-Mapper through extensive experiments and show that a generative mapper is better at fidelity and variability (FD), while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluations on two V2A datasets demonstrate the superiority of our method over current state-of-the-art approaches: it is trained with 86% fewer parameters yet achieves 53% and 19% improvements in FD and CS, respectively.
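The regression variant of the mapper described above can be illustrated with a toy sketch. It is not the authors' implementation: the linear form of the mapper, the 4-dimensional embeddings, the synthetic "CLIP" and "CLAP" vectors, and all function names are assumptions made for illustration only. In the actual pipeline the inputs would be CLIP image embeddings, the targets would be CLAP audio embeddings, and the mapped vector would condition AudioLDM.

```python
import random

# Toy dimensions (assumption): real CLIP/CLAP embeddings are much larger.
DIM_CLIP = 4
DIM_CLAP = 4


def apply_mapper(weights, clip_vec):
    """Map a CLIP-space vector to CLAP space with a linear layer (no bias)."""
    return [sum(w * x for w, x in zip(row, clip_vec)) for row in weights]


def mean_squared_error(weights, pairs):
    """Average squared error between mapped CLIP vectors and CLAP targets."""
    total = 0.0
    for clip_vec, clap_vec in pairs:
        pred = apply_mapper(weights, clip_vec)
        total += sum((p - t) ** 2 for p, t in zip(pred, clap_vec))
    return total / (len(pairs) * DIM_CLAP)


def train_regression_mapper(pairs, lr=0.05, epochs=200):
    """Fit the linear mapper with plain stochastic gradient descent on MSE."""
    weights = [[0.0] * DIM_CLIP for _ in range(DIM_CLAP)]
    for _ in range(epochs):
        for clip_vec, clap_vec in pairs:
            pred = apply_mapper(weights, clip_vec)
            for i in range(DIM_CLAP):
                err = pred[i] - clap_vec[i]
                for j in range(DIM_CLIP):
                    weights[i][j] -= lr * err * clip_vec[j]
    return weights


# Synthetic paired data (assumption): stand-ins for (CLIP, CLAP) embeddings
# of the same video clips, related here by a hidden linear map.
rng = random.Random(0)
hidden_map = [[rng.uniform(-1, 1) for _ in range(DIM_CLIP)]
              for _ in range(DIM_CLAP)]
pairs = []
for _ in range(32):
    clip_vec = [rng.gauss(0, 1) for _ in range(DIM_CLIP)]
    pairs.append((clip_vec, apply_mapper(hidden_map, clip_vec)))

untrained = [[0.0] * DIM_CLIP for _ in range(DIM_CLAP)]
mse_before = mean_squared_error(untrained, pairs)
mapper = train_regression_mapper(pairs)
mse_after = mean_squared_error(mapper, pairs)

# The mapped vector is what would be passed to the audio generator
# as its conditioning embedding in the full pipeline.
translated = apply_mapper(mapper, pairs[0][0])
```

Only the mapper weights are trained in this sketch, which mirrors the point that the foundation models stay frozen. A generative mapper would instead sample a CLAP embedding conditioned on the CLIP embedding, which this sketch does not attempt.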
