V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models

Building artificial intelligence (AI) systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. The representational and generative abilities these models learn from vast amounts of data can be easily adapted and transferred to a wide range of downstream tasks without training from scratch. However, leveraging FMs for cross-modal generation remains under-researched when the audio modality is involved. At the same time, automatically generating semantically relevant sound from visual input is an important problem in cross-modal generation. To solve this vision-to-audio (V2A) generation problem, existing methods tend to design and build complex systems from scratch using modestly sized datasets. In this paper, we propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. We first investigate the domain gap between the latent spaces of the visual CLIP model and the auditory CLAP model. We then propose a simple yet effective mapper mechanism (V2A-Mapper) that bridges the domain gap by translating the visual input from the CLIP space to the CLAP space. Conditioned on the translated CLAP embedding, the pretrained audio generative FM AudioLDM is adopted to produce high-fidelity and visually aligned sound. Compared to previous approaches, our method only requires a quick training of the V2A-Mapper. We further analyze and conduct extensive experiments on the choice of the V2A-Mapper and show that a generative mapper is better at fidelity and variability (FD), while a regression mapper is slightly better at relevance (CS). Both objective and subjective evaluations on two V2A datasets demonstrate the superiority of our proposed method over current state-of-the-art approaches: it is trained with 86% fewer parameters yet achieves 53% and 19% improvements in FD and CS, respectively.
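The regression variant of the mapper idea can be illustrated with a toy sketch. This is not the authors' implementation: the embeddings below are small synthetic stand-ins for CLIP and CLAP vectors, the dimensions (4 and 3), the linear form of the mapper, the squared-error loss, and all function names are assumptions made for illustration. Cosine similarity is used here only as a simple closeness measure. The generative mapper and the AudioLDM decoding step are out of scope.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def apply_mapper(W, x):
    """Map a source-space vector x to the target space with matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def train_regression_mapper(pairs, dim_in, dim_out, lr=0.05, epochs=200):
    """Fit a linear mapper by SGD on the squared error (hypothetical setup)."""
    W = [[0.0] * dim_in for _ in range(dim_out)]
    for _ in range(epochs):
        for x, y in pairs:
            pred = apply_mapper(W, x)
            for i in range(dim_out):
                err = pred[i] - y[i]
                for j in range(dim_in):
                    W[i][j] -= lr * err * x[j]
    return W


def make_toy_pairs(n, dim_in, dim_out, seed=0):
    """Synthetic (visual, audio) embedding pairs linked by a hidden linear map."""
    rng = random.Random(seed)
    true_W = [[rng.uniform(-1, 1) for _ in range(dim_in)] for _ in range(dim_out)]
    pairs = []
    for _ in range(n):
        x = [rng.uniform(-1, 1) for _ in range(dim_in)]  # stand-in for a CLIP embedding
        y = apply_mapper(true_W, x)                      # stand-in for the paired CLAP embedding
        pairs.append((x, y))
    return pairs


pairs = make_toy_pairs(n=50, dim_in=4, dim_out=3)
W = train_regression_mapper(pairs, dim_in=4, dim_out=3)

# Average closeness between translated and target embeddings on the toy data.
mean_cos = sum(cosine(apply_mapper(W, x), y) for x, y in pairs) / len(pairs)
print(f"mean cosine similarity: {mean_cos:.4f}")
```

In a real pipeline, the translated vector would replace the conditioning embedding passed to the pretrained audio generator, so only the mapper weights need training.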

