Align, Adapt and Inject: Sound-guided Unified Image Generation

Text-guided image generation has progressed rapidly with the development of diffusion models. Beyond text and image, sound is a vital element of human perception: it offers vivid representations and naturally coincides with the scenes it accompanies. Exploiting sound is therefore a promising direction for image generation research. However, the relationship between audio and image supervision remains underexplored, and the scarcity of related high-quality datasets poses a further obstacle. In this paper, we propose a unified framework, 'Align, Adapt, and Inject' (AAI), for sound-guided image generation, editing, and stylization. Our method adapts input sound into a sound token that behaves like an ordinary word and can be plugged into existing diffusion-based Text-to-Image (T2I) models. Specifically, we first train a multi-modal encoder to align the audio representation with the pre-trained textual manifold and the pre-trained visual manifold. We then propose an audio adapter that converts the audio representation into an audio token enriched with specific semantics, which can be flexibly injected into a frozen T2I model. In this way, we extract the dynamic information of varied sounds while reusing the capability of existing T2I models, enabling sound-guided image generation, editing, and stylization in a convenient and cost-effective manner. Experimental results show that AAI outperforms other state-of-the-art text-guided and sound-guided methods. Our aligned multi-modal encoder is also competitive with other approaches on audio-visual retrieval and audio-text retrieval tasks.
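
The three stages named in the abstract (align, adapt, inject) can be illustrated with a toy NumPy sketch. Everything below is an assumption made for illustration rather than the paper's implementation: the abstract does not state the alignment objective, so a symmetric InfoNCE contrastive loss is swapped in; the adapter is a hypothetical single linear layer; and the "frozen T2I model" is represented only by a fixed array of prompt token embeddings. The names `info_nce`, `audio_adapter`, and `inject_token` are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce(a, b, temperature=0.07):
    # ASSUMPTION: symmetric contrastive loss as the alignment objective.
    # Row i of `a` is treated as the positive pair of row i of `b`.
    a, b = l2_normalize(a), l2_normalize(b)
    logits = a @ b.T / temperature  # (N, N) similarity matrix

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

def audio_adapter(audio_emb, W, b):
    # HYPOTHETICAL adapter: one linear map from the audio embedding
    # to a single token in the text-embedding space of the T2I model.
    return audio_emb @ W + b

def inject_token(prompt_embs, audio_token, position):
    # Put the audio token in a placeholder slot of the prompt embeddings.
    # The input array is copied, mirroring the idea that the T2I side stays frozen.
    out = prompt_embs.copy()
    out[position] = audio_token
    return out

rng = np.random.default_rng(0)

# Align: pull audio embeddings toward paired text and image embeddings.
audio = rng.normal(size=(4, 16))
text = rng.normal(size=(4, 16))
image = rng.normal(size=(4, 16))
align_loss = info_nce(audio, text) + info_nce(audio, image)

# Adapt: map one audio embedding to a 32-dimensional token.
W = rng.normal(size=(16, 32)) * 0.1
b = np.zeros(32)
token = audio_adapter(audio[0], W, b)

# Inject: replace the placeholder at index 4 of an 8-token prompt.
prompt = rng.normal(size=(8, 32))
conditioned = inject_token(prompt, token, position=4)
print(conditioned.shape, float(align_loss))
```

In this sketch only `W` and `b` (and the audio encoder that would produce `audio`) would be trained; the prompt embeddings and the downstream generator are left untouched, which is the sense in which the token is "plug and play".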
