Text-guided image generation has progressed rapidly with the development of diffusion models. Beyond text and image, sound is a vital element of human perception: it offers vivid representations and naturally co-occurs with the scenes it belongs to. Exploiting sound is therefore a promising direction for image generation research. However, the relationship between audio and image supervision remains underexplored, and the scarcity of related high-quality datasets poses a further obstacle. In this paper, we propose a unified framework, 'Align, Adapt, and Inject' (AAI), for sound-guided image generation, editing, and stylization. Our method adapts an input sound into a sound token that behaves like an ordinary word and can be used plug-and-play with existing powerful diffusion-based Text-to-Image (T2I) models. Specifically, we first train a multi-modal encoder to align the audio representation with the pre-trained textual manifold and the pre-trained visual manifold, respectively. We then propose an audio adapter that converts the audio representation into an audio token enriched with specific semantics, which can be flexibly injected into a frozen T2I model. In this way, we extract the dynamic information of varied sounds while leveraging the capability of existing T2I models to perform sound-guided image generation, editing, and stylization in a convenient and cost-effective manner. Experimental results confirm that the proposed AAI outperforms other state-of-the-art text-guided and sound-guided methods. Our aligned multi-modal encoder is also competitive with other approaches on audio-visual and audio-text retrieval tasks.
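The three stages named in the abstract can be illustrated with a minimal sketch. The abstract does not specify the alignment objective, the adapter architecture, or the injection mechanism, so everything here is an assumption made for illustration: a symmetric contrastive (InfoNCE-style) loss stands in for the alignment objective, a single linear map stands in for the audio adapter, and injection is modeled as overwriting one placeholder position in the prompt's token-embedding sequence. All function names are hypothetical and do not come from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cosine_logits(a_batch, b_batch, temperature=0.07):
    """Pairwise cosine similarities between two batches, divided by a temperature."""
    a = [l2_normalize(v) for v in a_batch]
    b = [l2_normalize(v) for v in b_batch]
    return [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]


def cross_entropy_diag(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(audio, other, temperature=0.07):
    """Assumed alignment objective: InfoNCE averaged over both directions."""
    logits = cosine_logits(audio, other, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(transposed))


def alignment_loss(audio, text, image):
    """Align: pull audio embeddings toward paired text and image embeddings."""
    return symmetric_contrastive_loss(audio, text) + symmetric_contrastive_loss(audio, image)


def adapt(audio_embedding, weight, bias):
    """Adapt: hypothetical linear adapter mapping an audio embedding to one token."""
    return [sum(w * x for w, x in zip(row, audio_embedding)) + b
            for row, b in zip(weight, bias)]


def inject(prompt_embeddings, placeholder_index, audio_token):
    """Inject: copy the prompt embeddings and overwrite the placeholder position."""
    out = [list(tok) for tok in prompt_embeddings]
    out[placeholder_index] = list(audio_token)
    return out
```

In this sketch the text-to-image model itself never appears, which mirrors the abstract's point that the T2I model stays frozen: only the encoder (through `alignment_loss`) and the adapter (through `adapt`) would carry trainable parameters, and `inject` merely edits the conditioning sequence the frozen model consumes.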