In recent years, image generation has made a great leap in performance, with diffusion models playing a central role. Although they generate high-quality images, such models are mainly conditioned on textual descriptions. This raises the question: "How can we adapt such models to be conditioned on other modalities?" In this paper, we propose a novel method that uses latent diffusion models trained for text-to-image generation to generate images conditioned on audio recordings. Using a pre-trained audio encoding model, the proposed method encodes audio into a new token, which can be considered an adaptation layer between the audio and text representations. This modeling paradigm requires a small number of trainable parameters, making the proposed approach appealing for lightweight optimization. Results suggest that the proposed method is superior to the evaluated baseline methods on both objective and subjective metrics. Code and samples are available at: https://pages.cs.huji.ac.il/adiyoss-lab/AudioToken.
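To make the idea of encoding audio into a single token more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the adaptation layer is a mean pooling over audio frames followed by one linear projection, and that the resulting vector is inserted into a sequence of prompt token embeddings; the abstract does not specify either choice. All names (`make_adapter`, `audio_to_token`, `insert_token`), the dimensions, and the random stand-in features are hypothetical.

```python
import random

AUDIO_DIM = 8  # hypothetical feature size of the frozen audio encoder
TEXT_DIM = 4   # hypothetical embedding size of the frozen text encoder


def make_adapter(audio_dim, text_dim, seed=0):
    """Create the only trainable parameters in this sketch: one linear map."""
    rng = random.Random(seed)
    weight = [[rng.gauss(0.0, 0.02) for _ in range(audio_dim)]
              for _ in range(text_dim)]
    bias = [0.0] * text_dim
    return weight, bias


def audio_to_token(frames, adapter):
    """Mean-pool frame features over time, then project to one token embedding.

    Assumption: pooling plus a linear layer stands in for the adaptation layer.
    """
    weight, bias = adapter
    n_frames = len(frames)
    pooled = [sum(frame[i] for frame in frames) / n_frames
              for i in range(len(frames[0]))]
    return [sum(w * x for w, x in zip(row, pooled)) + b
            for row, b in zip(weight, bias)]


def insert_token(prompt_embeddings, token, position):
    """Return a copy of the prompt embeddings with the audio token inserted."""
    return prompt_embeddings[:position] + [token] + prompt_embeddings[position:]


# Random stand-ins for the outputs of the frozen encoders (not real features).
rng = random.Random(1)
frames = [[rng.random() for _ in range(AUDIO_DIM)] for _ in range(5)]
prompt = [[rng.random() for _ in range(TEXT_DIM)] for _ in range(3)]

adapter = make_adapter(AUDIO_DIM, TEXT_DIM)
audio_token = audio_to_token(frames, adapter)
conditioning = insert_token(prompt, audio_token, position=2)

# Only the adapter would be optimized; both encoders and the diffusion
# model stay frozen in this sketch.
n_trainable = AUDIO_DIM * TEXT_DIM + TEXT_DIM
```

In this sketch the number of trainable parameters depends only on the two embedding sizes, which illustrates why such a setup could be lightweight to optimize. The actual parameter count and architecture of the proposed method may differ.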