Recent years have witnessed the power of large text-to-image diffusion models, whose generative capability produces high-fidelity images. However, generating a desired image from a text prompt alone is difficult, as it often requires complex prompt engineering. An alternative to the text prompt is the image prompt; as the saying goes, "an image is worth a thousand words". Although existing methods that directly fine-tune pretrained models are effective, they require large computing resources and are not compatible with other base models, text prompts, or structural controls. In this paper, we present IP-Adapter, an effective and lightweight adapter that adds image prompt capability to pretrained text-to-image diffusion models. The key design of IP-Adapter is a decoupled cross-attention mechanism that uses separate cross-attention layers for text features and image features. Despite the simplicity of our method, an IP-Adapter with only 22M parameters achieves performance comparable to, or even better than, that of a fully fine-tuned image prompt model. Because we freeze the pretrained diffusion model, the proposed IP-Adapter generalizes not only to other custom models fine-tuned from the same base model, but also to controllable generation with existing controllable tools. Thanks to the decoupled cross-attention strategy, the image prompt also works well together with the text prompt, enabling multimodal image generation. The project page is available at \url{https://ip-adapter.github.io}.
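To make the decoupled cross-attention design described above more concrete, the following is a minimal NumPy sketch, not the released implementation. The abstract does not spell out the exact formulation, so the sketch assumes that a single query projection is shared, that text features and image features each get their own key and value projections, and that the two attention outputs are summed. All names (`decoupled_cross_attention`, `init_params`, `image_scale`, and the weight keys) and all dimensions are hypothetical and chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention for a single head, no batching.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def init_params(rng, d_latent, d_text, d_image, d_attn):
    # Random weights for illustration. In the setting the abstract describes,
    # the base model's weights would be frozen and only the adapter's trained;
    # treating the image-branch projections as the adapter is an assumption.
    def w(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))

    return {
        "w_q": w(d_latent, d_attn),
        "w_k_text": w(d_text, d_attn),
        "w_v_text": w(d_text, d_attn),
        "w_k_image": w(d_image, d_attn),
        "w_v_image": w(d_image, d_attn),
    }


def decoupled_cross_attention(x, text_feats, image_feats, params, image_scale=1.0):
    # One shared query (assumption), two separate key/value branches.
    q = x @ params["w_q"]
    text_out = attention(
        q, text_feats @ params["w_k_text"], text_feats @ params["w_v_text"]
    )
    image_out = attention(
        q, image_feats @ params["w_k_image"], image_feats @ params["w_v_image"]
    )
    # Summing the branches is an assumption; image_scale is a hypothetical
    # knob that weights the image prompt against the text prompt.
    return text_out + image_scale * image_out


rng = np.random.default_rng(0)
params = init_params(rng, d_latent=32, d_text=24, d_image=20, d_attn=16)
latent_tokens = rng.normal(size=(4, 32))   # hypothetical latent queries
text_tokens = rng.normal(size=(10, 24))    # hypothetical text prompt features
image_tokens = rng.normal(size=(3, 20))    # hypothetical image prompt features

mixed = decoupled_cross_attention(latent_tokens, text_tokens, image_tokens, params)
print(mixed.shape)
```

Under these assumptions, setting `image_scale=0.0` reduces the layer to ordinary text-only cross-attention. This is one way to see why an adapter of this shape could leave the frozen base model's text-conditioned behavior untouched.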