Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and have difficulty preserving subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control, taking subject images and text prompts as input. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder that is pre-trained to provide a subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce a visual representation aligned with the text. We then design a subject representation learning task that enables a diffusion model to leverage this visual representation and generate new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation and efficient fine-tuning for customized subjects with up to 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Code and models will be released at https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion. The project page is at https://dxli94.github.io/BLIP-Diffusion-website/.