BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

Subject-driven text-to-image generation models create novel renditions of an input subject based on text prompts. Existing models suffer from lengthy fine-tuning and have difficulty preserving subject fidelity. To overcome these limitations, we introduce BLIP-Diffusion, a new subject-driven image generation model that supports multimodal control, taking subject images and text prompts as inputs. Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder that is pre-trained to provide a subject representation. We first pre-train the multimodal encoder following BLIP-2 to produce visual representations aligned with text. We then design a subject representation learning task that enables a diffusion model to leverage these visual representations and generate new subject renditions. Compared with previous methods such as DreamBooth, our model enables zero-shot subject-driven generation, as well as efficient fine-tuning for customized subjects with up to a 20x speedup. We also demonstrate that BLIP-Diffusion can be flexibly combined with existing techniques such as ControlNet and prompt-to-prompt to enable novel subject-driven generation and editing applications. Code and models will be released at https://github.com/salesforce/LAVIS/tree/main/projects/blip-diffusion. The project page is at https://dxli94.github.io/BLIP-Diffusion-website/.
