DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation

Given a small set of images of a specific subject, subject-driven text-to-image generation aims to generate customized images of that subject according to new text descriptions, and it has recently attracted increasing attention. Current methods mainly finetune a pretrained large-scale text-to-image generation model. However, these finetuning methods map the images of the subject into an embedding that is highly entangled with subject-identity-unrelated information, which can make the generated images inconsistent with the text descriptions and can alter the subject identity. To tackle this problem, we propose DisenBooth, a disentangled parameter-efficient tuning framework for subject-driven text-to-image generation. By disentangling the embedding into an identity-related part and an identity-unrelated part, DisenBooth generates new images that simultaneously preserve the subject identity and conform to the text descriptions. Specifically, DisenBooth builds on pretrained diffusion models and conducts finetuning in the diffusion denoising process, where a shared identity embedding and an image-specific identity-unrelated embedding are used jointly to denoise each image. To disentangle the two embeddings, we propose two auxiliary objectives. Additionally, to improve finetuning efficiency, we adopt a parameter-efficient finetuning strategy. Extensive experiments show that DisenBooth faithfully learns well-disentangled identity-related and identity-unrelated embeddings. With the shared identity embedding, DisenBooth demonstrates superior subject-driven text-to-image generation ability. Moreover, different combinations of the disentangled embeddings make DisenBooth a more flexible and controllable framework.
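To make the joint use of the two embeddings concrete, the following is a minimal toy sketch in plain Python, not the DisenBooth implementation. It assumes the shared identity embedding and the image-specific identity-unrelated embedding are combined by addition before conditioning the denoiser. The abstract does not specify the two auxiliary objectives, so the two terms used here (an identity-only denoising loss and a cosine-similarity overlap penalty) are illustrative assumptions, and `toy_denoiser` is a hypothetical stand-in for the pretrained diffusion network.

```python
import math


def add(u, v):
    """Element-wise sum of two equal-length vectors."""
    return [a + b for a, b in zip(u, v)]


def mse(u, v):
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def cosine(u, v):
    """Cosine similarity, with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def toy_denoiser(noisy, cond):
    """Hypothetical stand-in for the pretrained denoiser.

    It predicts the noise as (noisy input - conditioning), so the
    prediction is exact when the conditioning fully explains the image.
    """
    return [x - c for x, c in zip(noisy, cond)]


def disentangled_losses(image, noise, identity_emb, unrelated_emb):
    """Return (joint denoising loss, identity-only loss, embedding overlap).

    The last two terms are assumed examples of auxiliary objectives;
    the abstract does not say what the actual objectives are.
    """
    noisy = add(image, noise)
    # Both embeddings are used jointly to denoise this particular image.
    joint_cond = add(identity_emb, unrelated_emb)
    joint_loss = mse(toy_denoiser(noisy, joint_cond), noise)
    # Assumed auxiliary term 1: the shared identity embedding alone
    # should still carry useful information for denoising.
    identity_only_loss = mse(toy_denoiser(noisy, identity_emb), noise)
    # Assumed auxiliary term 2: penalize overlap between the two embeddings.
    overlap = cosine(identity_emb, unrelated_emb)
    return joint_loss, identity_only_loss, overlap


# Toy data: the image is exactly identity part + identity-unrelated part.
identity = [1.0, 0.0, 0.0, 0.0]    # shared across all images of the subject
unrelated = [0.0, 0.5, 0.0, 0.0]   # specific to this image (e.g. background)
image = add(identity, unrelated)
noise = [0.25, -0.5, 0.125, 0.0]

joint, identity_only, overlap = disentangled_losses(image, noise, identity, unrelated)
print(joint, identity_only, overlap)
```

In this toy setting the joint conditioning explains the image fully, while the identity embedding alone leaves the image-specific part unexplained. At generation time only the shared identity embedding would be combined with a new text description, which is the motivation for keeping the two embeddings separate.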
