Subject-driven text-to-image generation aims to generate customized images of a given subject from text descriptions and has drawn increasing attention. Existing methods mainly resort to finetuning a pretrained generative model, in which identity-relevant information (e.g., the boy) and identity-irrelevant information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. Such a highly entangled latent embedding can cause subject-driven text-to-image generation to fail in two ways: (i) the identity-irrelevant information hidden in the entangled embedding may dominate the generation process, so that the generated images depend heavily on the irrelevant information while ignoring the given text descriptions; (ii) the identity-relevant information carried in the entangled embedding cannot be appropriately preserved, causing the subject's identity to change in the generated images. To tackle these problems, we propose DisenBooth, an identity-preserving disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenBooth finetunes the pretrained diffusion model during the denoising process. Unlike previous works that use a single entangled embedding to denoise each image, DisenBooth uses disentangled embeddings to separately preserve the subject identity and capture the identity-irrelevant information. We further design novel weak denoising and contrastive embedding auxiliary tuning objectives to achieve the disentanglement. Extensive experiments show that DisenBooth outperforms baseline models for subject-driven text-to-image generation with the identity-preserved embedding. Moreover, by combining the identity-preserved embedding with the identity-irrelevant embedding, DisenBooth offers greater generation flexibility and controllability.
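The loss structure implied by the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the toy linear "denoiser", the embedding shapes, and the weights `w_weak` and `w_con` are all assumptions introduced only to show how a main denoising loss, a weak denoising term on the identity embedding alone, and a contrastive term between the two embeddings might fit together.

```python
# Hedged sketch of a disentangled tuning objective (toy NumPy stand-ins;
# all names, shapes, and weights are illustrative assumptions, not the
# authors' actual model or hyperparameters).
import numpy as np

rng = np.random.default_rng(0)
D = 8                              # toy embedding dimension

e_id = rng.normal(size=D)          # shared identity-preserving embedding
e_irr = rng.normal(size=D)         # per-image identity-irrelevant embedding
eps = rng.normal(size=D)           # noise the diffusion model should predict
W = rng.normal(size=(D, D)) * 0.1  # toy linear stand-in for the denoiser


def denoise(cond):
    """Stand-in for the noise prediction conditioned on an embedding."""
    return W @ cond


def cos_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Main denoising loss: both embeddings jointly condition the prediction.
loss_main = float(np.mean((eps - denoise(e_id + e_irr)) ** 2))

# Weak denoising: the identity embedding alone should still (weakly) denoise,
# so identity information cannot leak entirely into the irrelevant embedding.
w_weak = 0.01
loss_weak = w_weak * float(np.mean((eps - denoise(e_id)) ** 2))

# Contrastive embedding objective: discourage overlap between the two
# embeddings so the irrelevant embedding does not duplicate identity content.
w_con = 0.001
loss_con = w_con * abs(cos_sim(e_id, e_irr))

loss = loss_main + loss_weak + loss_con
print(f"total loss: {loss:.4f}")
```

In this sketch, minimizing `loss_weak` keeps the identity embedding informative on its own, while `loss_con` pushes the two embeddings apart; the actual objectives in the paper operate inside a pretrained diffusion model rather than on toy vectors.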