FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention

Diffusion models excel at text-to-image generation, especially subject-driven generation of personalized images. However, existing methods are inefficient because they require subject-specific fine-tuning, which is computationally intensive and hampers deployment. They also struggle with multi-subject generation, often blending features across subjects. We present FastComposer, which enables efficient, personalized, multi-subject text-to-image generation without fine-tuning. FastComposer uses subject embeddings extracted by an image encoder to augment the generic text conditioning in diffusion models, enabling personalized image generation from subject images and textual instructions with only forward passes. To address identity blending in multi-subject generation, FastComposer introduces cross-attention localization supervision during training, which forces the attention of each reference subject to localize to the correct region of the target image. Because naively conditioning on subject embeddings results in subject overfitting, FastComposer also introduces delayed subject conditioning during denoising to preserve both identity and editability in subject-driven image generation. FastComposer generates images of multiple unseen individuals with different styles, actions, and contexts. It achieves a 300$\times$-2500$\times$ speedup over fine-tuning-based methods and requires zero extra storage for new subjects. FastComposer paves the way for efficient, personalized, and high-quality multi-subject image creation. Code, model, and dataset are available at https://github.com/mit-han-lab/fastcomposer.
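
The two mechanisms named in the abstract, delayed subject conditioning and cross-attention localization supervision, can be illustrated with a minimal sketch. This is not the released implementation. It assumes that "delayed" means the early denoising steps use text-only conditioning before switching to subject-augmented conditioning, and that the localization objective rewards attention inside a subject's segmentation mask and penalizes attention outside it. The names `conditioning_schedule`, `localization_loss`, and `text_only_ratio` are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def conditioning_schedule(num_steps: int, text_only_ratio: float) -> List[str]:
    """Return which conditioning each denoising step would use.

    Assumption: the first `text_only_ratio` fraction of steps is conditioned
    on text alone, and the remaining steps on text augmented with subject
    embeddings.
    """
    if not 0.0 <= text_only_ratio <= 1.0:
        raise ValueError("text_only_ratio must lie in [0, 1]")
    switch_step = int(round(num_steps * text_only_ratio))
    return [
        "text" if step < switch_step else "text+subject"
        for step in range(num_steps)
    ]


def localization_loss(attention: Sequence[float], mask: Sequence[int]) -> float:
    """One possible localization objective (hypothetical form).

    `attention` is a flattened cross-attention map for one subject token and
    `mask` is the matching flattened binary segmentation mask. The loss is
    the mean attention outside the mask minus the mean attention inside it,
    so lower values mean better-localized attention.
    """
    if len(attention) != len(mask):
        raise ValueError("attention and mask must have the same length")
    inside = [a for a, m in zip(attention, mask) if m]
    outside = [a for a, m in zip(attention, mask) if not m]
    if not inside or not outside:
        raise ValueError("mask must contain both subject and background pixels")
    return sum(outside) / len(outside) - sum(inside) / len(inside)


if __name__ == "__main__":
    # Ten denoising steps, with the first 40% conditioned on text only.
    print(conditioning_schedule(num_steps=10, text_only_ratio=0.4))
    # Attention concentrated on the subject region gives a negative loss.
    print(localization_loss([0.9, 0.8, 0.1, 0.0], [1, 1, 0, 0]))
```

Under these assumptions, a larger `text_only_ratio` leaves more steps for the text prompt to shape layout and style (editability), while a smaller one gives the subject embeddings more steps to influence the image (identity).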
