Recent advances in tuning-free personalized image generation based on diffusion models are impressive. However, to improve subject fidelity, existing methods either retrain the diffusion model or infuse it with dense visual embeddings, and both approaches suffer from poor generalization and efficiency. These methods also falter in multi-subject image generation because their cross-attention mechanism is unconstrained. In this paper, we propose MM-Diff, a unified and tuning-free image personalization framework capable of generating high-fidelity images of both single and multiple subjects in seconds. Specifically, to simultaneously enhance text consistency and subject fidelity, MM-Diff employs a vision encoder to transform the input image into CLS and patch embeddings. The CLS embeddings serve two purposes: they augment the text embeddings, and, together with the patch embeddings, they are used to derive a small number of detail-rich subject embeddings. Both the augmented text embeddings and the subject embeddings are efficiently integrated into the diffusion model through a well-designed multimodal cross-attention mechanism. Additionally, MM-Diff introduces cross-attention map constraints during the training phase, which enables flexible multi-subject image sampling during inference without any predefined inputs (e.g., layout). Extensive experiments demonstrate the superior performance of MM-Diff over other leading methods.
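The data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how the CLS embedding augments the text embeddings, how the subject embeddings are derived, or how the two attention branches are combined. The sketch assumes (1) the projected CLS embedding is added to the embedding of one subject token, (2) a few learnable queries attend over the CLS and patch embeddings to produce the subject embeddings, and (3) the latents attend to text and subject embeddings in separate branches whose outputs are summed. All names, shapes, and the `scale` parameter are hypothetical, and the learned key/value projections and multiple heads that a real model would use are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # shared embedding width (hypothetical)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, k, v):
    """Single-head scaled dot-product attention; returns output and attention map."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights


def augment_text(text_emb, cls_emb, subject_idx, w_cls):
    """Assumed fusion rule: add the projected CLS embedding to the subject token."""
    out = text_emb.copy()
    out[subject_idx] = out[subject_idx] + cls_emb @ w_cls
    return out


def subject_embeddings(cls_emb, patch_emb, queries):
    """Assumed design: a few learnable queries attend over CLS + patch embeddings,
    compressing them into a small number of subject embeddings."""
    vision = np.vstack([cls_emb[None, :], patch_emb])
    out, _ = attend(queries, vision, vision)
    return out


def multimodal_cross_attention(latents, text_emb, subj_emb, scale=1.0):
    """Assumed combination: separate text and subject branches, outputs summed."""
    text_out, text_map = attend(latents, text_emb, text_emb)
    subj_out, subj_map = attend(latents, subj_emb, subj_emb)
    return text_out + scale * subj_out, text_map, subj_map


# Toy inputs standing in for real encoder outputs.
text_emb = rng.normal(size=(8, D))      # 8 prompt tokens
cls_emb = rng.normal(size=D)            # one CLS embedding
patch_emb = rng.normal(size=(49, D))    # 7x7 patch embeddings
queries = rng.normal(size=(4, D))       # 4 subject embeddings to derive
w_cls = rng.normal(size=(D, D)) * 0.1   # CLS projection (random, untrained)
latents = rng.normal(size=(64, D))      # 8x8 latent positions

aug_text = augment_text(text_emb, cls_emb, subject_idx=3, w_cls=w_cls)
subj = subject_embeddings(cls_emb, patch_emb, queries)
fused, text_map, subj_map = multimodal_cross_attention(latents, aug_text, subj)
print(fused.shape, text_map.shape, subj_map.shape)
```

The returned `text_map` and `subj_map` are the per-position attention maps. Maps of this kind are what a training-time cross-attention constraint would act on, although the abstract does not give the form of that constraint, so none is sketched here.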