Personalized text-to-image generation methods, which generate customized images from reference images, have garnered wide research interest. Recent methods adopt a finetuning-free approach based on a decoupled cross-attention mechanism, generating personalized images without any test-time finetuning. However, when multiple reference images are provided, the current decoupled cross-attention mechanism suffers from object confusion: it fails to map each reference image to its corresponding object, which seriously limits its scope of application. To address this problem, we investigate how relevant different positions of the latent image features are to the target object in the diffusion model, and accordingly propose a weighted-merge method that merges multiple reference image features into their corresponding objects. We then integrate this weighted-merge method into existing pre-trained models and continue training on a multi-object dataset constructed from the open-source SA-1B dataset. To mitigate object confusion and reduce training costs, we propose an object quality score that estimates image quality and is used to select high-quality training samples. Furthermore, our weighted-merge training framework can also be applied to single-object generation when a single object has multiple reference images. Experiments verify that our method outperforms state-of-the-art methods on the Concept101 and DreamBooth datasets for multi-object personalized image generation, and remarkably improves performance on single-object personalized image generation. Our code is available at https://github.com/hqhQAQ/MIP-Adapter.
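The weighted-merge idea can be illustrated with a minimal sketch. The abstract does not specify how the per-position weights are computed, so this sketch assumes that each reference image already has a scalar relevance score at every latent position and that the scores are normalized with a softmax across references. The names `softmax` and `weighted_merge` are hypothetical and are not taken from the MIP-Adapter code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_merge(ref_outputs, relevance):
    """Merge per-reference cross-attention outputs position by position.

    ref_outputs[i][p] is the feature vector (list of floats) that
    reference image i contributes at latent position p.
    relevance[i][p] is an assumed scalar score for how relevant
    position p is to the object shown in reference image i.
    """
    n_refs = len(ref_outputs)
    n_pos = len(ref_outputs[0])
    dim = len(ref_outputs[0][0])
    merged = []
    for p in range(n_pos):
        # Assumption: weights come from a softmax across references.
        weights = softmax([relevance[i][p] for i in range(n_refs)])
        merged.append([
            sum(weights[i] * ref_outputs[i][p][d] for i in range(n_refs))
            for d in range(dim)
        ])
    return merged
```

Under these assumptions, a latent position that is strongly relevant to one reference object receives almost all of its merged feature from that reference, which is the behavior intended to reduce object confusion; positions with equal relevance receive a plain average.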