Personalized text-to-image generation methods, which generate customized images from reference images, have garnered wide research interest. Recent methods adopt a decoupled cross-attention mechanism to generate personalized images without test-time finetuning. However, when multiple reference images are provided, the current decoupled cross-attention mechanism suffers from object confusion: it fails to map each reference image to its corresponding object, which seriously limits its scope of application. To address this problem, we investigate how relevant different positions of the latent image features are to the target object in the diffusion model, and accordingly propose a weighted-merge method that merges multiple reference image features into their corresponding objects. We then integrate this weighted-merge method into existing pre-trained models and continue training on a multi-object dataset constructed from the open-source SA-1B dataset. To mitigate object confusion and reduce training costs, we propose an object quality score that estimates image quality and is used to select high-quality training samples. Furthermore, our weighted-merge training framework can be applied to single-object generation when a single object has multiple reference images. Experiments verify that our method outperforms state-of-the-art methods on the Concept101 and DreamBooth datasets for multi-object personalized image generation, and remarkably improves performance on single-object personalized image generation. Our code is available at https://github.com/hqhQAQ/MIP-Adapter.
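To make the weighted-merge idea concrete, the following is a minimal sketch in plain Python, not the implementation from the MIP-Adapter repository. It assumes that each latent position attends to each reference image's tokens separately, that the relevance of a position to a reference is the mean of its raw attention scores, and that per-position merge weights are a softmax over those relevances. The function names `attend` and `weighted_merge`, the reuse of reference tokens as both keys and values, and the relevance definition are all illustrative assumptions.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, tokens):
    """Attention of one latent position over one reference image's tokens.

    Tokens serve as both keys and values (a simplifying assumption).
    Returns the attended feature and a scalar relevance score.
    """
    scale = math.sqrt(len(query))
    scores = [dot(query, t) / scale for t in tokens]
    probs = softmax(scores)
    dim = len(tokens[0])
    out = [sum(p * t[c] for p, t in zip(probs, tokens)) for c in range(dim)]
    relevance = sum(scores) / len(scores)  # assumed: mean raw score
    return out, relevance


def weighted_merge(queries, references):
    """Merge per-reference attention outputs with per-position weights."""
    merged, weights = [], []
    for q in queries:
        outs, rels = zip(*(attend(q, ref) for ref in references))
        w = softmax(list(rels))  # weights over references for this position
        dim = len(outs[0])
        merged.append([sum(wk * o[c] for wk, o in zip(w, outs)) for c in range(dim)])
        weights.append(w)
    return merged, weights


# Toy data: two latent positions, two reference images with two tokens each.
queries = [[1.0, 0.0], [0.0, 1.0]]
references = [
    [[4.0, 0.0], [3.0, 0.0]],  # reference A, aligned with position 0
    [[0.0, 4.0], [0.0, 3.0]],  # reference B, aligned with position 1
]
merged, weights = weighted_merge(queries, references)
print(weights)
```

In this toy setup, each latent position places most of its weight on the reference it aligns with, so the features of one reference contribute little at positions belonging to the other object. That is the intuition behind reducing object confusion; the actual relevance measure used by the method may differ from the one assumed here.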