Recent progress in large-scale text-to-image (T2I) models has enabled unprecedented synthesis quality for AI-generated content (AIGC), including image generation, 3D, and video composition. Moreover, personalization techniques make it possible to generate customized images of a novel concept given only a few reference images. However, an intriguing problem persists: is it possible to capture multiple novel concepts from a single reference image? In this paper, we identify that existing approaches fail to preserve visual consistency with the reference image and fail to eliminate cross-influence between concepts. To alleviate this, we propose an attention calibration mechanism that improves the concept-level understanding of the T2I model. Specifically, we first introduce new learnable modifiers bound to classes to capture the attributes of multiple concepts. Then, the classes are separated and strengthened based on the activation of the cross-attention operation, ensuring comprehensive and self-contained concepts. Additionally, we suppress the attention activation of different classes to mitigate mutual influence among concepts. Together, our proposed method, dubbed DisenDiff, learns multiple disentangled concepts from a single image and produces novel customized images with the learned concepts. We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations. More importantly, our proposed techniques are compatible with LoRA and inpainting pipelines, enabling more interactive experiences.
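The abstract does not state how the calibration terms are formulated, so the following is only an illustrative sketch of the general idea, not the method's actual losses. It assumes that each token (a class or its modifier) yields a non-negative 2D cross-attention map, and it uses a soft intersection-over-union as a stand-in overlap measure. The names `soft_iou`, `suppression_penalty`, and `binding_penalty` are hypothetical.

```python
from typing import List

# Assumption: one cross-attention map per token, rows x cols, non-negative, non-empty.
AttentionMap = List[List[float]]


def normalize(attn: AttentionMap) -> AttentionMap:
    """Scale a map so its entries sum to 1 (uniform if it is all zeros)."""
    total = sum(sum(row) for row in attn)
    if total == 0:
        n = sum(len(row) for row in attn)
        return [[1.0 / n for _ in row] for row in attn]
    return [[v / total for v in row] for row in attn]


def soft_iou(a: AttentionMap, b: AttentionMap) -> float:
    """Soft intersection-over-union of two same-shaped maps, in [0, 1]."""
    a, b = normalize(a), normalize(b)
    inter = sum(min(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    union = sum(max(x, y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return inter / union


def suppression_penalty(class_a: AttentionMap, class_b: AttentionMap) -> float:
    """Hypothetical term: high when two different classes attend to the same region."""
    return soft_iou(class_a, class_b)


def binding_penalty(modifier: AttentionMap, cls: AttentionMap) -> float:
    """Hypothetical term: high when a modifier's attention drifts from its class."""
    return 1.0 - soft_iou(modifier, cls)
```

Under these assumptions, minimizing `suppression_penalty` pushes the maps of different classes apart, while minimizing `binding_penalty` keeps each learnable modifier focused on the same region as the class it is bound to. A real implementation would compute such terms on differentiable tensors taken from the model's cross-attention layers.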