Recent thrilling progress in large-scale text-to-image (T2I) models has unlocked unprecedented synthesis quality of AI-generated content (AIGC), including image generation, 3D and video composition. Furthermore, personalization techniques enable appealing customized generation of a novel concept given only a few reference images. However, an intriguing problem persists: Is it possible to capture multiple, novel concepts from a single reference image? In this paper, we identify that existing approaches fail to preserve visual consistency with the reference image and to eliminate cross-influence among concepts. To alleviate this, we propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically, we first introduce new learnable modifiers bound to classes to capture the attributes of multiple concepts. Then, the attention activations of the classes are separated and strengthened after the cross-attention operation, ensuring that each concept is comprehensive and self-contained. Additionally, we suppress the overlapping attention activations of different classes to mitigate mutual influence among concepts. Together, our proposed method, dubbed DisenDiff, can learn multiple disentangled concepts from a single image and produce novel customized images with the learned concepts. We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations. More importantly, our proposed techniques are compatible with LoRA and inpainting pipelines, enabling more interactive experiences.
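To make the calibration mechanism concrete, the sketch below illustrates the two ideas the abstract describes, binding each learnable modifier to its class and suppressing overlap between different classes' cross-attention maps, as auxiliary losses in PyTorch. All names (`attn_maps`, `class_idx`, `modifier_idx`), tensor shapes, and the specific loss forms (an L1 tie and an overlap penalty) are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of attention-calibration losses on cross-attention maps,
# assuming maps of shape (B, H*W, T): one spatial map per text token.
import torch

def calibration_losses(attn_maps, class_idx, modifier_idx):
    """class_idx / modifier_idx: token positions of each class word and
    its bound learnable modifier, one pair per concept."""
    bind_loss = attn_maps.new_zeros(())
    suppress_loss = attn_maps.new_zeros(())
    class_maps = []
    for c, m in zip(class_idx, modifier_idx):
        a_class = attn_maps[:, :, c]  # (B, H*W) map of the class token
        a_mod = attn_maps[:, :, m]    # map of its learnable modifier
        # Bind the modifier to its class so it inherits the concept's
        # spatial extent (a simple L1 tie, assumed here).
        bind_loss = bind_loss + (a_mod - a_class).abs().mean()
        # Normalize each class map before measuring overlap.
        class_maps.append(a_class / (a_class.sum(-1, keepdim=True) + 1e-8))
    # Penalize spatial overlap between different classes' activations
    # to mitigate cross-influence among concepts.
    for i in range(len(class_maps)):
        for j in range(i + 1, len(class_maps)):
            overlap = (class_maps[i] * class_maps[j]).sum(-1)
            suppress_loss = suppress_loss + overlap.mean()
    return bind_loss, suppress_loss
```

In a training loop, these two terms would be weighted and added to the standard diffusion denoising loss while fine-tuning the modifier embeddings (and, optionally, LoRA-adapted attention weights); the weights and which parameters are trained are design choices not specified by the abstract.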