Text-to-image (T2I) models offer a new level of flexibility by allowing users to guide the creative process through natural language. However, personalizing these models to align with user-provided visual concepts remains a challenging problem. T2I personalization poses several hard challenges, such as maintaining high visual fidelity while allowing creative control, combining multiple personalized concepts in a single image, and keeping the model size small. We present Perfusion, a T2I personalization method that addresses these challenges using dynamic rank-1 updates to the underlying T2I model. Perfusion avoids overfitting by introducing a new mechanism that "locks" new concepts' cross-attention Keys to their superordinate category. Additionally, we develop a gated rank-1 approach that enables us to control the influence of a learned concept at inference time and to combine multiple concepts. This allows runtime-efficient balancing of visual fidelity and textual alignment with a single 100KB trained model, which is five orders of magnitude smaller than the current state of the art. Moreover, it can span different operating points across the Pareto front without additional training. Finally, we show that Perfusion outperforms strong baselines in both qualitative and quantitative terms. Importantly, key-locking leads to novel results compared to traditional approaches, making it possible to portray personalized object interactions in unprecedented ways, even in one-shot settings.
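To make the idea of a gated rank-1 update concrete, the following is a minimal NumPy sketch of a generic gated rank-1 edit applied to a single linear layer. It is an illustration under stated assumptions, not Perfusion's actual formulation: the function name `gated_rank1_forward`, the projection-based similarity, and the `bias` and `temperature` parameters are all hypothetical choices made for this example.

```python
import numpy as np


def sigmoid(z):
    # Numerically stable logistic function: exp(-log(1 + exp(-z))).
    return np.exp(-np.logaddexp(0.0, -z))


def gated_rank1_forward(W, x, k_star, v_star, bias=0.5, temperature=0.05):
    """Apply W to x with a gated rank-1 edit (hypothetical sketch).

    The edit is chosen so that the concept input k_star maps (almost
    exactly) to the target output v_star, while inputs orthogonal to
    k_star are left unchanged. bias and temperature are assumed knobs
    that control how strongly the edit fires at inference time.
    """
    # Projection coefficient of x onto k_star (equals 1.0 when x == k_star).
    sim = float(x @ k_star) / float(k_star @ k_star)
    # Soft gate in (0, 1): close to 1 when x aligns with k_star, close to 0 otherwise.
    gate = sigmoid((sim - bias) / temperature)
    # What must be added to W @ k_star so that it becomes v_star.
    residual = v_star - W @ k_star
    # Equivalent to using W + gate * outer(residual, k_star) / (k_star @ k_star).
    return W @ x + gate * sim * residual


rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))      # frozen pretrained weight (toy size)
k_star = rng.normal(size=6)      # input direction standing in for the new concept
v_star = rng.normal(size=4)      # desired output for that concept

# An input orthogonal to k_star, standing in for an unrelated token.
r = rng.normal(size=6)
x_other = r - (r @ k_star) / (k_star @ k_star) * k_star

out_concept = gated_rank1_forward(W, k_star, k_star, v_star)
out_other = gated_rank1_forward(W, x_other, k_star, v_star)
```

In this toy setup only `k_star`, `v_star`, and the two scalar gate parameters would need to be stored per concept, which illustrates why a rank-1 edit can be far smaller than a fine-tuned copy of the weights. Raising `bias` or `temperature` weakens the edit for inputs that only partly align with `k_star`.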