Text-to-image (T2I) customization aims to create images that embody specific visual concepts delineated in textual descriptions. However, existing works still face a central challenge: concept overfitting. To tackle this challenge, we first analyze overfitting and categorize it into concept-agnostic overfitting, which undermines non-customized concept knowledge, and concept-specific overfitting, which confines the customized concept to limited modalities, i.e., backgrounds, layouts, and styles. To evaluate the degree of overfitting, we further introduce two metrics, the Latent Fisher divergence and the Wasserstein metric, to measure the distribution changes of non-customized and customized concepts, respectively. Building on this analysis, we propose Infusion, a T2I customization method that learns target concepts without being constrained by the limited modalities of the training data, while preserving non-customized knowledge. Notably, Infusion achieves this with only 11 KB of trained parameters. Extensive experiments further demonstrate that our approach outperforms state-of-the-art methods in both single- and multi-concept customized generation.
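The abstract does not spell out how the two metrics are computed. As a rough illustration only, the sketch below estimates a 2-Wasserstein distance between latents of the customized concept sampled before and after fine-tuning, under a Gaussian approximation of both distributions (the same closed form used by FID); the function name `gaussian_w2` and the Gaussian assumption are ours, not the paper's.

```python
import numpy as np
from scipy import linalg


def gaussian_w2(latents_a: np.ndarray, latents_b: np.ndarray) -> float:
    """2-Wasserstein distance between two latent sets of shape (N, D),
    assuming each set is approximately Gaussian (illustrative sketch,
    not the paper's exact metric definition)."""
    mu_a, mu_b = latents_a.mean(axis=0), latents_b.mean(axis=0)
    cov_a = np.cov(latents_a, rowvar=False)
    cov_b = np.cov(latents_b, rowvar=False)
    # Matrix square root of the covariance product; keep the real part
    # to discard numerical noise from near-singular matrices.
    covmean = linalg.sqrtm(cov_a @ cov_b).real
    w2_sq = (np.sum((mu_a - mu_b) ** 2)
             + np.trace(cov_a + cov_b - 2.0 * covmean))
    return float(max(w2_sq, 0.0)) ** 0.5


# Usage (shapes illustrative): latents drawn from the frozen and the
# fine-tuned model for the same customized prompt.
# w2 = gaussian_w2(latents_frozen, latents_tuned)
```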