Personalizing text-to-image diffusion models means integrating novel visual concepts from a small set of reference images while retaining the model's original generative capabilities. This process often leads to overfitting, in which the model ignores the user's prompt and merely replicates the reference images. We attribute this issue to a fundamental misalignment between the true goals of personalization, namely subject fidelity and text alignment, and the training objectives of existing methods, which fail to enforce both goals simultaneously. Specifically, prior approaches often overlook the need to explicitly preserve the pretrained model's output distribution, and the resulting distributional drift undermines diversity and coherence. To resolve these challenges, we introduce a Lipschitz-based regularization objective that constrains parameter updates during personalization, ensuring bounded deviation from the original distribution. This keeps the personalized model consistent with the pretrained model's behavior while still enabling accurate adaptation to new concepts. Furthermore, our method offers a computationally efficient alternative to commonly used, resource-intensive sampling techniques. Through extensive experiments across diverse diffusion model architectures, we demonstrate that our approach achieves superior performance in both quantitative metrics and qualitative evaluations, consistently excelling in visual fidelity and prompt adherence. We further support these findings with comprehensive analyses, including ablation studies and visualizations.
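The abstract does not spell out the form of the Lipschitz-based regularizer, so the following is only an illustrative sketch under stated assumptions, not the authors' objective. It assumes each adapted layer is linear, with pretrained weights W0 and personalized weights W. In that case the change in the layer's output satisfies ||(W - W0)x|| <= ||W - W0||_2 ||x||, so penalizing the spectral norm of the update W - W0 bounds how far the layer can drift from its pretrained behavior. The names `spectral_norm`, `lipschitz_penalty`, `regularized_loss`, and the weight `lam` are hypothetical.

```python
import numpy as np


def spectral_norm(delta, n_iters=50, seed=0):
    """Estimate the largest singular value of `delta` by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(delta.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iters):
        u = delta @ v
        u_norm = np.linalg.norm(u)
        if u_norm == 0.0:
            # delta maps v to zero; for a zero update the norm is exactly 0.
            return 0.0
        u /= u_norm
        v = delta.T @ u
        v /= np.linalg.norm(v)
    return float(np.linalg.norm(delta @ v))


def lipschitz_penalty(pretrained, personalized):
    """Sum of spectral norms of the per-layer updates W - W0.

    Both arguments are dicts mapping a layer name to a 2-D weight array.
    This is an assumed form of the regularizer, chosen for illustration.
    """
    return sum(
        spectral_norm(personalized[name] - pretrained[name])
        for name in pretrained
    )


def regularized_loss(task_loss, pretrained, personalized, lam=0.1):
    """Personalization loss plus the weighted drift penalty (lam is assumed)."""
    return task_loss + lam * lipschitz_penalty(pretrained, personalized)
```

In this sketch the penalty needs only the two sets of weights, with no extra sampling from the pretrained model, which is one way a parameter-space constraint can be cheaper than sampling-based preservation. A real implementation would compute the penalty inside an autodiff framework so that it contributes gradients during fine-tuning.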