Personalized image generation aims to faithfully preserve a reference subject's identity while adapting to diverse text prompts. Existing optimization-based methods ensure high fidelity but are computationally expensive, while learning-based approaches offer efficiency at the cost of entangled representations influenced by nuisance factors. We introduce SpotDiff, a novel learning-based method that extracts subject-specific features by spotting and disentangling interference. Leveraging a pre-trained CLIP image encoder and specialized expert networks for pose and background, SpotDiff isolates subject identity through orthogonality constraints in the feature space. To enable principled training, we introduce SpotDiff10k, a curated dataset with consistent pose and background variations. Experiments demonstrate that SpotDiff achieves more robust subject preservation and more controllable editing than prior methods, while attaining competitive performance with only 10k training samples.
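For illustration, below is a minimal sketch of what an orthogonality constraint between the subject branch and the pose/background expert branches might look like in PyTorch. The function name, feature shapes, and the squared-cosine penalty are assumptions for exposition, not the paper's actual implementation.

```python
# Hypothetical sketch: penalize overlap between the subject embedding and
# each nuisance (pose/background) embedding via squared cosine similarity.
# Names and shapes are illustrative assumptions, not SpotDiff's exact code.
import torch
import torch.nn.functional as F

def orthogonality_loss(subject_feat, nuisance_feats):
    """subject_feat: (B, D); nuisance_feats: list of (B, D) tensors."""
    s = F.normalize(subject_feat, dim=-1)
    loss = subject_feat.new_zeros(())
    for n in nuisance_feats:
        n = F.normalize(n, dim=-1)
        # Cosine similarity of unit vectors, squared so the penalty is
        # minimized when the subject and nuisance features are orthogonal.
        loss = loss + (s * n).sum(dim=-1).pow(2).mean()
    return loss

# Toy usage with random features standing in for the CLIP image encoder
# output (subject) and the pose/background expert outputs.
B, D = 4, 768
subject = torch.randn(B, D)
pose = torch.randn(B, D)
background = torch.randn(B, D)
print(orthogonality_loss(subject, [pose, background]))
```

A penalty of this form would be added to the usual diffusion training objective, driving the subject features away from the subspaces captured by the pose and background experts.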