Subject-driven image customization aims to generate images that not only follow textual instructions but also preserve the identity of a given reference subject. Existing approaches, including test-time fine-tuning, encoder-based methods, and token competition in shared attention spaces, suffer from limited efficiency, misalignment between extracted reference features and the generative process, and interference from irrelevant information. To address these limitations, we formulate the customization task as a distribution shift induced by incorporating reference images into text-to-image generation, and derive a Conditional Attention Distribution Shift formulation grounded in maximum entropy theory. Building on this formulation, we propose CustomShift, a dual-branch architecture based on Stable Diffusion 3. The Reference-Alignment Branch leverages self-attention between reference images and subject names to achieve layer-wise alignment with latent representations, while the Cross-Guidance Branch integrates textual and reference cues to guide generation. Experiments on the DreamBooth and Custom101 benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches, achieving a better balance between semantic fidelity and subject consistency.