Precise spatial control in diffusion-based style transfer remains challenging. This challenge arises because diffusion models treat style as a global feature and lack explicit spatial grounding of style representations, making it difficult to restrict style application to specific objects or regions. To our knowledge, existing diffusion-based methods cannot perform truly localized style transfer; they typically rely on handcrafted masks or multi-stage post-processing, both of which introduce boundary artifacts and limit generalization. To address this, we propose an attention-supervised diffusion framework that explicitly teaches the model where to apply a given style by aligning the attention scores of style tokens with object masks during training. Two complementary objectives, a Focus loss based on KL divergence and a Cover loss based on binary cross-entropy, jointly encourage accurate localization and dense coverage. A modular LoRA-MoE design further enables efficient and scalable multi-style adaptation. To evaluate localized stylization, we introduce the Regional Style Editing Score, which measures Regional Style Matching via CLIP-based similarity within the target region and Identity Preservation via masked LPIPS and pixel-level consistency on unedited areas. Experiments show that our method achieves mask-free, single-object style transfer at inference, producing regionally accurate and visually coherent results that outperform existing diffusion-based editing approaches.
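A minimal numpy sketch of the two supervision objectives described above, under simplifying assumptions: `attn` stands in for a style token's cross-attention scores flattened over spatial positions, and `mask` for the matching binary object mask. Function names and shapes are illustrative, not the paper's actual implementation, which would operate on per-layer attention maps inside the diffusion backbone.

```python
import numpy as np

def focus_loss(attn, mask, eps=1e-8):
    """KL divergence pulling the attention distribution toward the mask.

    attn: (B, N) non-negative style-token attention scores over N positions.
    mask: (B, N) binary object mask (1 inside the target region).
    """
    p = attn / (attn.sum(axis=-1, keepdims=True) + eps)  # attention as a distribution
    q = mask / (mask.sum(axis=-1, keepdims=True) + eps)  # mask as the target distribution
    return np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1))

def cover_loss(attn, mask, eps=1e-8):
    """Binary cross-entropy pushing attention toward 1 inside the mask, 0 outside,
    encouraging dense coverage of the whole region rather than a few peaks."""
    a = np.clip(attn, eps, 1.0 - eps)
    return np.mean(-(mask * np.log(a) + (1.0 - mask) * np.log(1.0 - a)))

# Toy check: attention concentrated on the masked region scores lower on both losses
# than attention spread uniformly over all positions.
mask = np.array([[1.0, 1.0, 0.0, 0.0]])
inside = np.array([[0.5, 0.5, 0.0, 0.0]])   # all mass inside the mask
uniform = np.array([[0.25, 0.25, 0.25, 0.25]])
print(focus_loss(inside, mask) < focus_loss(uniform, mask))  # True
print(cover_loss(inside, mask) < cover_loss(uniform, mask))  # True
```

The two terms are complementary: the Focus loss alone can be minimized by a sharp peak anywhere inside the mask, while the Cover loss penalizes positions inside the region that receive little attention, so together they reward attention that is both confined to and spread across the target object.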