Text-driven style transfer aims to merge the style of a reference image with content described by a text prompt. Recent advances in text-to-image models have enabled more nuanced style transformations, yet significant challenges remain, particularly overfitting to reference styles, limited stylistic control, and misalignment with textual content. In this paper, we propose three complementary strategies to address these issues. First, we introduce a cross-modal Adaptive Instance Normalization (AdaIN) mechanism that integrates style and text features more effectively, improving alignment. Second, we develop a Style-based Classifier-Free Guidance (SCFG) approach that enables selective control over stylistic elements, reducing irrelevant influences. Finally, we incorporate a teacher model during the early generation stages to stabilize spatial layouts and mitigate artifacts. Extensive evaluations demonstrate significant improvements in style transfer quality and in alignment with textual prompts. Moreover, our approach can be integrated into existing style transfer frameworks without fine-tuning.
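The cross-modal mechanism above builds on standard Adaptive Instance Normalization, which re-normalizes content features with the channel-wise statistics of a style feature map. A minimal NumPy sketch of that base operation (not the paper's cross-modal variant; array shapes and the `eps` stabilizer are illustrative assumptions):

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive Instance Normalization (Huang & Belongie, 2017).

    content, style: feature maps of shape (C, H, W).
    Each content channel is normalized to zero mean / unit variance,
    then rescaled to match the corresponding style channel's statistics.
    """
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True) + eps  # eps avoids division by zero
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True) + eps
    return s_std * (content - c_mu) / c_std + s_mu
```

In the cross-modal setting described in the abstract, the statistics would be derived from style and text features rather than from a second image feature map alone; the formula itself is unchanged.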