Text-guided image generation has advanced rapidly with large-scale diffusion models, yet achieving precise stylization from visual exemplars remains difficult. Existing approaches often depend on task-specific retraining or expensive inversion procedures, which can compromise content integrity, reduce style fidelity, and lead to an unsatisfactory trade-off between semantic prompt adherence and style alignment. In this work, we introduce a training-free framework that reformulates style-guided synthesis as an in-context learning task. Guided by a textual semantic prompt, our method concatenates a reference style image with a masked target image and leverages a pretrained ReFlow-based inpainting model to seamlessly integrate the prompted semantic content with the desired style through multimodal attention fusion. We further analyze the imbalance and noise sensitivity inherent in this fusion and propose a Dynamic Semantic-Style Integration (DSSI) mechanism that reweights attention between textual semantic tokens and visual style tokens, resolving guidance conflicts and improving output coherence. Experiments show that our approach achieves high-fidelity stylization with a superior balance between semantics and style, as well as strong visual quality, offering a simple yet powerful alternative to complex, artifact-prone prior methods.
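Below is a minimal, hypothetical PyTorch sketch of the two ingredients summarized above: concatenating a style reference with a masked target canvas so a pretrained inpainting model can fill the target in context, and reweighting the contributions of text tokens and style tokens inside a fused attention step. All names here (`build_inpainting_input`, `fused_attention`, `lambda_text`, `lambda_style`) are illustrative assumptions rather than the paper's implementation, and the scalar reweighting merely stands in for the dynamic mechanism referred to as DSSI.

```python
# Illustrative sketch only; not the authors' code or the exact DSSI mechanism.
import torch
import torch.nn.functional as F

def build_inpainting_input(style_img: torch.Tensor):
    """Place the style reference next to a blank, masked target canvas.

    style_img: (C, H, W) reference image.
    Returns a (C, H, 2W) joint image and a (1, H, 2W) mask where 1 marks
    the region the inpainting model should synthesize.
    """
    C, H, W = style_img.shape
    canvas = torch.zeros(C, H, W)                       # masked target half
    joint = torch.cat([style_img, canvas], dim=-1)      # side-by-side layout
    mask = torch.zeros(1, H, 2 * W)
    mask[..., W:] = 1.0                                 # synthesize the right half
    return joint, mask

def fused_attention(q, k_text, v_text, k_style, v_style,
                    lambda_text: float = 1.0, lambda_style: float = 1.0):
    """Toy multimodal attention fusion with scalar reweighting of the two
    guidance sources (a simplified stand-in for dynamic reweighting)."""
    d = q.shape[-1]
    logits_text = (q @ k_text.transpose(-2, -1)) / d ** 0.5 \
        + torch.log(torch.tensor(lambda_text))
    logits_style = (q @ k_style.transpose(-2, -1)) / d ** 0.5 \
        + torch.log(torch.tensor(lambda_style))
    attn = F.softmax(torch.cat([logits_text, logits_style], dim=-1), dim=-1)
    values = torch.cat([v_text, v_style], dim=-2)
    return attn @ values

if __name__ == "__main__":
    style = torch.rand(3, 64, 64)
    joint, mask = build_inpainting_input(style)
    q = torch.rand(1, 16, 8)
    out = fused_attention(q, torch.rand(1, 5, 8), torch.rand(1, 5, 8),
                          torch.rand(1, 16, 8), torch.rand(1, 16, 8),
                          lambda_text=1.0, lambda_style=1.5)
    print(joint.shape, mask.shape, out.shape)
```

In this toy form, raising `lambda_style` shifts attention mass toward the style tokens and lowering it favors the text prompt; the abstract's DSSI mechanism adapts this balance dynamically rather than using a fixed scalar.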