Diffusion models have become powerful tools for content creation in text-to-image generation. Despite their remarkable capability, existing models still struggle to achieve controlled generation with a consistent style: they either require costly fine-tuning or transfer the visual elements inadequately because of content leakage. To address these challenges, we propose a novel approach, \ours, which produces a diverse range of images while maintaining specific style elements and nuances. During the denoising process, we keep the queries from the original features while swapping the keys and values with those from the reference features in the late self-attention layers. This approach enables visual style prompting without any fine-tuning, ensuring that the generated images faithfully preserve the reference style. In extensive evaluations across various styles and text prompts, our method outperforms existing approaches, best reflecting the style of the references while producing images that match the text prompts most accurately. Our project page is available \href{https://curryjung.github.io/VisualStylePrompt/}{here}.
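The key and value swap described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the list-of-rows feature layout, and the `first_swapped_layer` threshold are all hypothetical, and the abstract does not say which layers count as "late". The sketch only shows the idea that, in a swapped layer, the queries still come from the original features while the keys and values come from the reference features.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; q, k, v are lists of row vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def swapped_self_attention(q_orig, k_orig, v_orig, k_ref, v_ref,
                           layer_idx, first_swapped_layer):
    """Self-attention with an optional key/value swap.

    Assumption: layers with index >= first_swapped_layer are the "late"
    layers. There, the original queries attend to the reference keys
    and values; earlier layers use ordinary self-attention.
    """
    if layer_idx >= first_swapped_layer:
        return attention(q_orig, k_ref, v_ref)
    return attention(q_orig, k_orig, v_orig)
```

Because each output row is a convex combination of value rows, a swapped layer can only mix information taken from the reference values, while the attention pattern is still driven by the original queries.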