Recently, the multimedia community has witnessed the rise of diffusion models trained on large-scale multi-modal data for visual content creation, particularly in text-to-image generation. In this paper, we propose a new task for ``stylizing'' text-to-image models, namely text-driven stylized image generation, which further enhances editability in content creation. Given an input text prompt and a style image, the task is to produce stylized images that are semantically relevant to the text prompt and aligned in style with the style image. To achieve this, we present a new diffusion model (ControlStyle) that upgrades a pre-trained text-to-image model with a trainable modulation network, allowing it to be conditioned on both text prompts and style images. Moreover, diffusion style and content regularizations are introduced simultaneously to facilitate the learning of this modulation network with these diffusion priors, pursuing high-quality stylized text-to-image generation. Extensive experiments demonstrate the effectiveness of ControlStyle in producing more visually pleasing and artistic results, surpassing a simple combination of a text-to-image model and conventional style transfer techniques.