Although arbitrary image-guided style transfer methods achieve impressive results, text-driven image stylization has recently been proposed to transfer a natural image into a stylized one according to a user-provided textual description of the target style. Unlike previous image-to-image transfer approaches, text-guided stylization gives users a more precise and intuitive way to express the desired style. However, the large discrepancy between cross-modal inputs and outputs makes it challenging to conduct text-driven image stylization in a typical feed-forward CNN pipeline. In this paper, we present DiffStyler, a dual diffusion processing architecture that controls the balance between the content and style of the diffused results. Cross-modal style information can be integrated as guidance step by step during the diffusion process. Furthermore, we propose a learnable noise based on the content image, which serves as the starting point of the reverse denoising process and enables the stylization results to better preserve the structure information of the content image. Through extensive qualitative and quantitative experiments, we validate that the proposed DiffStyler outperforms the baseline methods. Code is available at \url{https://github.com/haha-lisa/Diffstyler}.
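To make the two central ideas of the abstract more concrete, the following is a minimal sketch and not the DiffStyler implementation. It assumes that a content-dependent starting noise can be approximated by the standard closed-form DDPM forward step applied to the content image, and that the content/style balance can be illustrated as a convex combination of two denoiser outputs. In the paper the noise is learnable, whereas here it is a fixed formula. The function names `content_based_noise` and `blend_predictions`, the `alpha_bar` parameter, and the `weight` knob are all hypothetical and do not come from the released code.

```python
import math
import random


def content_based_noise(content, alpha_bar, seed=0):
    """Noise a content image with the closed-form DDPM forward step.

    Computes sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps with eps ~ N(0, 1).
    Assumption: this fixed formula stands in for the learnable,
    content-image-based noise described in the abstract.
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    rng = random.Random(seed)  # seeded for reproducible noise
    keep = math.sqrt(alpha_bar)
    spread = math.sqrt(1.0 - alpha_bar)
    return [keep * x + spread * rng.gauss(0.0, 1.0) for x in content]


def blend_predictions(pred_a, pred_b, weight):
    """Convex combination of two denoiser outputs.

    Assumption: `weight` is a hypothetical knob illustrating a
    content/style balance between two diffusion branches.
    """
    if len(pred_a) != len(pred_b):
        raise ValueError("predictions must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return [weight * a + (1.0 - weight) * b for a, b in zip(pred_a, pred_b)]


if __name__ == "__main__":
    pixels = [0.2, -0.4, 0.9]  # a flattened toy "image" in [-1, 1]
    start = content_based_noise(pixels, alpha_bar=0.3)
    mixed = blend_predictions(start, pixels, weight=0.5)
    print(start, mixed)
```

With `alpha_bar` close to 1 the starting point stays near the content image, which is the intuition behind structure preservation. With `alpha_bar` close to 0 it approaches pure Gaussian noise.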