Diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer that is not limited to specific domains. Unfortunately, due to the stochastic nature of diffusion models, it is often difficult to maintain the original content of the image during reverse diffusion. To address this, we present a novel diffusion-based unsupervised image translation method that uses disentangled style and content representations. Specifically, inspired by the splicing Vision Transformer, we extract intermediate keys from the multi-head self-attention layers of a ViT model and use them in a content preservation loss. Image-guided style transfer is then performed by matching the [CLS] classification tokens of the denoised samples and the target image, whereas an additional CLIP loss is used for text-driven style transfer. To further accelerate the semantic change during reverse diffusion, we also propose a novel semantic divergence loss and a resampling strategy. Our experimental results show that the proposed method outperforms state-of-the-art baseline models in both text-guided and image-guided translation tasks.
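To make the split between content and style terms concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the content term compares cosine self-similarity matrices of the key vectors (in the spirit of the splicing ViT work) and that the style term is a mean squared error between [CLS] tokens; the abstract confirms neither choice. All function names and weights are hypothetical, plain Python lists stand in for ViT features, and the CLIP, semantic divergence, and resampling components are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)  # epsilon avoids division by zero

def self_similarity(keys):
    """Pairwise cosine similarity between all patch keys (assumed content descriptor)."""
    return [[cosine(ki, kj) for kj in keys] for ki in keys]

def content_loss(keys_src, keys_out):
    """Mean squared difference between the two self-similarity matrices (assumption)."""
    s_src = self_similarity(keys_src)
    s_out = self_similarity(keys_out)
    n = len(keys_src)
    total = sum((s_src[i][j] - s_out[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)

def style_loss(cls_out, cls_target):
    """Mean squared error between [CLS] tokens (assumed matching criterion)."""
    return sum((a - b) ** 2 for a, b in zip(cls_out, cls_target)) / len(cls_out)

def guidance_loss(keys_src, keys_out, cls_out, cls_target, w_content=1.0, w_style=1.0):
    """Weighted sum of both terms; the weights are hypothetical placeholders."""
    return (w_content * content_loss(keys_src, keys_out)
            + w_style * style_loss(cls_out, cls_target))

if __name__ == "__main__":
    # Toy features: three patch keys of dimension 2, and 2-dimensional [CLS] tokens.
    keys_src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    keys_out = [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]  # rescaled keys, same structure
    print(guidance_loss(keys_src, keys_out, [0.0, 0.0], [2.0, 0.0]))
```

Under these assumptions, the content term depends only on the relations between patches, so rescaling every key leaves it near zero, while the style term reacts only to the global [CLS] token. In a real pipeline, the gradient of such a loss with respect to the denoised sample would steer each reverse diffusion step.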