Unsupervised image-to-image translation methods based on contrastive learning have recently achieved state-of-the-art results on many tasks. However, previous work samples the negatives from the input image itself, which motivates us to design a data augmentation method that improves the quality of the selected negatives. Moreover, while previous methods retain content similarity through patch-wise contrastive learning in the embedding space, they ignore the domain consistency between the generated image and the real images of the target domain. In this paper, we propose a novel unsupervised image-to-image translation framework based on multi-crop contrastive learning and domain consistency, called MCDUT. Specifically, we obtain multi-crop views via center-crop and random-crop operations and use them to generate the negatives, which improves their quality. To constrain the embeddings in the deep feature space, we formulate a new domain consistency loss, which encourages the generated images to be close to the real images in the embedding space of the same domain. Furthermore, we present a dual coordinate attention network, called DCA, which embeds positional information into channel attention. We employ the DCA network in the generator so that it captures global dependencies along the horizontal and vertical directions. On many image-to-image translation tasks, our method achieves state-of-the-art results, and extensive comparison experiments and ablation studies demonstrate its advantages.
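As a rough illustration of the multi-crop idea, the following minimal sketch builds one center crop and several random crops of an image and evaluates a standard InfoNCE-style contrastive loss for a single query. The abstract does not specify how MCDUT encodes the crops or feeds them into the loss, so the function names (`center_crop`, `random_crop`, `multi_crop_views`, `info_nce`), the list-of-lists image representation, the temperature `tau`, and the use of InfoNCE here are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def center_crop(img, size):
    """Return the size x size window centered in a 2D list-of-lists image."""
    h, w = len(img), len(img[0])
    top, left = (h - size) // 2, (w - size) // 2
    return [row[left:left + size] for row in img[top:top + size]]


def random_crop(img, size, rng):
    """Return a size x size window at a uniformly sampled position."""
    h, w = len(img), len(img[0])
    top = rng.randint(0, h - size)    # randint is inclusive on both ends
    left = rng.randint(0, w - size)
    return [row[left:left + size] for row in img[top:top + size]]


def multi_crop_views(img, size, n_random, rng):
    """One center crop followed by n_random random crops (hypothetical helper)."""
    views = [center_crop(img, size)]
    views += [random_crop(img, size, rng) for _ in range(n_random)]
    return views


def info_nce(query, positive, negatives, tau=0.07):
    """InfoNCE loss for one query embedding against one positive and a list
    of negatives. Embeddings are plain lists, assumed L2-normalized."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive) / tau]
    logits += [dot(query, n) / tau for n in negatives]
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]  # equals -log softmax(logits)[0]


# Toy 8x8 "image" whose pixel value encodes its position.
img = [[r * 8 + c for c in range(8)] for r in range(8)]
views = multi_crop_views(img, size=4, n_random=3, rng=random.Random(0))
```

In a full pipeline, patch embeddings extracted from the extra views would presumably enlarge the pool passed as `negatives`; that wiring is likewise an assumption and is left out of the sketch.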