Image-to-image translation aims to learn a mapping between a source and a target domain, enabling tasks such as style transfer, appearance transformation, and domain adaptation. In this work, we explore a diffusion-based framework for image-to-image translation by adapting Diffusion Transformers (DiT), which combine the denoising capabilities of diffusion models with the global modeling power of transformers. To guide the translation process, we condition the model on image embeddings extracted from a pre-trained CLIP encoder, allowing for fine-grained and structurally consistent translations without relying on text or class labels. We incorporate both a CLIP similarity loss to enforce semantic consistency and an LPIPS perceptual loss to enhance visual fidelity during training. We validate our approach on two benchmark datasets: face2comics, which translates real human faces to comic-style illustrations, and edges2shoes, which translates edge maps to realistic shoe images. Experimental results demonstrate that DiT, combined with CLIP-based conditioning and perceptual similarity objectives, achieves high-quality, semantically faithful translations, offering a promising alternative to GAN-based models for paired image-to-image translation tasks.
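The combined training objective described above can be sketched as a weighted sum of the denoising loss, a CLIP-embedding similarity term, and an LPIPS perceptual distance. This is a minimal illustrative sketch, assuming the CLIP embeddings and LPIPS distance are computed elsewhere; the function names and the weights `lambda_clip` and `lambda_lpips` are hypothetical placeholders, not the values used in the paper.

```python
import math

def cosine_similarity(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def combined_loss(diffusion_loss, clip_src, clip_out, lpips_distance,
                  lambda_clip=0.1, lambda_lpips=0.5):
    # Total objective: the DiT denoising loss plus a semantic term
    # (1 - cosine similarity between CLIP embeddings of source and
    # output) and a perceptual term (LPIPS distance). The weights
    # are illustrative assumptions, not values from the paper.
    clip_loss = 1.0 - cosine_similarity(clip_src, clip_out)
    return diffusion_loss + lambda_clip * clip_loss + lambda_lpips * lpips_distance
```

In practice the CLIP embeddings would come from a frozen pre-trained image encoder and the LPIPS distance from a pre-trained perceptual network; only the weighted combination is shown here.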