We introduce a diffusion-based cross-domain image translator that requires no paired training data. Unlike GAN-based methods, our approach integrates diffusion models to learn the image translation process, allowing for broader coverage of the data distribution and improved cross-domain translation performance. However, incorporating the translation process into the diffusion process remains challenging, since the two processes are not exactly aligned: the diffusion process operates on the noisy signal, whereas the translation process is conducted on the clean signal. As a result, recent diffusion-based studies employ separate training or shallow integration to learn the two processes, which may trap the translation optimization in local minima and constrain the effectiveness of diffusion models. To address this problem, we propose a novel joint learning framework that aligns the diffusion and translation processes, thereby improving global optimality. Specifically, we extract image components with diffusion models to represent the clean signal and apply the translation process to these components, enabling end-to-end joint learning. In addition, we introduce a time-dependent translation network to learn the complex translation mapping, resulting in effective translation learning and significant performance gains. Benefiting from this joint learning design, our method enables global optimization of both processes, achieving improved fidelity and structural consistency. We conduct extensive experiments on RGB$\leftrightarrow$RGB and diverse cross-modality translation tasks, including RGB$\leftrightarrow$Edge, RGB$\leftrightarrow$Semantics, and RGB$\leftrightarrow$Depth, demonstrating better generative performance than the state of the art.